I cut my AWS bill by 93% by ditching Fargate for a single Lightsail VM

TL;DR

I built ToolMango, an AI tools directory, on AWS Fargate. The bill came back at $345/mo before traffic. I migrated to a single $12 Lightsail VM in an afternoon and cut costs by 93% while keeping the same Next.js + Postgres + Redis + BullMQ stack alive.

Here's exactly what I changed, what broke, and what I'd do differently.


What ToolMango is (so the cost numbers make sense)

ToolMango is an editorial directory of AI tools. It scores tools on an ROI Score (cost, time-to-value, output quality, free-tier generosity, category fit, reader engagement) and ranks them — before knowing whether the tool has an affiliate program. Tools we don't earn from frequently outrank tools we do.

Tech stack:

  • Next.js 14 App Router
  • Postgres 16
  • Redis (BullMQ for the agent job queue)
  • Anthropic Claude Sonnet for editorial agents (research, SEO sweep, social drafts)
  • A worker process running 5 cron schedules

Pre-revenue. Brand new domain. ~106 tools indexed at the time of writing.

The original Fargate setup

I started on AWS because I had CDK boilerplate from another project. The architecture was over-engineered for a directory site getting zero traffic:

CloudFront → ALB → Fargate (web ×2 tasks, worker ×1)
                ↓
          Aurora Serverless v2 (writer)
          ElastiCache (Redis, t4g.small ×2)
          NAT ×2 (multi-AZ)
          VPC + interface endpoints
          WAF (managed rule sets)

The CDK code is clean. It deploys with one command. It autoscales. It survives an AZ failure. It's exactly what a series-A SaaS would run.

It's also $345/mo for zero users.

What was actually costing money

I broke it down with aws ce get-cost-and-usage and a few aws ecs describe-task-definition calls.
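A minimal sketch of the cost query, grouped by service (the date range and the sort at the end are my additions, not from the original setup):

aws ce get-cost-and-usage \
  --time-period Start=2025-01-01,End=2025-02-01 \
  --granularity MONTHLY \
  --metrics UnblendedCost \
  --group-by Type=DIMENSION,Key=SERVICE \
  --query 'ResultsByTime[].Groups[].[Keys[0],Metrics.UnblendedCost.Amount]' \
  --output text | sort -t$'\t' -k2 -rn

The breakdown: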

Resource                                                                  $/mo
Aurora Serverless v2 (no auto-pause, 0.5 ACU min)                          $86
Fargate ARM64 (3 tasks: 2× web at 1 vCPU/2 GB + 1× worker at 0.5/1 GB)     $71
2× NAT gateways (multi-AZ)                                                 $65
VPC interface endpoints (Secrets Manager × 3 AZs + others)                 $40
ALB + WAF                                                                  $34
CloudWatch + Container Insights                                            $15
Public IPv4 charges                                                        $15
ElastiCache (cache.t4g.small × 2 nodes)                                    $11
Misc (CloudFront, Secrets, Route53, S3)                                     $8
Total                                                                     $345

The killer insight: about $150/mo of that bill is "infrastructure plumbing" (NAT, VPC endpoints, ALB + WAF, ElastiCache). None of it is doing real work for the application. It's all there to support the architecture itself.

That's the floor on a Fargate setup. For a pre-revenue project, it's nuts.

Phase 1: Skeleton mode on AWS

Before migrating, I tried to make Fargate cheap. CDK changes I shipped:

// Aurora: enable auto-pause when idle
const cfnCluster = cluster.node.defaultChild as rds.CfnDBCluster;
cfnCluster.serverlessV2ScalingConfiguration = {
  minCapacity: 0,        // was 0.5 — auto-pause after 5 min idle
  maxCapacity: 2,        // was 4
  secondsUntilAutoPause: 300,
};

// Network: 1 NAT instead of 2
natGateways: 1,           // was 2 (multi-AZ)

// Web: smaller, fewer tasks, autoscale up if needed
desiredCount: 1,          // was 2
cpu: 512,                 // was 1024
memoryLimitMiB: 1024,     // was 2048

// Worker on Fargate Spot
capacityProviderStrategies: [
  { capacityProvider: "FARGATE_SPOT", weight: 4 },
  { capacityProvider: "FARGATE", weight: 1 },
],

// Container Insights off
containerInsightsV2: ecs.ContainerInsights.DISABLED,

// Backup retention
backup: { retention: cdk.Duration.days(1) },  // was 14

// WAF: removed entirely (CloudFront has free Shield Standard)

Result: $345/mo → ~$140/mo. Better, but still ridiculous for a pre-revenue project.

The reason it stopped at ~$140: NAT, ALB, ElastiCache, VPC endpoints, and Aurora storage all have hard floors. A NAT gateway alone is $0.045/hour, roughly $33/mo before data processing, which is how two of them became the $65 line above. You can't make Fargate genuinely cheap because the architecture itself isn't designed for cheap.

Phase 2: The honest migration

Lightsail is AWS's "give me a Linux VM and stop overthinking it" tier. $12/mo for 2 vCPU, 2GB RAM, 60GB SSD, 3TB transfer — and it includes a static IP and a firewall.

The plan: run everything on one VM in Docker Compose.
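Provisioning is a couple of CLI calls. A sketch, where the instance name, the AZ, and the Ubuntu 22.04 blueprint are my assumptions (small_3_0 is the $12 bundle):

aws lightsail create-instances \
  --instance-names toolmango-prod \
  --availability-zone us-east-1a \
  --blueprint-id ubuntu_22_04 \
  --bundle-id small_3_0

# Static IP, so DNS survives instance stops/restarts
aws lightsail allocate-static-ip --static-ip-name toolmango-ip
aws lightsail attach-static-ip --static-ip-name toolmango-ip --instance-name toolmango-prod

# Open HTTPS in the Lightsail firewall (22 and 80 are open by default)
aws lightsail open-instance-public-ports \
  --instance-name toolmango-prod \
  --port-info fromPort=443,toPort=443,protocol=TCP

With the VM up, the whole stack is one compose file: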

services:
  postgres:
    image: postgres:16-alpine
    volumes: [./data/postgres:/var/lib/postgresql/data]
    healthcheck:   # required: depends_on condition service_healthy needs a healthcheck to exist
      test: ["CMD-SHELL", "pg_isready -U tmadmin -d toolmango"]
      interval: 5s
      retries: 10
    deploy: { resources: { limits: { memory: 512M } } }

  redis:
    image: redis:7-alpine
    command: redis-server --appendonly yes --maxmemory 128mb --maxmemory-policy noeviction
    volumes: [./data/redis:/data]
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]
      interval: 5s
      retries: 10
    deploy: { resources: { limits: { memory: 192M } } }

  web:
    image: tm-web:latest
    ports: ["127.0.0.1:3000:3000"]
    env_file: .env
    depends_on:
      postgres: { condition: service_healthy }
      redis: { condition: service_healthy }
    deploy: { resources: { limits: { memory: 768M } } }

  worker:
    image: tm-worker:latest
    env_file: .env
    depends_on:
      postgres: { condition: service_healthy }
      redis: { condition: service_healthy }
    deploy: { resources: { limits: { memory: 384M } } }
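First boot is the usual dance:

docker compose up -d
docker compose ps    # postgres and redis should report healthy, web and worker running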

For HTTPS termination: Caddy, which auto-issues Let's Encrypt certs for the domains in its config. Configuration is one stanza:

toolmango.com, www.toolmango.com {
    reverse_proxy 127.0.0.1:3000
    encode gzip zstd
    header {
        Strict-Transport-Security "max-age=31536000; includeSubDomains; preload"
        X-Content-Type-Options "nosniff"
    }
}

Caddy reloads, Caddy gets the cert. Total setup time: 30 seconds.
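A few commands worth knowing here, assuming Caddy runs on the host under systemd (which the 127.0.0.1 port binding in the compose file suggests):

# Validate the Caddyfile before reloading
caddy validate --config /etc/caddy/Caddyfile

# Zero-downtime reload; Caddy fetches certs for any new domains
sudo systemctl reload caddy

# Watch the ACME issuance happen live
journalctl -u caddy -f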

Migrating Aurora data to local Postgres

Aurora sits in a private subnet (PRIVATE_ISOLATED), so I couldn't pg_dump from outside. The workaround: spin up a one-off ECS Fargate task inside the existing VPC that runs pg_dump and uploads the result to S3.

aws ecs run-task \
  --cluster tm-prod-compute \
  --task-definition tm-prod-pgdump \
  --launch-type FARGATE \
  --network-configuration 'awsvpcConfiguration={subnets=[subnet-...],securityGroups=[sg-...],assignPublicIp=DISABLED}'

The task definition uses postgres:16-alpine, installs aws-cli on the fly, and runs:

pg_dump --no-owner --no-acl --clean --if-exists -h $DB_HOST -U $DB_USER -d toolmango \
  | gzip > /tmp/dump.sql.gz \
  && aws s3 cp /tmp/dump.sql.gz s3://tm-prod-assets/migration/dump.sql.gz
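The "on the fly" install is a one-liner prepended to that pipeline (a sketch; the exact container command isn't shown in the post):

# postgres:16-alpine is Alpine-based, so aws-cli comes from the community repo
apk add --no-cache aws-cli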

On the Lightsail VM, pull the dump from S3, gunzip, and pipe it into the local Postgres container. Lightsail VMs don't get IAM roles by default, so a presigned URL handles the download.
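A sketch of the fetch, run in two places:

# On any machine with AWS credentials: mint a 15-minute download URL
URL=$(aws s3 presign s3://tm-prod-assets/migration/dump.sql.gz --expires-in 900)

# On the Lightsail VM: fetch it
curl -fsSL -o /tmp/dump.sql.gz "$URL"

Then restore into the container: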

gunzip -c /tmp/dump.sql.gz | docker compose exec -T postgres psql -U tmadmin -d toolmango

64 published tools transferred cleanly. ~485KB of data total (it's a directory site — barely any data).

Building images on the VM

Lightsail is x86_64. Fargate was ARM64. So I had to rebuild for x86 anyway, which is a perfect excuse to build directly on the VM and skip the registry-push dance:

docker build --network=host -f Dockerfile.web -t tm-web:latest \
  --build-arg NEXT_PUBLIC_SITE_URL=https://toolmango.com \
  --build-arg NEXT_PUBLIC_PLAUSIBLE_DOMAIN=toolmango.com \
  .

Next.js builds need ~1.5-2GB peak memory. Lightsail's "small_3_0" has 2GB RAM. Tight, but adding 2GB swap solved it:

sudo fallocate -l 2G /swapfile
sudo chmod 600 /swapfile && sudo mkswap /swapfile && sudo swapon /swapfile
echo "/swapfile none swap sw 0 0" | sudo tee -a /etc/fstab

First build: ~6 min. Subsequent builds with Docker layer cache: ~2 min. Acceptable.

Tearing down AWS Fargate

After Lightsail was serving traffic, I tore down the Fargate stacks via CDK:

# Take final Aurora snapshot first (safety rollback)
aws rds create-db-cluster-snapshot --db-cluster-identifier ... --db-cluster-snapshot-identifier tm-prod-final-...

# Disable deletion protection
aws rds modify-db-cluster --no-deletion-protection ...

# Delete Aurora cluster + writer
aws rds delete-db-instance --skip-final-snapshot ...
aws rds delete-db-cluster --skip-final-snapshot ...

# CDK destroy stacks in reverse dependency order
cdk destroy tm-prod-edge --force      # CloudFront, WAF
cdk destroy tm-prod-compute --force   # Fargate, ALB, ECS cluster
cdk destroy tm-prod-data --force      # ElastiCache (S3 retains via RemovalPolicy.RETAIN)
cdk destroy tm-prod-network --force   # VPC, NAT, subnets

CloudFront's destroy is the slowest — disabling a distribution then deleting it takes 15-20 min. Aurora delete is 5-10 min. Compute and network are 3-5 min each.

Total teardown: ~30-40 min unattended.
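Afterwards, it's worth a sanity check that nothing survived to keep billing (a sketch; adjust the stack-name prefix to your naming):

aws cloudformation list-stacks \
  --stack-status-filter CREATE_COMPLETE UPDATE_COMPLETE \
  --query "StackSummaries[?starts_with(StackName, 'tm-prod')].StackName"

An empty list means the CDK stacks are genuinely gone.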

Auto-deploy via GitHub Actions

The piece that ties it all together: a workflow that, on every push to main, rsyncs the source to the VM, rebuilds the images there, runs Prisma migrations, and restarts the containers:

- name: Install SSH key
  # Nothing else creates ~/.ssh/id_ed25519, so the key must land first;
  # the secret name LIGHTSAIL_SSH_KEY is an assumption
  run: |
    install -m 700 -d ~/.ssh
    echo "${{ secrets.LIGHTSAIL_SSH_KEY }}" > ~/.ssh/id_ed25519
    chmod 600 ~/.ssh/id_ed25519
    ssh-keyscan -H ${{ secrets.LIGHTSAIL_HOST }} >> ~/.ssh/known_hosts

- name: Rsync source to Lightsail
  run: |
    rsync -az --delete --exclude='node_modules' --exclude='.next' --exclude='.git' \
      -e "ssh -i ~/.ssh/id_ed25519" \
      ./ ${{ secrets.LIGHTSAIL_USER }}@${{ secrets.LIGHTSAIL_HOST }}:/home/ubuntu/toolmango/src/

- name: Build images on Lightsail
  run: |
    ssh ... 'cd /home/ubuntu/toolmango/src && \
      sg docker -c "docker build -f Dockerfile.web -t tm-web:latest ." && \
      sg docker -c "docker build -f Dockerfile.worker -t tm-worker:latest ."'

- name: Run prisma migrate + restart services
  run: |
    ssh ... 'cd /home/ubuntu/toolmango && \
      sg docker -c "docker compose run --rm --no-deps web npx prisma migrate deploy" && \
      sg docker -c "docker compose up -d --force-recreate web worker"'

- name: Smoke test
  run: |
    for i in {1..6}; do
      [ "$(curl -s -o /dev/null -w '%{http_code}' https://toolmango.com/api/healthz)" = "200" ] && exit 0
      sleep 5
    done
    exit 1

First successful auto-deploy: 3m50s end-to-end. From git push to verified 200 OK from Caddy.

What it costs now

Resource                                                           $/mo
Lightsail small_3_0 (2 vCPU, 2 GB RAM, 60 GB SSD, 3 TB transfer)    $12
S3 (assets bucket, kept)                                             $1
Route53 (zone + queries)                                             $1
Secrets Manager (kept as a credentials backup, ~10 secrets)          $4
Anthropic API (Claude Sonnet for editorial agents)                $5–15
Total                                                            ~$23–33

That's a 93% cut from $345/mo. Same site, same functionality, same automation pipeline. The site is at https://toolmango.com if you want to verify it's actually working.

What I gave up

Honest list of what's worse on Lightsail:

  1. No multi-AZ HA. Single VM = single point of failure. AZ-level outage means downtime.
  2. No Aurora point-in-time restore. Just a nightly pg_dump to S3 (sketched after this list). RPO is up to 24h. Acceptable for a content site, not for transactional data.
  3. No autoscaling. Vertical only — bump to a bigger Lightsail bundle if traffic grows. The next tier is $24/mo for 4GB / 80GB. Past that, $44/mo for 8GB. At those numbers you should rethink Lightsail vs going back to managed services.
  4. Manual ops. No service auto-restart on host failure. If the VM dies, I get notified by uptime monitor and SSH in. That's the trade.
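The nightly dump from item 2 is just a cron entry and a tiny script. A sketch, where the paths and the backups/ prefix are assumptions, and the VM needs an AWS access key configured (no IAM roles on Lightsail, as noted above):

# /etc/cron.d/toolmango-backup (the user field is required in /etc/cron.d)
0 3 * * * ubuntu /home/ubuntu/toolmango/backup.sh

# backup.sh
#!/usr/bin/env bash
set -euo pipefail
cd /home/ubuntu/toolmango
docker compose exec -T postgres pg_dump -U tmadmin --no-owner toolmango \
  | gzip \
  | aws s3 cp - "s3://tm-prod-assets/backups/$(date +%F).sql.gz"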

What I'd do differently

  1. Skip Fargate entirely for pre-revenue projects. Start on Lightsail. The migration was 4 hours; if I'd started there, that's 4 hours and roughly $70 of avoided bills (the 6 days I was on Fargate at $345/mo).
  2. Don't enable Container Insights "just because." It's $5-15/mo and you'll never look at it on a small project.
  3. Don't let CDK enable WAF by default. WAF is real money ($12-15/mo) for a pre-revenue site that's not under attack. CloudFront's free Shield Standard is enough.
  4. Don't pre-provision multi-AZ NAT. Single NAT is fine until you have customers.
  5. Use Aurora minCapacity: 0 from day 1. The auto-pause feature added in 2024 makes Aurora Serverless v2 actually serverless. Most CDK examples still default to 0.5.

When to migrate back to Fargate

The CDK code is still in the repo. cdk deploy --all brings the production-grade Fargate stack back up; restore the latest Lightsail backup into a fresh Aurora cluster; cut over DNS.

I'll do that when any of these hits:

  • Sustained traffic > 200 req/sec (single VM saturates)
  • Need for multi-AZ HA (revenue at risk from single AZ outage)
  • DB > 20 GB (Postgres on local SSD becomes risky for backup/recovery)
  • Compliance requirement (SOC 2 etc.)

Until then, $25/mo. The CDK code waits patiently.

The site

If you want to look at what this stack actually serves: ToolMango. The methodology behind the editorial ROI score is at /about, the affiliate disclosure is at /disclosure.

The writeup of the migration is in docs/lightsail-migration.md in the repo if you want the step-by-step. Happy to answer questions in the comments.


If you found this useful and you've done a similar AWS-to-cheap-VM migration, I'd love to hear what you cut and what burned.
