Suhas Mallesh
Stop Paying Full Price: Spot VMs Can Cut Your GCP Compute Bill by 91% 💰

That e2-standard-4 running your batch jobs costs $0.134/hour. The same VM as a Spot instance? $0.040/hour. That's 70% off for changing 4 lines of Terraform. If your workload can handle interruptions, you're throwing money away running on-demand.

GCP Spot VMs use excess Compute Engine capacity at 60-91% discounts compared to on-demand pricing. The catch? Google can reclaim them at any time with just 30 seconds notice. But here's the thing: most dev environments, batch jobs, CI/CD pipelines, and data processing workloads don't need guaranteed uptime. They just need cheap compute.

Let's set them up properly with Terraform so you save thousands without losing sleep.

📊 Spot vs On-Demand: The Real Numbers

| Machine Type | On-Demand/hr | Spot/hr | Savings | Monthly Savings (24/7) |
| --- | --- | --- | --- | --- |
| e2-medium (2 vCPU, 4GB) | $0.067 | $0.020 | 70% | ~$34 |
| e2-standard-4 (4 vCPU, 16GB) | $0.134 | $0.040 | 70% | ~$68 |
| n2-standard-8 (8 vCPU, 32GB) | $0.388 | $0.117 | 70% | ~$196 |
| c2-standard-16 (16 vCPU, 64GB) | $0.835 | $0.250 | 70% | ~$421 |
| GPU (T4) | $0.35 | $0.11 | 69% | ~$173 |

Multiply that by 10 dev VMs and you're looking at $2,000+/month saved with zero architecture changes.

⚠️ Spot pricing is dynamic and varies by region. These are approximate US region prices. Always check the GCP pricing calculator for current rates.
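If you want to sanity-check the monthly column yourself, it's just the hourly delta times roughly 730 hours in a month. A quick shell sketch (awk does the arithmetic since plain sh has no float math):

```shell
#!/bin/sh
# Sanity-check the table's monthly column:
# monthly savings = (on-demand rate - spot rate) * ~730 hours/month.
# Rates are the e2-standard-4 figures from the table above.
awk -v od=0.134 -v spot=0.040 \
  'BEGIN { printf "e2-standard-4: ~$%.2f/month saved\n", (od - spot) * 730 }'
# prints: e2-standard-4: ~$68.62/month saved
```

Swap in your own machine type's rates to estimate your fleet's savings before touching any Terraform.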

🔧 Step 1: Convert Any VM to Spot with 4 Lines

The magic is in the scheduling block. Here's how a standard VM becomes a Spot VM:

```hcl
resource "google_compute_instance" "worker" {
  name         = "batch-worker-01"
  machine_type = "e2-standard-4"
  zone         = "us-central1-a"

  boot_disk {
    initialize_params {
      image = "debian-cloud/debian-12"
    }
  }

  # 👇 These 4 lines save you 60-91%
  scheduling {
    preemptible                 = true
    automatic_restart           = false
    provisioning_model          = "SPOT"
    instance_termination_action = "STOP"  # or "DELETE"
  }

  network_interface {
    network    = "default"
    subnetwork = var.subnet_id
  }

  labels = local.common_labels
}
```

⚠️ Gotcha: You need ALL of preemptible = true, automatic_restart = false, AND provisioning_model = "SPOT". Missing one will either fail validation or quietly give you something you didn't intend — preemptible = true on its own creates a legacy preemptible VM with a 24-hour runtime limit, not a Spot VM. The instance_termination_action controls what happens when Google reclaims the VM: STOP preserves the boot disk so you can restart later, DELETE destroys everything.
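For a quick one-off experiment outside Terraform, the same Spot VM can be created with gcloud. A sketch, assuming gcloud is authenticated and a default project is configured — adjust the name, zone, and image to your setup:

```shell
# Sketch: one-off Spot VM via gcloud instead of Terraform.
gcloud compute instances create batch-worker-01 \
  --machine-type=e2-standard-4 \
  --zone=us-central1-a \
  --image-family=debian-12 \
  --image-project=debian-cloud \
  --provisioning-model=SPOT \
  --instance-termination-action=STOP \
  --no-restart-on-failure
```

Handy for verifying Spot capacity exists in a zone before you commit the Terraform change.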

🔄 Step 2: Auto-Recover with Managed Instance Groups

A single Spot VM that gets preempted stays down. A Managed Instance Group (MIG) automatically recreates it when capacity comes back. This is the production-ready pattern.

First, create an instance template:

```hcl
resource "google_compute_instance_template" "spot_worker" {
  name_prefix  = "spot-worker-"
  machine_type = "e2-standard-4"
  region       = var.region

  disk {
    source_image = "debian-cloud/debian-12"
    auto_delete  = true
    boot         = true
    disk_size_gb = 20
  }

  scheduling {
    preemptible                 = true
    automatic_restart           = false
    provisioning_model          = "SPOT"
    instance_termination_action = "STOP"
  }

  network_interface {
    network    = var.network_id
    subnetwork = var.subnet_id
  }

  # Shutdown script: save state before Google pulls the plug
  metadata = {
    shutdown-script = <<-EOF
      #!/bin/bash
      echo "$(date): Spot VM preempted, saving state..." >> /var/log/preemption.log
      # Add your checkpoint/save logic here
      # e.g., gsutil cp /tmp/progress.json gs://my-bucket/checkpoints/
    EOF
  }

  labels = local.common_labels

  lifecycle {
    create_before_destroy = true
  }
}
```

Then, create the MIG with multi-zone distribution and auto-healing:

```hcl
resource "google_compute_region_instance_group_manager" "spot_workers" {
  name               = "spot-worker-mig"
  base_instance_name = "spot-worker"
  region             = var.region

  version {
    instance_template = google_compute_instance_template.spot_worker.id
  }

  target_size = var.desired_instances

  # Spread across zones for better Spot availability
  distribution_policy_zones = [
    "us-central1-a",
    "us-central1-b",
    "us-central1-c",
  ]

  update_policy {
    type                  = "PROACTIVE"
    minimal_action        = "REPLACE"
    max_surge_fixed       = 3
    max_unavailable_fixed = 0
  }

  auto_healing_policies {
    health_check      = google_compute_health_check.worker.id
    initial_delay_sec = 120
  }
}

resource "google_compute_health_check" "worker" {
  name                = "spot-worker-health"
  check_interval_sec  = 30
  timeout_sec         = 10
  healthy_threshold   = 2
  unhealthy_threshold = 3

  tcp_health_check {
    port = 8080
  }
}
```

Now when Google preempts a Spot VM, the MIG automatically requests a new one. No manual intervention. No pager alerts at 3 AM. 🎉
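One thing the MIG above does not do is scale with load: target_size is fixed. If you also want load-based scaling, attach a separate autoscaler resource to the MIG. A minimal sketch — the replica bounds and the 70% CPU target are illustrative assumptions, and once an autoscaler manages the group you should drop target_size from the MIG (or ignore changes to it via lifecycle):

```hcl
# Sketch: CPU-based regional autoscaler for the Spot MIG above.
# min/max replicas and the CPU target are illustrative, not prescriptive.
resource "google_compute_region_autoscaler" "spot_workers" {
  name   = "spot-worker-autoscaler"
  region = var.region
  target = google_compute_region_instance_group_manager.spot_workers.id

  autoscaling_policy {
    min_replicas    = 2
    max_replicas    = 10
    cooldown_period = 60

    cpu_utilization {
      target = 0.7
    }
  }
}
```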

🎯 Step 3: Mix Spot and On-Demand for Production

For production workloads, run a baseline on on-demand and burst on Spot. This gives you guaranteed capacity plus cheap scaling:

```hcl
# Baseline: always-on, on-demand instances
resource "google_compute_instance_template" "ondemand_baseline" {
  name_prefix  = "baseline-"
  machine_type = "e2-standard-4"

  disk {
    source_image = "debian-cloud/debian-12"
    auto_delete  = true
    boot         = true
  }

  scheduling {
    provisioning_model = "STANDARD"  # On-demand
    automatic_restart  = true
  }

  network_interface {
    network    = var.network_id
    subnetwork = var.subnet_id
  }

  labels = merge(local.common_labels, { tier = "baseline" })
}

# Burst: cheap Spot instances for extra capacity
resource "google_compute_instance_template" "spot_burst" {
  name_prefix  = "burst-"
  machine_type = "e2-standard-4"

  disk {
    source_image = "debian-cloud/debian-12"
    auto_delete  = true
    boot         = true
  }

  scheduling {
    preemptible                 = true
    automatic_restart           = false
    provisioning_model          = "SPOT"
    instance_termination_action = "DELETE"
  }

  network_interface {
    network    = var.network_id
    subnetwork = var.subnet_id
  }

  labels = merge(local.common_labels, { tier = "burst" })
}
```

Your load balancer routes to both pools. When traffic spikes, autoscaling adds cheap Spot VMs. When Google reclaims them, the on-demand baseline keeps serving.

🧠 What Works on Spot (and What Doesn't)

| Workload | Spot? | Why |
| --- | --- | --- |
| Dev/staging environments | Yes | Nobody cares if dev goes down for 2 minutes |
| CI/CD build runners | Yes | Builds restart automatically |
| Batch data processing | Yes | Checkpoint and resume |
| ML training (with checkpoints) | Yes | Save model every N epochs |
| Stateless web frontends (in MIG) | Yes | MIG replaces preempted VMs instantly |
| Production databases | No | Data loss risk, needs guaranteed uptime |
| Stateful singleton services | No | Can't handle random restarts |
| Real-time payment processing | No | 30 seconds of downtime = lost revenue |

🛡️ Step 4: The Shutdown Script Safety Net

You get 30 seconds before Google terminates a Spot VM. Use that time wisely:

```bash
#!/bin/bash
# shutdown-script for Spot VMs

# Save application state
echo "Preemption detected at $(date)" >> /var/log/preemption.log

# Drain connections gracefully
if systemctl is-active --quiet nginx; then
  /usr/sbin/nginx -s quit
fi

# Upload checkpoint to GCS
# BUCKET must be set in the environment (e.g., from instance metadata)
CHECKPOINT_FILE="/tmp/job-checkpoint.json"
if [ -f "$CHECKPOINT_FILE" ]; then
  gsutil cp "$CHECKPOINT_FILE" "gs://${BUCKET}/checkpoints/$(hostname)-$(date +%s).json"
fi

# Deregister from service discovery
curl -s -X DELETE "http://consul:8500/v1/agent/service/deregister/$(hostname)" || true
```

Pass this in the instance template metadata:

```hcl
metadata = {
  shutdown-script = file("${path.module}/scripts/spot-shutdown.sh")
}
```
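The shutdown script only fires once termination is already underway. If you'd rather react the moment preemption is flagged, the GCE metadata server exposes an instance/preempted value you can poll from inside the VM. A sketch — it only works on a Compute Engine VM, and the checkpoint hook path is a placeholder, not from this post:

```shell
#!/bin/bash
# Sketch: watch the GCE metadata server for the preemption flag.
# Runs only inside a Compute Engine VM; checkpoint.sh is a hypothetical hook.
while true; do
  preempted=$(curl -s -H "Metadata-Flavor: Google" \
    "http://metadata.google.internal/computeMetadata/v1/instance/preempted")
  if [ "$preempted" = "TRUE" ]; then
    /usr/local/bin/checkpoint.sh  # your save-state logic (placeholder)
    break
  fi
  sleep 5
done
```

The metadata server also supports a hanging GET (?wait_for_change=true) so you can block on the value instead of polling in a loop.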

💡 Quick Reference: What to Do First

| Action | Effort | Savings |
| --- | --- | --- |
| Convert dev VMs to Spot | 5 min | 60-91% per VM |
| Convert CI/CD runners to Spot | 10 min | 60-91% on build costs |
| Create MIG with Spot template | 20 min | Auto-recovery + savings |
| Add shutdown scripts | 15 min | Graceful preemption handling |
| Mix Spot + on-demand for prod | 30 min | 40-60% on burst capacity |

Start with dev/staging VMs. It's 4 lines of Terraform, zero risk, and instant 70%+ savings. 🎯

📊 TL;DR

```text
Spot VMs             = 60-91% cheaper than on-demand
4 lines of Terraform = preemptible + automatic_restart + provisioning_model + termination_action
30 seconds warning   = use shutdown scripts to save state
MIG + Spot           = auto-recovery when VMs get preempted
Mix Spot + on-demand = cheap burst capacity with guaranteed baseline
Dev/staging          = switch to Spot TODAY, zero risk
Production DBs       = NEVER use Spot, data loss risk
Preemptible VMs      = legacy, use Spot VMs instead (no 24hr limit)
```

Bottom line: If you're running dev environments, CI/CD pipelines, or batch jobs on on-demand VMs, you're paying 3x more than you need to. Switch to Spot, add a MIG for auto-recovery, and keep the savings. 🔥


Go look at your dev project right now. Count the on-demand VMs, add up what they cost per month, and multiply by 0.7. That's roughly how much you're overpaying every month. Four lines of Terraform fixes it. 😀

Found this helpful? Follow for more GCP cost optimization with Terraform! 💬
