That e2-standard-4 running your batch jobs costs $0.134/hour. The same VM as a Spot instance? $0.040/hour. That's 70% off for changing 4 lines of Terraform. If your workload can handle interruptions, you're throwing money away running on-demand.
GCP Spot VMs run on spare Compute Engine capacity at 60-91% discounts compared to on-demand pricing. The catch? Google can reclaim them at any time with just 30 seconds' notice. But here's the thing: most dev environments, batch jobs, CI/CD pipelines, and data processing workloads don't need guaranteed uptime. They just need cheap compute.
Let's set them up properly with Terraform so you save thousands without losing sleep.
📊 Spot vs On-Demand: The Real Numbers
| Machine Type | On-Demand/hr | Spot/hr | Savings | Monthly Savings (24/7) |
|---|---|---|---|---|
| e2-medium (2 vCPU, 4GB) | $0.067 | $0.020 | 70% | ~$34 |
| e2-standard-4 (4 vCPU, 16GB) | $0.134 | $0.040 | 70% | ~$68 |
| n2-standard-8 (8 vCPU, 32GB) | $0.388 | $0.117 | 70% | ~$196 |
| c2-standard-16 (16 vCPU, 64GB) | $0.835 | $0.250 | 70% | ~$421 |
| GPU (T4) | $0.35 | $0.11 | 69% | ~$173 |
Multiply that by 10 n2-standard-8 dev VMs and you're looking at roughly $2,000/month saved with zero architecture changes.
⚠️ Spot pricing is dynamic and varies by region. These are approximate US region prices. Always check the GCP pricing calculator for current rates.
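If you want to sanity-check the table's numbers inside your own config, a Terraform `locals` block makes the arithmetic explicit (a sketch; the rates are the approximate US prices from the table above):

```hcl
locals {
  ondemand_hr = 0.388 # n2-standard-8 on-demand, approx. US price
  spot_hr     = 0.117 # n2-standard-8 Spot, approx. US price
  vm_count    = 10
  hours_month = 730

  # ~ $1,978/month saved for a 10-VM fleet
  monthly_savings = local.vm_count * (local.ondemand_hr - local.spot_hr) * local.hours_month
}
```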
🔧 Step 1: Convert Any VM to Spot with 4 Lines
The magic is in the scheduling block. Here's how a standard VM becomes a Spot VM:
```hcl
resource "google_compute_instance" "worker" {
  name         = "batch-worker-01"
  machine_type = "e2-standard-4"
  zone         = "us-central1-a"

  boot_disk {
    initialize_params {
      image = "debian-cloud/debian-12"
    }
  }

  # 👇 These 4 lines save you 60-91%
  scheduling {
    preemptible                 = true
    automatic_restart           = false
    provisioning_model          = "SPOT"
    instance_termination_action = "STOP" # or "DELETE"
  }

  network_interface {
    network    = "default"
    subnetwork = var.subnet_id
  }

  labels = local.common_labels
}
```
⚠️ Gotcha: You need ALL three of `preemptible = true`, `automatic_restart = false`, AND `provisioning_model = "SPOT"`. Missing any one of them will either fail validation or silently create an on-demand VM. The `instance_termination_action` controls what happens when Google reclaims the VM: `STOP` preserves the disk so you can restart later, `DELETE` destroys everything.
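If you'd rather have Terraform fail loudly than silently hand you an on-demand VM, a `postcondition` (Terraform 1.2+; this is a sketch layered onto the example above) can assert the provisioning model after apply:

```hcl
resource "google_compute_instance" "worker" {
  # ... same configuration as above ...

  lifecycle {
    postcondition {
      condition     = self.scheduling[0].provisioning_model == "SPOT"
      error_message = "batch-worker-01 was provisioned on-demand, not as a Spot VM."
    }
  }
}
```

With this in place, `terraform apply` errors out instead of quietly billing you the full on-demand rate.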
🔄 Step 2: Auto-Recover with Managed Instance Groups
A single Spot VM that gets preempted stays down. A Managed Instance Group (MIG) automatically recreates it when capacity comes back. This is the production-ready pattern.
First, create an instance template:
```hcl
resource "google_compute_instance_template" "spot_worker" {
  name_prefix  = "spot-worker-"
  machine_type = "e2-standard-4"
  region       = var.region

  disk {
    source_image = "debian-cloud/debian-12"
    auto_delete  = true
    boot         = true
    disk_size_gb = 20
  }

  scheduling {
    preemptible                 = true
    automatic_restart           = false
    provisioning_model          = "SPOT"
    instance_termination_action = "STOP"
  }

  network_interface {
    network    = var.network_id
    subnetwork = var.subnet_id
  }

  # Shutdown script: save state before Google pulls the plug
  metadata = {
    shutdown-script = <<-EOF
      #!/bin/bash
      echo "$(date): Spot VM preempted, saving state..." >> /var/log/preemption.log
      # Add your checkpoint/save logic here
      # e.g., gsutil cp /tmp/progress.json gs://my-bucket/checkpoints/
    EOF
  }

  labels = local.common_labels

  lifecycle {
    create_before_destroy = true
  }
}
```
Then, create the MIG with auto-healing:
```hcl
resource "google_compute_region_instance_group_manager" "spot_workers" {
  name               = "spot-worker-mig"
  base_instance_name = "spot-worker"
  region             = var.region

  version {
    instance_template = google_compute_instance_template.spot_worker.id
  }

  target_size = var.desired_instances

  # Spread across zones for better Spot availability
  distribution_policy_zones = [
    "us-central1-a",
    "us-central1-b",
    "us-central1-c",
  ]

  update_policy {
    type                  = "PROACTIVE"
    minimal_action        = "REPLACE"
    max_surge_fixed       = 3
    max_unavailable_fixed = 0
  }

  auto_healing_policies {
    health_check      = google_compute_health_check.worker.id
    initial_delay_sec = 120
  }
}

resource "google_compute_health_check" "worker" {
  name                = "spot-worker-health"
  check_interval_sec  = 30
  timeout_sec         = 10
  healthy_threshold   = 2
  unhealthy_threshold = 3

  tcp_health_check {
    port = 8080
  }
}
```
Now when Google preempts a Spot VM, the MIG automatically requests a new one. No manual intervention. No pager alerts at 3 AM. 🎉
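The MIG above uses a fixed `target_size`. If you also want it to scale with load, you can attach a regional autoscaler (a sketch; the CPU target and replica bounds are assumptions, not recommendations):

```hcl
resource "google_compute_region_autoscaler" "spot_workers" {
  name   = "spot-worker-autoscaler"
  region = var.region
  target = google_compute_region_instance_group_manager.spot_workers.id

  autoscaling_policy {
    min_replicas = 2
    max_replicas = 10

    # Add Spot capacity when average CPU exceeds 60%
    cpu_utilization {
      target = 0.6
    }
  }
}
```

When an autoscaler manages the group, drop `target_size` from the MIG (or add `lifecycle { ignore_changes = [target_size] }`) so Terraform and the autoscaler don't fight over the instance count.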
🎯 Step 3: Mix Spot and On-Demand for Production
For production workloads, run a baseline on on-demand and burst on Spot. This gives you guaranteed capacity plus cheap scaling:
```hcl
# Baseline: always-on, on-demand instances
resource "google_compute_instance_template" "ondemand_baseline" {
  name_prefix  = "baseline-"
  machine_type = "e2-standard-4"

  disk {
    source_image = "debian-cloud/debian-12"
    auto_delete  = true
    boot         = true
  }

  scheduling {
    provisioning_model = "STANDARD" # On-demand
    automatic_restart  = true
  }

  network_interface {
    network    = var.network_id
    subnetwork = var.subnet_id
  }

  labels = merge(local.common_labels, { tier = "baseline" })
}

# Burst: cheap Spot instances for extra capacity
resource "google_compute_instance_template" "spot_burst" {
  name_prefix  = "burst-"
  machine_type = "e2-standard-4"

  disk {
    source_image = "debian-cloud/debian-12"
    auto_delete  = true
    boot         = true
  }

  scheduling {
    preemptible                 = true
    automatic_restart           = false
    provisioning_model          = "SPOT"
    instance_termination_action = "DELETE"
  }

  network_interface {
    network    = var.network_id
    subnetwork = var.subnet_id
  }

  labels = merge(local.common_labels, { tier = "burst" })
}
```
Your load balancer routes to both pools. When traffic spikes, autoscaling adds cheap Spot VMs. When Google reclaims them, the on-demand baseline keeps serving.
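Wiring both pools into one backend service might look like this (a sketch; it assumes MIGs named `baseline` and `burst` built from the two templates above, reuses the health check from Step 2, and omits details like the load-balancing scheme and named ports):

```hcl
resource "google_compute_backend_service" "workers" {
  name          = "worker-backend"
  protocol      = "HTTP"
  health_checks = [google_compute_health_check.worker.id]

  # Guaranteed capacity: on-demand baseline MIG
  backend {
    group = google_compute_region_instance_group_manager.baseline.instance_group
  }

  # Cheap burst capacity: Spot MIG
  backend {
    group = google_compute_region_instance_group_manager.burst.instance_group
  }
}
```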
🧠 What Works on Spot (and What Doesn't)
| Workload | Spot? | Why |
|---|---|---|
| Dev/staging environments | ✅ Yes | Nobody cares if dev goes down for 2 minutes |
| CI/CD build runners | ✅ Yes | Builds restart automatically |
| Batch data processing | ✅ Yes | Checkpoint and resume |
| ML training (with checkpoints) | ✅ Yes | Save model every N epochs |
| Stateless web frontends (in MIG) | ✅ Yes | MIG replaces preempted VMs instantly |
| Production databases | ❌ No | Data loss risk, needs guaranteed uptime |
| Stateful singleton services | ❌ No | Can't handle random restarts |
| Real-time payment processing | ❌ No | 30 seconds of downtime = lost revenue |
🛡️ Step 4: The Shutdown Script Safety Net
You get 30 seconds before Google terminates a Spot VM. Use that time wisely:
```bash
#!/bin/bash
# shutdown-script for Spot VMs

# Save application state
echo "Preemption detected at $(date)" >> /var/log/preemption.log

# Drain connections gracefully
if systemctl is-active --quiet nginx; then
  /usr/sbin/nginx -s quit
fi

# Upload checkpoint to GCS
CHECKPOINT_FILE="/tmp/job-checkpoint.json"
if [ -f "$CHECKPOINT_FILE" ]; then
  gsutil cp "$CHECKPOINT_FILE" "gs://${BUCKET}/checkpoints/$(hostname)-$(date +%s).json"
fi

# Deregister from service discovery
curl -s -X DELETE "http://consul:8500/v1/agent/service/deregister/$(hostname)" || true
```
Pass this in the instance template metadata:
```hcl
metadata = {
  shutdown-script = file("${path.module}/scripts/spot-shutdown.sh")
}
```
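The script above reads a `BUCKET` variable that it never defines. One common approach (a sketch, assuming a hypothetical `var.checkpoint_bucket` input) is to pass the bucket name as a custom metadata attribute alongside the script:

```hcl
metadata = {
  shutdown-script   = file("${path.module}/scripts/spot-shutdown.sh")
  checkpoint-bucket = var.checkpoint_bucket # hypothetical variable
}
```

Inside the script, the value can then be fetched from the metadata server with `curl -s -H "Metadata-Flavor: Google" http://metadata.google.internal/computeMetadata/v1/instance/attributes/checkpoint-bucket`. To test the whole path end to end, `gcloud compute instances simulate-maintenance-event` triggers a preemption (and the shutdown script) on a Spot VM.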
💡 Quick Reference: What to Do First
| Action | Effort | Savings |
|---|---|---|
| Convert dev VMs to Spot | 5 min | 60-91% per VM |
| Convert CI/CD runners to Spot | 10 min | 60-91% on build costs |
| Create MIG with Spot template | 20 min | Auto-recovery + savings |
| Add shutdown scripts | 15 min | Graceful preemption handling |
| Mix Spot + on-demand for prod | 30 min | 40-60% on burst capacity |
Start with dev/staging VMs. It's 4 lines of Terraform, zero risk, and instant 70%+ savings. 🎯
📊 TL;DR
- **Spot VMs** = 60-91% cheaper than on-demand
- **4 lines of Terraform** = `preemptible` + `automatic_restart` + `provisioning_model` + `instance_termination_action`
- **30 seconds' warning** = use shutdown scripts to save state
- **MIG + Spot** = auto-recovery when VMs get preempted
- **Mix Spot + on-demand** = cheap burst capacity with a guaranteed baseline
- **Dev/staging** = switch to Spot TODAY, zero risk
- **Production DBs** = NEVER use Spot, data loss risk
- **Preemptible VMs** = legacy; use Spot VMs instead (no 24-hour runtime limit)
Bottom line: If you're running dev environments, CI/CD pipelines, or batch jobs on on-demand VMs, you're paying 3x more than you need to. Switch to Spot, add a MIG for auto-recovery, and keep the savings. 🔥
Go look at your dev project right now. Add up what your on-demand VMs cost per month. Multiply by 0.7. That's roughly how much you're overpaying every month. Four lines of Terraform fixes it. 😀
Found this helpful? Follow for more GCP cost optimization with Terraform! 💬