Suhas Mallesh

Spot the Savings 🎯 Run Non-Critical Workloads on Azure Spot VMs and Save Up to 90%

Your CI runners, batch jobs, and dev environments don't need guaranteed availability. Azure Spot VMs run on unused capacity at discounts of up to 90%. Here's how to deploy them safely with Terraform, handle evictions gracefully, and build a hybrid Spot + on-demand architecture.

A team runs 10 D4s_v5 VMs for CI/CD build agents at $0.192/hr each. All pay-as-you-go, all running 24/7. Monthly cost: $1,401. Switch to Spot VMs at ~$0.037/hr? Monthly cost: $270. That's $1,131/month saved, or $13,572/year. For build agents that can just retry a failed job if evicted. 🎯
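
If you want to rerun that arithmetic with your own rates, here's a minimal sketch. It assumes 730 billable hours per month (8,760 / 12), which is what the figures above appear to use; the function names are just illustrative. Results can differ from the article's numbers by a dollar because of rounding.

```python
HOURS_PER_MONTH = 730  # assumption: average hours in a month (8,760 / 12)


def monthly_cost(hourly_rate: float, vm_count: int = 1) -> float:
    """Cost of running vm_count VMs around the clock for one month."""
    return hourly_rate * HOURS_PER_MONTH * vm_count


def savings_percent(payg_rate: float, spot_rate: float) -> float:
    """Discount of the Spot rate relative to pay-as-you-go, in percent."""
    return (1 - spot_rate / payg_rate) * 100


# Rates from the example above: D4s_v5 at $0.192/hr on-demand, ~$0.037/hr Spot.
payg = monthly_cost(0.192, vm_count=10)
spot = monthly_cost(0.037, vm_count=10)

print(f"Pay-as-you-go: ${payg:,.2f}/month")
print(f"Spot:          ${spot:,.2f}/month")
print(f"Saved:         ${payg - spot:,.2f}/month")
print(f"Discount:      {savings_percent(0.192, 0.037):.0f}%")
```

Swap in the current Spot rate for your SKU and region, since Spot prices move over time.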

Azure Spot VMs use unused datacenter capacity at massive discounts:

Standard_D4s_v5 (4 vCPU, 16 GB, Linux, East US):
  Pay-as-you-go:  $0.192/hr   ($140/month)
  Spot price:     ~$0.037/hr   (~$27/month)
  Savings:        ~81%

Standard_D2s_v5 (2 vCPU, 8 GB, Linux, East US):
  Pay-as-you-go:  $0.096/hr   ($70/month)
  Spot price:     ~$0.019/hr   (~$14/month)
  Savings:        ~80%

The trade-off is simple: Azure can evict your Spot VM with as little as 30 seconds' notice when it needs the capacity back. For workloads that can handle interruptions (CI/CD, batch processing, dev/test, data pipelines), this is a no-brainer. For your production API serving live traffic? Absolutely not.

Let's build a Spot VM strategy with Terraform that's safe, cost-effective, and handles evictions gracefully. ⚡

🎯 What Makes a Good Spot VM Workload?

Great for Spot (can handle interruptions):

  • CI/CD build agents and test runners
  • Batch processing and data pipelines
  • Dev and test environments
  • Image/video rendering
  • Machine learning training jobs
  • Background workers with retry logic
  • Load testing and performance benchmarks

Never use Spot for (needs guaranteed availability):

  • Production APIs and web servers serving live traffic
  • Databases (primary or replica)
  • Domain controllers and DNS servers
  • Monitoring and alerting infrastructure
  • Long-running jobs (24h+) without checkpointing

The golden rule: if a 30-second eviction notice would cause data loss or user-facing downtime, don't use Spot.

⚡ Step 1: Single Spot VM with Terraform

The simplest Spot VM deployment - great for dev environments and single-purpose workers:

resource "azurerm_linux_virtual_machine" "spot_worker" {
  name                = "vm-worker-spot-01"
  resource_group_name = azurerm_resource_group.main.name
  location            = azurerm_resource_group.main.location
  size                = "Standard_D4s_v5"
  admin_username      = "azureadmin"

  # ──── Spot VM Configuration ────
  priority        = "Spot"
  eviction_policy = "Deallocate"  # or "Delete"
  max_bid_price   = -1            # Pay up to PAYG price (capacity-only eviction)

  network_interface_ids = [azurerm_network_interface.worker.id]

  admin_ssh_key {
    username   = "azureadmin"
    public_key = file("~/.ssh/id_rsa.pub")
  }

  os_disk {
    caching              = "ReadWrite"
    storage_account_type = "StandardSSD_LRS"  # No need for Premium on Spot
  }

  source_image_reference {
    publisher = "Canonical"
    offer     = "0001-com-ubuntu-server-jammy"
    sku       = "22_04-lts-gen2"
    version   = "latest"
  }

  tags = {
    Environment  = "dev"
    Priority     = "Spot"
    WorkloadType = "batch-worker"
    ManagedBy    = "terraform"
  }
}

Key parameters explained:

priority = "Spot" - Makes this a Spot VM instead of regular (on-demand).

eviction_policy = "Deallocate" - When evicted, the VM stops but the disk is preserved. You can restart it later when capacity is available. You still pay for disk storage. Use "Delete" for ephemeral workloads where you don't need the disk.

max_bid_price = -1 - Pay up to the standard pay-as-you-go rate. The VM is only evicted for capacity reasons, never for price reasons, which maximizes uptime. Set a specific hourly price in USD, such as 0.05, if you want a strict cost cap, at the expense of a higher eviction risk.
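
To make the -1 versus fixed-cap behavior concrete, here's a small model of the price rule described above. It's a sketch of the semantics, not Azure's actual implementation, and the function names are made up for illustration. Capacity evictions are a separate mechanism and aren't modeled here.

```python
def effective_price_cap(max_bid_price: float, payg_rate: float) -> float:
    """Return the hourly price above which the VM would be evicted for price.

    -1 means "no custom cap": the Spot price can rise to the pay-as-you-go rate.
    """
    if max_bid_price == -1:
        return payg_rate
    if max_bid_price <= 0:
        raise ValueError("max_bid_price must be -1 or a positive hourly price")
    return max_bid_price


def evicted_for_price(current_spot_price: float, max_bid_price: float, payg_rate: float) -> bool:
    """True when the current Spot price exceeds the cap."""
    return current_spot_price > effective_price_cap(max_bid_price, payg_rate)


# Spot price climbs from $0.037 to $0.06 on a D4s_v5 ($0.192 pay-as-you-go).
print(evicted_for_price(0.06, max_bid_price=0.05, payg_rate=0.192))  # strict cap
print(evicted_for_price(0.06, max_bid_price=-1, payg_rate=0.192))    # capacity-only
```

With a $0.05 cap the VM is evicted as soon as the price passes it; with -1 it keeps running and you simply pay the higher Spot rate.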

🏗️ Step 2: Spot VMSS for CI/CD Runners (Auto-Replacing)

Single Spot VMs don't auto-recover from eviction. For workloads that need auto-replacement, use a Virtual Machine Scale Set:

resource "azurerm_linux_virtual_machine_scale_set" "ci_runners" {
  name                = "vmss-ci-runners"
  resource_group_name = azurerm_resource_group.main.name
  location            = azurerm_resource_group.main.location
  sku                 = "Standard_D4s_v5"
  instances           = 5

  # ──── Spot Configuration ────
  priority        = "Spot"
  eviction_policy = "Delete"     # Delete on eviction, VMSS creates replacement
  max_bid_price   = -1           # Capacity-only eviction

  admin_username = "azureadmin"

  admin_ssh_key {
    username   = "azureadmin"
    public_key = file("~/.ssh/id_rsa.pub")
  }

  source_image_reference {
    publisher = "Canonical"
    offer     = "0001-com-ubuntu-server-jammy"
    sku       = "22_04-lts-gen2"
    version   = "latest"
  }

  os_disk {
    caching              = "ReadWrite"
    storage_account_type = "StandardSSD_LRS"
  }

  network_interface {
    name    = "ci-runner-nic"
    primary = true

    ip_configuration {
      name      = "internal"
      primary   = true
      subnet_id = azurerm_subnet.ci.id
    }
  }

  # Auto-repair: replace instances reported unhealthy
  # (requires a load balancer health probe or the Application Health extension)
  automatic_instance_repair {
    enabled      = true
    grace_period = "PT10M"
  }

  tags = {
    Environment  = "shared"
    Priority     = "Spot"
    WorkloadType = "ci-runner"
    ManagedBy    = "terraform"
  }
}

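When a Spot instance gets evicted, the VMSS detects the missing instance and provisions a replacement if capacity is available. Note that automatic instance repair acts on health signals, so it only takes effect once a load balancer health probe or the Application Health extension is configured. Your CI/CD pipeline retries the failed build, the new runner picks it up, and work continues. No manual intervention. 🔄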
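
Most CI systems have a built-in retry setting, so check that first. If yours doesn't, the retry side of this pattern can be sketched as below. `RunnerEvicted` is a hypothetical exception standing in for however your pipeline reports a lost runner, and the backoff values are arbitrary.

```python
import time


class RunnerEvicted(Exception):
    """Hypothetical error raised when the runner executing a job disappears."""


def run_with_retry(job, max_attempts=3, backoff_seconds=30.0, sleep=time.sleep):
    """Run job() and retry it when the runner is evicted mid-build."""
    for attempt in range(1, max_attempts + 1):
        try:
            return job()
        except RunnerEvicted:
            if attempt == max_attempts:
                raise  # give up and surface the failure to the pipeline
            # Linear backoff gives a replacement runner time to boot.
            sleep(backoff_seconds * attempt)
```

Only eviction errors are retried here. A genuine build failure should still fail fast instead of burning three attempts.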

🔧 Step 3: Hybrid Architecture - Spot + On-Demand Fallback

The production-grade pattern: a small baseline of guaranteed on-demand VMs with Spot VMs handling the bulk of the work:

# ──── Baseline: 2 On-Demand VMs (always available) ────
resource "azurerm_linux_virtual_machine_scale_set" "baseline_runners" {
  name                = "vmss-ci-baseline"
  resource_group_name = azurerm_resource_group.main.name
  location            = azurerm_resource_group.main.location
  sku                 = "Standard_D2s_v5"
  instances           = 2

  priority = "Regular"  # On-demand, never evicted

  admin_username = "azureadmin"
  admin_ssh_key {
    username   = "azureadmin"
    public_key = file("~/.ssh/id_rsa.pub")
  }

  source_image_reference {
    publisher = "Canonical"
    offer     = "0001-com-ubuntu-server-jammy"
    sku       = "22_04-lts-gen2"
    version   = "latest"
  }

  os_disk {
    caching              = "ReadWrite"
    storage_account_type = "StandardSSD_LRS"
  }

  network_interface {
    name    = "baseline-nic"
    primary = true
    ip_configuration {
      name      = "internal"
      primary   = true
      subnet_id = azurerm_subnet.ci.id
    }
  }

  tags = {
    Role     = "ci-baseline"
    Priority = "Regular"
  }
}

# ──── Burst: 8 Spot VMs (cheap capacity for parallel builds) ────
resource "azurerm_linux_virtual_machine_scale_set" "spot_runners" {
  name                = "vmss-ci-spot"
  resource_group_name = azurerm_resource_group.main.name
  location            = azurerm_resource_group.main.location
  sku                 = "Standard_D4s_v5"
  instances           = 8

  priority        = "Spot"
  eviction_policy = "Delete"
  max_bid_price   = -1

  admin_username = "azureadmin"
  admin_ssh_key {
    username   = "azureadmin"
    public_key = file("~/.ssh/id_rsa.pub")
  }

  source_image_reference {
    publisher = "Canonical"
    offer     = "0001-com-ubuntu-server-jammy"
    sku       = "22_04-lts-gen2"
    version   = "latest"
  }

  os_disk {
    caching              = "ReadWrite"
    storage_account_type = "StandardSSD_LRS"
  }

  network_interface {
    name    = "spot-nic"
    primary = true
    ip_configuration {
      name      = "internal"
      primary   = true
      subnet_id = azurerm_subnet.ci.id
    }
  }

  automatic_instance_repair {
    enabled      = true
    grace_period = "PT10M"
  }

  tags = {
    Role     = "ci-burst"
    Priority = "Spot"
  }
}

The cost math for this hybrid pattern:

All on-demand (10 x D4s_v5):
  10 x ~$140/month ≈ $1,401/month

Hybrid (2 baseline + 8 Spot):
  Baseline: 2 x D2s_v5 x $70/month  = $140
  Spot:     8 x D4s_v5 x ~$27/month = $216
  Total:                             = $356/month

Monthly savings: $1,045
Annual savings:  $12,540 💰

You keep 2 guaranteed runners that are always available for critical builds. The 8 Spot runners handle parallel workloads at a ~81% discount. If Spot capacity drops, you lose burst capacity but never lose your baseline. Builds queue up for a while but never stop completely.
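
Here's a rough way to model that trade-off for other pool sizes: what the fleet costs, and how much of its capacity survives if every Spot VM is evicted at once. This is a sketch assuming 730 hours per month and the rates and vCPU counts quoted earlier; the `Pool` class is made up for illustration.

```python
from dataclasses import dataclass

HOURS_PER_MONTH = 730  # assumption: average hours in a month


@dataclass
class Pool:
    name: str
    vm_count: int
    hourly_rate: float  # USD per VM per hour
    vcpus_per_vm: int
    spot: bool

    def monthly_cost(self) -> float:
        return self.vm_count * self.hourly_rate * HOURS_PER_MONTH

    def vcpus(self) -> int:
        return self.vm_count * self.vcpus_per_vm


def fleet_cost(pools) -> float:
    """Total monthly cost across all pools."""
    return sum(pool.monthly_cost() for pool in pools)


def guaranteed_share(pools) -> float:
    """Fraction of vCPUs that remain if all Spot pools are evicted."""
    total = sum(pool.vcpus() for pool in pools)
    guaranteed = sum(pool.vcpus() for pool in pools if not pool.spot)
    return guaranteed / total


all_on_demand = [Pool("on-demand D4s_v5", 10, 0.192, 4, spot=False)]
hybrid = [
    Pool("baseline D2s_v5", 2, 0.096, 2, spot=False),
    Pool("burst D4s_v5", 8, 0.037, 4, spot=True),
]

monthly_savings = fleet_cost(all_on_demand) - fleet_cost(hybrid)
print(f"Hybrid cost:      ${fleet_cost(hybrid):,.2f}/month")
print(f"Monthly savings:  ${monthly_savings:,.2f}")
print(f"Guaranteed vCPUs: {guaranteed_share(hybrid):.0%} of the fleet")
```

The guaranteed share is the number to watch. If the baseline can't keep critical builds moving on its own, size it up before adding more Spot.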

🔍 Step 4: AKS Spot Node Pools

Running Kubernetes? Add a Spot node pool for non-critical workloads:

resource "azurerm_kubernetes_cluster_node_pool" "spot" {
  name                  = "spotpool"
  kubernetes_cluster_id = azurerm_kubernetes_cluster.main.id
  vm_size               = "Standard_D4s_v5"
  node_count            = 3

  priority        = "Spot"
  eviction_policy = "Delete"
  spot_max_price  = -1

  os_disk_type    = "Ephemeral"
  os_disk_size_gb = 128

  auto_scaling_enabled = true
  min_count            = 1
  max_count            = 10

  node_labels = {
    "kubernetes.azure.com/scalesetpriority" = "spot"
  }

  node_taints = [
    "kubernetes.azure.com/scalesetpriority=spot:NoSchedule"
  ]

  tags = {
    Priority = "Spot"
  }
}

Use Kubernetes tolerations and node selectors in your deployments to schedule fault-tolerant workloads on Spot nodes while keeping critical services on the regular node pool. The NoSchedule taint prevents pods from accidentally landing on Spot nodes unless they explicitly opt in.
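
For reference, this is roughly what the opt-in looks like on the workload side. The sketch builds the relevant part of a pod spec as a plain dictionary and prints it as JSON, which kubectl accepts alongside YAML. The container name and image are placeholders, not anything from this setup.

```python
import json

SPOT_KEY = "kubernetes.azure.com/scalesetpriority"


def spot_pod_spec(image: str) -> dict:
    """Pod spec fragment that tolerates the Spot taint and targets Spot nodes."""
    return {
        # Toleration: allows scheduling onto nodes carrying the Spot taint.
        "tolerations": [
            {
                "key": SPOT_KEY,
                "operator": "Equal",
                "value": "spot",
                "effect": "NoSchedule",
            }
        ],
        # Node selector: restricts the pod to nodes labeled as Spot.
        "nodeSelector": {SPOT_KEY: "spot"},
        "containers": [{"name": "worker", "image": image}],
    }


# Placeholder image name for illustration only.
print(json.dumps(spot_pod_spec("example.azurecr.io/batch-worker:latest"), indent=2))
```

The toleration alone only permits a pod to land on Spot nodes. Pairing it with the node selector is what actually keeps the pod off your regular pool.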

⚡ Quick Audit: Check Your Spot VM Opportunities

# List all VMs and their priority (Regular vs Spot)
az vm list -d \
  --query "[].{Name:name, Size:hardwareProfile.vmSize, Priority:priority, State:powerState, RG:resourceGroup}" \
  --output table

# Check spot pricing for a specific SKU
az rest --method get \
  --url "https://management.azure.com/subscriptions/{sub-id}/providers/Microsoft.Compute/spotPlacementRecommender/generate?api-version=2024-07-01" 2>/dev/null || \
  echo "Use Azure Portal > Create VM > Spot > View pricing history for eviction rates"

# Find VMs tagged as dev/test that could be Spot candidates
az vm list \
  --query "[?tags.Environment=='dev' || tags.Environment=='test'].{Name:name, Size:hardwareProfile.vmSize, Priority:priority}" \
  --output table

Every VM in that last output that shows priority: null (Regular) is a candidate for Spot. Dev and test VMs almost always qualify. 🔥
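
If the list is long, you can filter it in a few lines instead of eyeballing the table. This sketch assumes you ran the last query with `--output json` instead of `--output table`, so each entry has the Name, Size, and Priority keys from the query. It also skips B-series sizes, which don't offer Spot. The VM names in the sample are made up.

```python
import json


def spot_candidates(vms):
    """Names of Regular-priority VMs whose size family supports Spot."""
    candidates = []
    for vm in vms:
        is_regular = vm.get("Priority") in (None, "Regular")
        # B-series has no Spot option, so it can't be converted.
        is_b_series = (vm.get("Size") or "").startswith("Standard_B")
        if is_regular and not is_b_series:
            candidates.append(vm["Name"])
    return candidates


# Sample shaped like the JSON output of the dev/test query (hypothetical VMs).
sample = json.loads("""
[
  {"Name": "vm-dev-api",   "Size": "Standard_D4s_v5", "Priority": null},
  {"Name": "vm-dev-batch", "Size": "Standard_D2s_v5", "Priority": "Spot"},
  {"Name": "vm-test-tiny", "Size": "Standard_B2s",    "Priority": null}
]
""")

print(spot_candidates(sample))  # ['vm-dev-api']
```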

💡 Architect Pro Tips

  • Set max_bid_price = -1 for maximum uptime. This means "pay up to the regular PAYG rate." You only get evicted for capacity reasons, not price spikes. In practice, Spot prices are quite stable for most SKUs. Setting a specific price cap increases your eviction risk significantly.

  • Use eviction_policy = "Delete" for stateless workloads. Deallocated VMs still incur disk storage costs and count against your quota. For CI runners, batch workers, and other stateless jobs, Delete is cleaner and cheaper.

  • Diversify VM sizes to increase availability. Instead of requesting 10 Spot D4s_v5 VMs, consider accepting D4s_v5, D4as_v5, and D4ds_v5. Different SKUs pull from different capacity pools, reducing your eviction risk.

  • B-series VMs don't support Spot pricing. If you're already running cheap B-series for dev, you can't make them cheaper with Spot. Spot is most impactful for D-series and larger SKUs where the discount is substantial.

  • Start small: 10-20% of your workload on Spot. Monitor eviction rates for 2-4 weeks, then scale up. This gives you real data on how often evictions happen for your chosen SKU and region.

  • Use Azure Scheduled Events for graceful shutdown. Your applications can poll the Instance Metadata Service (IMDS) endpoint to detect pending evictions and save state before the 30-second deadline.

  • Check eviction rates before committing. In the Azure Portal, go to Create VM, select Spot, then click "View pricing history and compare prices in nearby regions." This shows historical eviction rates and pricing trends. Pick SKUs and regions with lower eviction rates.

  • Combine Spot with Azure Hybrid Benefit. If you're running Windows VMs, you still save on the Spot discount AND the licensing cost. The two benefits stack.
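
The Scheduled Events tip above can be sketched like this. The code polls the IMDS Scheduled Events endpoint from inside the VM and looks for events of type Preempt, which is how Spot evictions are announced. The API version, polling interval, and the `save_state` callback are assumptions to adapt to your workload.

```python
import json
import time
import urllib.request

# IMDS is only reachable from inside an Azure VM. API version is an assumption.
IMDS_URL = "http://169.254.169.254/metadata/scheduledevents?api-version=2020-07-01"


def fetch_scheduled_events(timeout=5.0):
    """Fetch the current Scheduled Events document from IMDS."""
    request = urllib.request.Request(IMDS_URL, headers={"Metadata": "true"})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


def pending_preemptions(document):
    """Return the events that announce a Spot eviction."""
    return [e for e in document.get("Events", []) if e.get("EventType") == "Preempt"]


def watch_for_eviction(save_state, poll_seconds=1.0,
                       fetch=fetch_scheduled_events, sleep=time.sleep):
    """Poll until a Preempt event appears, then call save_state(events) once."""
    while True:
        events = pending_preemptions(fetch())
        if events:
            save_state(events)  # checkpoint, drain the job queue, flush logs
            return events
        sleep(poll_seconds)
```

Run `watch_for_eviction(your_checkpoint_function)` in a background thread or sidecar process. With only about 30 seconds to work with, keep the polling interval short and the save step fast.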

📊 TL;DR

| Workload | Architecture | On-Demand Cost | With Spot | Annual Savings |
|---|---|---|---|---|
| 10 CI runners | All Spot VMSS | $1,401/mo | $270/mo | $13,572 |
| 10 CI runners | 2 baseline + 8 Spot | $1,401/mo | $356/mo | $12,540 |
| 5 batch workers | All Spot | $701/mo | $135/mo | $6,792 |
| 3 AKS Spot nodes | Spot node pool | $420/mo | $81/mo | $4,068 |
| Dev environment | Single Spot VMs | $280/mo | $54/mo | $2,712 |

Bottom line: Spot VMs are the single largest discount available on Azure compute: up to 90% off. The only cost is accepting the possibility of eviction. For any workload that can retry, checkpoint, or degrade gracefully, there's no reason to pay full price. Build a hybrid architecture with on-demand baseline and Spot burst capacity, and you get reliability with massive savings. 🎯


Go check your dev/test VMs right now. Count how many are running as Regular priority when they could be Spot. Multiply the count by ~$100/month savings per VM. That's money you're leaving on the table every single month. 😀

This is Part 8 of the "Save on Azure with Terraform" series. Next up: FinOps as Code 🧮, building a complete cost governance framework with Terraform and Azure Policy.
