Your CI runners, batch jobs, and dev environments don't need guaranteed availability. Azure Spot VMs use unused capacity at up to 90% discount. Here's how to deploy them safely with Terraform, handle evictions gracefully, and build a hybrid Spot + on-demand architecture.
A team runs 10 D4s_v5 VMs for CI/CD build agents at $0.192/hr each. All pay-as-you-go, all running 24/7. Monthly cost: $1,401. Switch to Spot VMs at ~$0.037/hr? Monthly cost: $270. That's $1,131/month saved, or $13,572/year, on build agents that can simply retry a failed job if evicted. 🎯
Azure Spot VMs use unused datacenter capacity at massive discounts:
```text
Standard_D4s_v5 (4 vCPU, 16 GB, Linux, East US):
  Pay-as-you-go: $0.192/hr  ($140/month)
  Spot price:    ~$0.037/hr (~$27/month)
  Savings:       ~81%

Standard_D2s_v5 (2 vCPU, 8 GB, Linux, East US):
  Pay-as-you-go: $0.096/hr  ($70/month)
  Spot price:    ~$0.019/hr (~$14/month)
  Savings:       ~80%
```
The trade-off is simple: Azure can evict your Spot VM with only 30 seconds' notice when it needs the capacity back. For workloads that can handle interruptions (CI/CD, batch processing, dev/test, data pipelines), this is a no-brainer. For your production API serving live traffic? Absolutely not.
Let's build a Spot VM strategy with Terraform that's safe, cost-effective, and handles evictions gracefully. ⚡
🎯 What Makes a Good Spot VM Workload?
Great for Spot (can handle interruptions):
- CI/CD build agents and test runners
- Batch processing and data pipelines
- Dev and test environments
- Image/video rendering
- Machine learning training jobs
- Background workers with retry logic
- Load testing and performance benchmarks
Never use Spot for (needs guaranteed availability):
- Production APIs and web servers serving live traffic
- Databases (primary or replica)
- Domain controllers and DNS servers
- Monitoring and alerting infrastructure
- Long-running jobs (24h+) without checkpointing
The golden rule: if a 30-second eviction notice would cause data loss or user-facing downtime, don't use Spot.
⚡ Step 1: Single Spot VM with Terraform
The simplest Spot VM deployment - great for dev environments and single-purpose workers:
```hcl
resource "azurerm_linux_virtual_machine" "spot_worker" {
  name                = "vm-worker-spot-01"
  resource_group_name = azurerm_resource_group.main.name
  location            = azurerm_resource_group.main.location
  size                = "Standard_D4s_v5"
  admin_username      = "azureadmin"

  # ──── Spot VM Configuration ────
  priority        = "Spot"
  eviction_policy = "Deallocate" # or "Delete"
  max_bid_price   = -1           # Pay up to PAYG price (capacity-only eviction)

  network_interface_ids = [azurerm_network_interface.worker.id]

  admin_ssh_key {
    username   = "azureadmin"
    public_key = file("~/.ssh/id_rsa.pub")
  }

  os_disk {
    caching              = "ReadWrite"
    storage_account_type = "StandardSSD_LRS" # No need for Premium on Spot
  }

  source_image_reference {
    publisher = "Canonical"
    offer     = "0001-com-ubuntu-server-jammy"
    sku       = "22_04-lts-gen2"
    version   = "latest"
  }

  tags = {
    Environment  = "dev"
    Priority     = "Spot"
    WorkloadType = "batch-worker"
    ManagedBy    = "terraform"
  }
}
```
Key parameters explained:
- priority = "Spot" - Makes this a Spot VM instead of a regular (on-demand) VM.
- eviction_policy = "Deallocate" - When evicted, the VM stops but the disk is preserved. You can restart it later when capacity is available. You still pay for disk storage. Use "Delete" for ephemeral workloads where you don't need the disk.
- max_bid_price = -1 - Pay up to the standard pay-as-you-go rate. The VM is only evicted for capacity reasons, never for price reasons. This maximizes uptime. Set a specific hourly price in USD, such as 0.05, if you want a strict cost cap. The cost of a cap is higher eviction risk: the VM is evicted whenever the Spot price rises above it.
🏗️ Step 2: Spot VMSS for CI/CD Runners (Auto-Replacing)
Single Spot VMs don't auto-recover from eviction. For workloads that need auto-replacement, use a Virtual Machine Scale Set:
```hcl
resource "azurerm_linux_virtual_machine_scale_set" "ci_runners" {
  name                = "vmss-ci-runners"
  resource_group_name = azurerm_resource_group.main.name
  location            = azurerm_resource_group.main.location
  sku                 = "Standard_D4s_v5"
  instances           = 5

  # ──── Spot Configuration ────
  priority        = "Spot"
  eviction_policy = "Delete" # Evicted instances are deleted along with their disks
  max_bid_price   = -1       # Capacity-only eviction

  # Spot Try & Restore: try to re-create evicted instances when capacity returns
  spot_restore {
    enabled = true
    timeout = "PT1H"
  }

  admin_username = "azureadmin"

  admin_ssh_key {
    username   = "azureadmin"
    public_key = file("~/.ssh/id_rsa.pub")
  }

  source_image_reference {
    publisher = "Canonical"
    offer     = "0001-com-ubuntu-server-jammy"
    sku       = "22_04-lts-gen2"
    version   = "latest"
  }

  os_disk {
    caching              = "ReadWrite"
    storage_account_type = "StandardSSD_LRS"
  }

  network_interface {
    name    = "ci-runner-nic"
    primary = true

    ip_configuration {
      name      = "internal"
      primary   = true
      subnet_id = azurerm_subnet.ci.id
    }
  }

  # Auto-repair: replace instances reported unhealthy.
  # It needs a health signal (health_probe_id or the Application Health extension) to work.
  automatic_instance_repair {
    enabled      = true
    grace_period = "PT10M"
  }

  tags = {
    Environment  = "shared"
    Priority     = "Spot"
    WorkloadType = "ci-runner"
    ManagedBy    = "terraform"
  }
}
```
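With the Delete policy, an evicted instance is removed and the scale set drops below its target count. Replacement is not automatic by default. Automatic instance repair only acts on instances reported unhealthy, and it needs a health probe or the Application Health extension. Restoring evicted Spot capacity relies on the scale set's Spot Try & Restore setting (the spot_restore block in Terraform) or on autoscaling, and only works when capacity is available. Once a replacement runner is up, your CI/CD pipeline retries the failed build and work continues with no manual intervention. 🔄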
🔧 Step 3: Hybrid Architecture - Spot + On-Demand Fallback
The production-grade pattern: a small baseline of guaranteed on-demand VMs with Spot VMs handling the bulk of the work:
```hcl
# ──── Baseline: 2 On-Demand VMs (always available) ────
resource "azurerm_linux_virtual_machine_scale_set" "baseline_runners" {
  name                = "vmss-ci-baseline"
  resource_group_name = azurerm_resource_group.main.name
  location            = azurerm_resource_group.main.location
  sku                 = "Standard_D2s_v5"
  instances           = 2

  priority = "Regular" # On-demand, never evicted

  admin_username = "azureadmin"

  admin_ssh_key {
    username   = "azureadmin"
    public_key = file("~/.ssh/id_rsa.pub")
  }

  source_image_reference {
    publisher = "Canonical"
    offer     = "0001-com-ubuntu-server-jammy"
    sku       = "22_04-lts-gen2"
    version   = "latest"
  }

  os_disk {
    caching              = "ReadWrite"
    storage_account_type = "StandardSSD_LRS"
  }

  network_interface {
    name    = "baseline-nic"
    primary = true

    ip_configuration {
      name      = "internal"
      primary   = true
      subnet_id = azurerm_subnet.ci.id
    }
  }

  tags = {
    Role     = "ci-baseline"
    Priority = "Regular"
  }
}

# ──── Burst: 8 Spot VMs (cheap capacity for parallel builds) ────
resource "azurerm_linux_virtual_machine_scale_set" "spot_runners" {
  name                = "vmss-ci-spot"
  resource_group_name = azurerm_resource_group.main.name
  location            = azurerm_resource_group.main.location
  sku                 = "Standard_D4s_v5"
  instances           = 8

  priority        = "Spot"
  eviction_policy = "Delete"
  max_bid_price   = -1

  # Spot Try & Restore: try to re-create evicted instances when capacity returns
  spot_restore {
    enabled = true
    timeout = "PT1H"
  }

  admin_username = "azureadmin"

  admin_ssh_key {
    username   = "azureadmin"
    public_key = file("~/.ssh/id_rsa.pub")
  }

  source_image_reference {
    publisher = "Canonical"
    offer     = "0001-com-ubuntu-server-jammy"
    sku       = "22_04-lts-gen2"
    version   = "latest"
  }

  os_disk {
    caching              = "ReadWrite"
    storage_account_type = "StandardSSD_LRS"
  }

  network_interface {
    name    = "spot-nic"
    primary = true

    ip_configuration {
      name      = "internal"
      primary   = true
      subnet_id = azurerm_subnet.ci.id
    }
  }

  # Auto-repair acts on instances reported unhealthy.
  # It needs a health signal (health_probe_id or the Application Health extension) to work.
  automatic_instance_repair {
    enabled      = true
    grace_period = "PT10M"
  }

  tags = {
    Role     = "ci-burst"
    Priority = "Spot"
  }
}
```
The cost math for this hybrid pattern:
```text
All on-demand (10 x D4s_v5):
  10 x ~$140.16/month               = ~$1,401/month

Hybrid (2 baseline + 8 Spot):
  Baseline: 2 x D2s_v5 x $70/month  = $140
  Spot:     8 x D4s_v5 x ~$27/month = $216
  Total:                              $356/month

Monthly savings: $1,045
Annual savings:  $12,540 💰
```
You keep 2 guaranteed runners that are always available for critical builds. The 8 Spot runners handle parallel workloads at ~81% discount. If Spot capacity drops, you lose burst capacity but never lose your baseline. Builds queue up slightly but never stop completely.
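The arithmetic above can be reproduced with a small shell helper. This is only a sketch: it assumes 730 billable hours per month (the figure the monthly prices in this article imply) and uses the approximate hourly rates quoted earlier. The helper name monthly_cost is hypothetical, and real Spot prices vary by region and over time.

```shell
#!/usr/bin/env bash
# Hypothetical helper: monthly cost for COUNT VMs at RATE dollars per hour.
# Assumes 730 billable hours per month.
monthly_cost() {
  awk -v n="$1" -v rate="$2" 'BEGIN { printf "%.2f\n", n * rate * 730 }'
}

all_on_demand=$(monthly_cost 10 0.192)   # 10 x D4s_v5, pay-as-you-go
baseline=$(monthly_cost 2 0.096)         # 2 x D2s_v5, pay-as-you-go
burst=$(monthly_cost 8 0.037)            # 8 x D4s_v5, approximate Spot rate

hybrid=$(awk -v a="$baseline" -v b="$burst" 'BEGIN { printf "%.2f\n", a + b }')
savings=$(awk -v a="$all_on_demand" -v b="$hybrid" 'BEGIN { printf "%.2f\n", a - b }')

echo "All on-demand: \$${all_on_demand}/month"
echo "Hybrid:        \$${hybrid}/month"
echo "Savings:       \$${savings}/month"
```

Because it works from hourly rates rather than rounded monthly prices, the output lands within a dollar of the figures above (roughly $356 hybrid and $1,045 saved per month). Swap in your own counts and rates to model a different baseline-to-burst split.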
🔍 Step 4: AKS Spot Node Pools
Running Kubernetes? Add a Spot node pool for non-critical workloads:
```hcl
resource "azurerm_kubernetes_cluster_node_pool" "spot" {
  name                  = "spotpool"
  kubernetes_cluster_id = azurerm_kubernetes_cluster.main.id
  vm_size               = "Standard_D4s_v5"
  node_count            = 3

  priority        = "Spot"
  eviction_policy = "Delete"
  spot_max_price  = -1

  # Ephemeral OS disks need a VM size with enough local cache or temp storage
  # for os_disk_size_gb; check your SKU before switching this to "Ephemeral".
  os_disk_type    = "Managed"
  os_disk_size_gb = 128

  auto_scaling_enabled = true
  min_count            = 1
  max_count            = 10

  node_labels = {
    "kubernetes.azure.com/scalesetpriority" = "spot"
  }

  node_taints = [
    "kubernetes.azure.com/scalesetpriority=spot:NoSchedule"
  ]

  tags = {
    Priority = "Spot"
  }
}
```
Use Kubernetes tolerations and node selectors in your deployments to schedule fault-tolerant workloads on Spot nodes while keeping critical services on the regular node pool. The NoSchedule taint prevents pods from accidentally landing on Spot nodes unless they explicitly opt in.
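As an illustration of that opt-in, the sketch below adds a matching toleration and node selector to an existing Deployment with kubectl patch. The helper names and the Deployment name batch-worker are hypothetical; the key, value, and effect mirror the taint and label set on the node pool.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print a JSON merge patch that lets a pod template
# tolerate the Spot taint and pins it to nodes carrying the Spot label.
spot_patch() {
  cat <<'EOF'
{"spec":{"template":{"spec":{
  "tolerations":[{"key":"kubernetes.azure.com/scalesetpriority","operator":"Equal","value":"spot","effect":"NoSchedule"}],
  "nodeSelector":{"kubernetes.azure.com/scalesetpriority":"spot"}
}}}}
EOF
}

# Hypothetical helper: apply the patch to the named Deployment.
# Note: a merge patch replaces any tolerations already set on the Deployment.
schedule_on_spot() {
  kubectl patch deployment "$1" --type merge -p "$(spot_patch)"
}

# Usage (not run here): schedule_on_spot batch-worker
```

The node selector forces the workload onto Spot nodes only. Drop it from the patch if the pods should be allowed on Spot nodes but still able to fall back to the regular pool.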
⚡ Quick Audit: Check Your Spot VM Opportunities
```shell
# List all VMs and their priority (Regular vs Spot)
az vm list -d \
  --query "[].{Name:name, Size:hardwareProfile.vmSize, Priority:priority, State:powerState, RG:resourceGroup}" \
  --output table

# Check current Spot prices for a specific SKU (public Azure Retail Prices API, no login needed)
curl -sG "https://prices.azure.com/api/retail/prices" \
  --data-urlencode "\$filter=armRegionName eq 'eastus' and armSkuName eq 'Standard_D4s_v5' and priceType eq 'Consumption' and contains(meterName, 'Spot')"
# For eviction rates: Azure Portal > Create VM > Spot > View pricing history

# Find VMs tagged as dev/test that could be Spot candidates
az vm list \
  --query "[?tags.Environment=='dev' || tags.Environment=='test'].{Name:name, Size:hardwareProfile.vmSize, Priority:priority}" \
  --output table
```
Every VM in that last output that shows priority: null (Regular) is a candidate for Spot. Dev and test VMs almost always qualify. 🔥
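To turn that list into a number, a sketch like the following counts the candidates and applies a rough per-VM saving. Both helper names are hypothetical, the query relies on the same priority and tags fields used above, and the $100 figure is only a ballpark that depends on SKU and region.

```shell
#!/usr/bin/env bash
# Hypothetical helper: count dev/test VMs that are not already Spot.
spot_candidate_count() {
  az vm list \
    --query "length([?priority!='Spot' && (tags.Environment=='dev' || tags.Environment=='test')])" \
    --output tsv
}

# Hypothetical helper: rough monthly savings, assuming ~$100 saved per converted VM.
estimated_monthly_savings() {
  echo $(( $1 * 100 ))
}

# Usage (requires a logged-in Azure CLI):
#   count=$(spot_candidate_count)
#   echo "$count candidates, ~\$$(estimated_monthly_savings "$count")/month"
```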
💡 Architect Pro Tips
Set max_bid_price = -1 for maximum uptime. This means "pay up to the regular PAYG rate." You only get evicted for capacity reasons, not price spikes. In practice, Spot prices are quite stable for most SKUs. Setting a specific price cap increases your eviction risk significantly.
Use eviction_policy = "Delete" for stateless workloads. Deallocated VMs still incur disk storage costs and count against your quota. For CI runners, batch workers, and other stateless jobs, Delete is cleaner and cheaper.
Diversify VM sizes to increase availability. Instead of requesting 10 Spot D4s_v5 VMs, consider accepting D4s_v5, D4as_v5, and D4ds_v5. Different SKUs pull from different capacity pools, reducing your eviction risk.
B-series VMs don't support Spot pricing. If you're already running cheap B-series for dev, you can't make them cheaper with Spot. Spot is most impactful for D-series and larger SKUs where the discount is substantial.
Start small: 10-20% of your workload on Spot. Monitor eviction rates for 2-4 weeks, then scale up. This gives you real data on how often evictions happen for your chosen SKU and region.
Use Azure Scheduled Events for graceful shutdown. Your applications can poll the Instance Metadata Service (IMDS) endpoint to detect pending evictions and save state before the 30-second deadline.
Check eviction rates before committing. In the Azure Portal, go to Create VM, select Spot, then click "View pricing history and compare prices in nearby regions." This shows historical eviction rates and pricing trends. Pick SKUs and regions with lower eviction rates.
Combine Spot with Azure Hybrid Benefit. If you're running Windows VMs, you still save on the Spot discount AND the licensing cost. The two benefits stack.
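The Scheduled Events tip above can be sketched as a small watcher that runs on the VM itself. This is a minimal example under stated assumptions: it treats a Preempt event as the Spot eviction signal, polls once per second, and calls whatever command you pass in. The function names and the drain-runner.sh path are hypothetical.

```shell
#!/usr/bin/env bash
# Azure Instance Metadata Service endpoint for Scheduled Events
# (link-local address, reachable only from inside the VM).
IMDS_URL="http://169.254.169.254/metadata/scheduledevents?api-version=2020-07-01"

# Succeeds if the Scheduled Events JSON on stdin contains a Preempt event,
# the event type Azure uses for Spot evictions.
has_preempt_event() {
  grep -Eq '"EventType"[[:space:]]*:[[:space:]]*"Preempt"'
}

# Hypothetical loop: poll IMDS and run the given command once an eviction is scheduled.
watch_for_eviction() {
  while true; do
    if curl -s -H "Metadata:true" --noproxy "*" "$IMDS_URL" | has_preempt_event; then
      "$@"     # e.g. drain the runner or flush a checkpoint
      return 0
    fi
    sleep 1
  done
}

# Usage (not run here): watch_for_eviction /usr/local/bin/drain-runner.sh
```

Keep the shutdown command short. It has to finish within the eviction notice window, so long-running state saves are better done as periodic checkpoints.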
📊 TL;DR
| Workload | Architecture | On-Demand Cost | With Spot | Annual Savings |
|---|---|---|---|---|
| 10 CI runners | All Spot VMSS | $1,401/mo | $270/mo | $13,572 |
| 10 CI runners | 2 baseline + 8 Spot | $1,401/mo | $356/mo | $12,540 |
| 5 batch workers | All Spot | $701/mo | $135/mo | $6,792 |
| 3 AKS Spot nodes | Spot node pool | $420/mo | $81/mo | $4,068 |
| Dev environment | Single Spot VMs | $280/mo | $54/mo | $2,712 |
Bottom line: Spot VMs are the single largest discount available on Azure compute: up to 90% off. The only cost is accepting the possibility of eviction. For any workload that can retry, checkpoint, or degrade gracefully, there's no reason to pay full price. Build a hybrid architecture with on-demand baseline and Spot burst capacity, and you get reliability with massive savings. 🎯
Go check your dev/test VMs right now. Count how many are running as Regular priority when they could be Spot. Multiply the count by ~$100/month savings per VM. That's money you're leaving on the table every single month. 😀
This is Part 8 of the "Save on Azure with Terraform" series. Next up: FinOps as Code 🧮, building a complete cost governance framework with Terraform and Azure Policy.