Right-Size or Pay the Price 📏 VM SKU Optimization Strategy with Terraform

#terraform #devops #cloud #azure

That D8s_v5 running at 12% CPU is costing you 4x what you need. Here's how to use Azure Advisor data, build a workload-to-SKU mapping module in Terraform, and stop over-provisioning VMs across every environment.

An audit of 40 Azure VMs across three environments reveals: average CPU utilization is 11%, average memory usage is 23%. Half the fleet is running D4s_v5 (4 vCPU, 16 GB) when a B2s (2 vCPU, 4 GB) would handle the workload fine. The overspend: $2,100/month. Annual waste: $25,200 - from just 40 VMs. 📏

Here's the pricing reality for common Azure VM sizes (Linux, East US, pay-as-you-go):

Standard_D8s_v5   8 vCPU, 32 GB   $0.384/hr   $280/month
Standard_D4s_v5   4 vCPU, 16 GB   $0.192/hr   $140/month
Standard_D2s_v5   2 vCPU,  8 GB   $0.096/hr   $70/month
Standard_B2s      2 vCPU,  4 GB   $0.042/hr   $31/month
Standard_B2ms     2 vCPU,  8 GB   $0.083/hr   $61/month

A single VM running D4s_v5 at 12% CPU wastes roughly $79/month compared to a B2ms that could handle the same load (about $109/month if a B2s is enough). Multiply that across a fleet and you're looking at serious money.
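The monthly figures in the table are the hourly rate multiplied by roughly 730 hours per month. Here is a minimal shell sketch of that arithmetic, using rates from the table above. The helper names `monthly_cost` and `monthly_waste` are hypothetical, and `awk` is assumed to be available.

```shell
# Hours in an average month (8,760 hours per year / 12).
HOURS_PER_MONTH=730

# monthly_cost HOURLY_RATE -> monthly cost rounded to whole dollars
monthly_cost() {
  awk -v rate="$1" -v hours="$HOURS_PER_MONTH" \
    'BEGIN { printf "%.0f\n", rate * hours }'
}

# monthly_waste CURRENT_RATE TARGET_RATE -> difference of the rounded monthly costs
monthly_waste() {
  echo $(( $(monthly_cost "$1") - $(monthly_cost "$2") ))
}

monthly_cost 0.192          # D4s_v5          -> 140
monthly_waste 0.192 0.083   # D4s_v5 -> B2ms  -> 79
monthly_waste 0.192 0.042   # D4s_v5 -> B2s   -> 109
```

Working from hourly rates keeps the comparison honest when you add SKUs that are not in the table.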

The problem? Most teams pick a VM size during initial deployment and never revisit it. "It works, don't touch it." Meanwhile, the workload settled at 10% CPU months ago and nobody noticed. Let's fix that with data and automation. ⚡

🎯 Step 1: Find the Over-Provisioned VMs

Azure Advisor: Your Free Right-Sizing Scout

Azure Advisor evaluates VM utilization over a 7-day lookback window by default (configurable) and flags underutilized VMs. The default threshold is 5% CPU, but you can raise it to 10%, 15%, or 20%.

# Get Advisor right-sizing recommendations
az advisor recommendation list \
  --category Cost \
  --query "[?shortDescription.problem=='Right-size or shutdown underutilized virtual machines'].{
    VM:resourceMetadata.resourceId,
    CurrentSKU:extendedProperties.currentSku,
    RecommendedSKU:extendedProperties.targetSku,
    AnnualSavings:extendedProperties.annualSavingsAmount
  }" --output table

This gives you an instant hit list of VMs that Advisor has identified as oversized, along with specific SKU recommendations and dollar savings.

Azure Monitor: The 30-Day Deep Dive

Advisor only looks at 7 days by default. For a more accurate picture, check 30+ days of metrics to account for weekly cycles and monthly peaks:

# Check average CPU over last 30 days for a specific VM
# (uses GNU date; on macOS/BSD replace -d '30 days ago' with -v-30d)
az monitor metrics list \
  --resource "/subscriptions/<SUB_ID>/resourceGroups/<RG>/providers/Microsoft.Compute/virtualMachines/<VM_NAME>" \
  --metric "Percentage CPU" \
  --interval PT1H \
  --start-time $(date -u -d '30 days ago' +%Y-%m-%dT%H:%M:%SZ) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%SZ) \
  --aggregation Average Maximum \
  --query "value[0].timeseries[0].data[].{Time:timeStamp, AvgCPU:average, MaxCPU:maximum}" \
  --output table

# Quick scan: find all VMs and their sizes
az vm list \
  --query "[].{Name:name, Size:hardwareProfile.vmSize, RG:resourceGroup}" \
  --output table
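The hourly rows from the 30-day metrics query are easier to act on once they are reduced to two numbers: the mean of the hourly averages and the highest hourly maximum. This is a minimal sketch, assuming the query is changed to emit two tab-separated columns (`--query "value[0].timeseries[0].data[].[average, maximum]" --output tsv`). `summarize_cpu` is a hypothetical helper name.

```shell
# summarize_cpu: reads "average<TAB>maximum" rows on stdin and prints the
# mean of the averages and the highest maximum, e.g. "avg=11.0 peak=63.0".
summarize_cpu() {
  awk '
    # Skip rows where either column is empty or non-numeric (missing samples).
    $1 !~ /^[0-9.]+$/ || $2 !~ /^[0-9.]+$/ { next }
    {
      sum += $1
      n++
      if ($2 + 0 > peak + 0) peak = $2 + 0
    }
    END {
      if (n == 0) { print "no data"; exit 1 }
      printf "avg=%.1f peak=%.1f\n", sum / n, peak
    }
  '
}
```

Usage: pipe the TSV output of `az monitor metrics list` into `summarize_cpu`. The two numbers it prints are the Avg CPU and Peak CPU inputs for a right-sizing decision.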

The right-sizing decision matrix:

Avg CPU	Peak CPU	Avg Memory	Action
<5%	<20%	<20%	Downsize 2 tiers or switch to B-series
5-15%	<40%	<40%	Downsize 1 tier
15-40%	<70%	<70%	Likely right-sized, monitor
>40%	>80%	>80%	Consider upsizing
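The matrix can be applied mechanically once you have the three percentages for a VM. The sketch below is one possible reading of it, not a definitive rule set. It assumes a VM qualifies for a downsize row only when all three metrics are under that row's limits, and that any single metric above the last row's limits is enough to suggest upsizing. `classify_vm` is a hypothetical helper name.

```shell
# classify_vm AVG_CPU PEAK_CPU AVG_MEM   (all values are percentages)
classify_vm() {
  awk -v cpu="$1" -v peak="$2" -v mem="$3" 'BEGIN {
    if (cpu > 40 || peak > 80 || mem > 80)
      print "consider upsizing"
    else if (cpu < 5 && peak < 20 && mem < 20)
      print "downsize 2 tiers or switch to B-series"
    else if (cpu < 15 && peak < 40 && mem < 40)
      print "downsize 1 tier"
    else
      print "likely right-sized, monitor"
  }'
}
```

For example, `classify_vm 11 35 23` prints `downsize 1 tier`. With a 60% peak instead (`classify_vm 11 60 23`) it prints `likely right-sized, monitor`, because the peak no longer fits the downsize row.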

🏗️ Step 2: The Workload-to-SKU Mapping Module

Instead of letting every team pick VM sizes ad-hoc, build a Terraform module that maps workload types and environments to pre-approved, cost-optimized SKUs:

# modules/vm-rightsized/variables.tf

variable "workload_type" {
  type        = string
  description = "Type of workload: web, api, worker, database, ci_runner, monitoring"

  validation {
    condition = contains(
      ["web", "api", "worker", "database", "ci_runner", "monitoring"],
      var.workload_type
    )
    error_message = "Must be: web, api, worker, database, ci_runner, or monitoring."
  }
}

variable "environment" {
  type        = string
  description = "Environment: dev, staging, prod"

  validation {
    condition     = contains(["dev", "staging", "prod"], var.environment)
    error_message = "Must be: dev, staging, or prod."
  }
}

variable "size_override" {
  type        = string
  default     = null
  description = "Override the auto-selected SKU. Requires justification tag."
}

variable "name" { type = string }
variable "resource_group_name" { type = string }
variable "location" { type = string }
variable "subnet_id" { type = string }
# Blocks with more than one argument must be written on multiple lines
variable "admin_username" {
  type    = string
  default = "azureadmin"
}
variable "admin_ssh_key" { type = string }
variable "tags" {
  type    = map(string)
  default = {}
}

# modules/vm-rightsized/sku_map.tf

locals {
  # ────────────────────────────────────────────
  # The SKU Decision Matrix
  # Maps workload type + environment to optimal VM size
  # ────────────────────────────────────────────
  sku_map = {
    web = {
      dev     = "Standard_B2s"      # 2 vCPU,  4 GB - $31/mo
      staging = "Standard_B2ms"     # 2 vCPU,  8 GB - $61/mo
      prod    = "Standard_D2s_v5"   # 2 vCPU,  8 GB - $70/mo
    }
    api = {
      dev     = "Standard_B2ms"     # 2 vCPU,  8 GB - $61/mo
      staging = "Standard_D2s_v5"   # 2 vCPU,  8 GB - $70/mo
      prod    = "Standard_D4s_v5"   # 4 vCPU, 16 GB - $140/mo
    }
    worker = {
      dev     = "Standard_B2s"      # 2 vCPU,  4 GB - $31/mo
      staging = "Standard_D2s_v5"   # 2 vCPU,  8 GB - $70/mo
      prod    = "Standard_D4s_v5"   # 4 vCPU, 16 GB - $140/mo
    }
    database = {
      dev     = "Standard_B2ms"     # 2 vCPU,  8 GB - $61/mo
      staging = "Standard_E2s_v5"   # 2 vCPU, 16 GB - $92/mo ($0.126/hr x 730)
      prod    = "Standard_E4s_v5"   # 4 vCPU, 32 GB - $184/mo ($0.252/hr x 730)
    }
    ci_runner = {
      dev     = "Standard_B2s"      # 2 vCPU,  4 GB - $31/mo
      staging = "Standard_B2s"      # 2 vCPU,  4 GB - $31/mo
      prod    = "Standard_F4s_v2"   # 4 vCPU,  8 GB - $123/mo ($0.169/hr x 730)
    }
    monitoring = {
      dev     = "Standard_B2s"      # 2 vCPU,  4 GB - $31/mo
      staging = "Standard_B2s"      # 2 vCPU,  4 GB - $31/mo
      prod    = "Standard_D2s_v5"   # 2 vCPU,  8 GB - $70/mo
    }
  }

  # Override or use the mapped SKU
  selected_sku = coalesce(var.size_override, local.sku_map[var.workload_type][var.environment])
}

# modules/vm-rightsized/main.tf

resource "azurerm_network_interface" "this" {
  name                = "${var.name}-nic"
  location            = var.location
  resource_group_name = var.resource_group_name

  ip_configuration {
    name                          = "internal"
    subnet_id                     = var.subnet_id
    private_ip_address_allocation = "Dynamic"
  }
  tags = var.tags
}

resource "azurerm_linux_virtual_machine" "this" {
  name                = var.name
  resource_group_name = var.resource_group_name
  location            = var.location
  size                = local.selected_sku
  admin_username      = var.admin_username

  network_interface_ids = [azurerm_network_interface.this.id]

  admin_ssh_key {
    username   = var.admin_username
    public_key = var.admin_ssh_key
  }

  os_disk {
    caching              = "ReadWrite"
    storage_account_type = var.environment == "prod" ? "Premium_LRS" : "StandardSSD_LRS"
  }

  source_image_reference {
    publisher = "Canonical"
    offer     = "0001-com-ubuntu-server-jammy"
    sku       = "22_04-lts-gen2"
    version   = "latest"
  }

  tags = merge(var.tags, {
    WorkloadType = var.workload_type
    Environment  = var.environment
    VMSize       = local.selected_sku
    SizeOverride = var.size_override != null ? "true" : "false"
    ManagedBy    = "terraform"
  })
}

output "vm_id" { value = azurerm_linux_virtual_machine.this.id }
output "selected_sku" { value = local.selected_sku }
output "private_ip" { value = azurerm_network_interface.this.private_ip_address }

Usage:

# Dev API server: automatically gets B2ms ($61/mo)
module "api_dev" {
  source              = "./modules/vm-rightsized"
  name                = "vm-api-dev-01"
  resource_group_name = azurerm_resource_group.dev.name
  location            = azurerm_resource_group.dev.location
  subnet_id           = azurerm_subnet.dev.id
  admin_ssh_key       = file("~/.ssh/id_rsa.pub")
  workload_type       = "api"
  environment         = "dev"
  tags                = { CostCenter = "CC-1042", Team = "backend" }
}

# Prod database: automatically gets E4s_v5 ($184/mo at $0.252/hr)
module "db_prod" {
  source              = "./modules/vm-rightsized"
  name                = "vm-db-prod-01"
  resource_group_name = azurerm_resource_group.prod.name
  location            = azurerm_resource_group.prod.location
  subnet_id           = azurerm_subnet.prod.id
  admin_ssh_key       = file("~/.ssh/id_rsa.pub")
  workload_type       = "database"
  environment         = "prod"
  tags                = { CostCenter = "CC-1042", Team = "data" }
}

# Need a bigger SKU? Override with justification
module "api_prod_heavy" {
  source              = "./modules/vm-rightsized"
  name                = "vm-api-prod-02"
  resource_group_name = azurerm_resource_group.prod.name
  location            = azurerm_resource_group.prod.location
  subnet_id           = azurerm_subnet.prod.id
  admin_ssh_key       = file("~/.ssh/id_rsa.pub")
  workload_type       = "api"
  environment         = "prod"
  size_override       = "Standard_D8s_v5"  # Override visible in tags
  tags                = {
    CostCenter       = "CC-1042"
    Team             = "backend"
    OverrideReason   = "Black Friday traffic spike handling"
  }
}

What this gives you:

Consistent sizing across environments, no more "I picked D8s_v5 because it was the default"
Dev/staging always cheaper than prod by design
Override path with visibility (tagged as SizeOverride = true for audit)
B-series for dev saves roughly 56-78% compared to a D4s_v5 (B2ms at $61 or B2s at $31 vs. $140)
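Because the module writes `SizeOverride` as a tag, the override audit can be a single CLI query. This is a minimal sketch, assuming overrides carry the `OverrideReason` tag used in the usage example above. `list_size_overrides` is a hypothetical helper name.

```shell
# list_size_overrides: show every VM deployed with a size override,
# together with the justification tag (blank when nobody supplied one).
list_size_overrides() {
  az vm list \
    --query "[?tags.SizeOverride=='true'].{Name:name, Size:hardwareProfile.vmSize, RG:resourceGroup, Reason:tags.OverrideReason}" \
    --output table
}
```

Rows with an empty Reason column are the ones to follow up on first.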

🔍 Step 3: Azure Policy Guard Rails

Prevent expensive SKUs in non-production subscriptions with Azure Policy via Terraform:

# Deny large VM SKUs in dev/staging subscriptions
resource "azurerm_subscription_policy_assignment" "restrict_vm_sizes" {
  name                 = "restrict-vm-sizes-nonprod"
  subscription_id      = data.azurerm_subscription.nonprod.id
  policy_definition_id = "/providers/Microsoft.Authorization/policyDefinitions/cccc23c7-8427-4f53-ad12-b6a63eb452b3"
  display_name         = "Restrict VM sizes in non-production"

  parameters = jsonencode({
    listOfAllowedSKUs = {
      value = [
        "Standard_B2s",
        "Standard_B2ms",
        "Standard_B4ms",
        "Standard_D2s_v5",
        "Standard_D2as_v5",
        "Standard_E2s_v5"
      ]
    }
  })
}

Now if someone tries to deploy a D16s_v5 in the dev subscription, Azure blocks it at the ARM layer before Terraform even finishes applying. 🚫

⚡ Quick Audit: Find Your Biggest Savings Right Now

# List all VMs sorted by SKU name, so identical sizes group together (alphabetical, not by cost)
az vm list -d \
  --query "sort_by([].{
    Name:name,
    Size:hardwareProfile.vmSize,
    RG:resourceGroup,
    State:powerState
  }, &Size)" \
  --output table

# Get Advisor cost recommendations with savings amounts
az advisor recommendation list \
  --category Cost \
  --query "[?contains(shortDescription.problem, 'underutilized')].{
    Resource:resourceMetadata.resourceId,
    Solution:shortDescription.solution,
    Savings:extendedProperties.annualSavingsAmount
  }" --output table

Run these, sort by savings amount, and start with the top 5. That's your quick win list. 🎯
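Sorting on the `Size` field orders VMs alphabetically by SKU name, which is not the same as ordering by cost. To rank by spend, join the VM list against a price table. This is a minimal sketch, using the approximate monthly prices from the pricing table earlier in this article. `price_of` and `rank_by_price` are hypothetical helper names, and SKUs missing from the lookup are priced at 0 so they sink to the bottom.

```shell
# price_of SIZE -> approximate monthly USD (Linux, East US, pay-as-you-go)
price_of() {
  case "$1" in
    Standard_D8s_v5) echo 280 ;;
    Standard_D4s_v5) echo 140 ;;
    Standard_D2s_v5) echo 70 ;;
    Standard_B2ms)   echo 61 ;;
    Standard_B2s)    echo 31 ;;
    *)               echo 0 ;;   # unknown SKU: extend this table for your fleet
  esac
}

# rank_by_price: reads "NAME<TAB>SIZE" rows on stdin and prints
# "PRICE<TAB>NAME<TAB>SIZE", most expensive VM first.
rank_by_price() {
  while IFS="$(printf '\t')" read -r name size; do
    [ -n "$name" ] || continue
    printf '%s\t%s\t%s\n' "$(price_of "$size")" "$name" "$size"
  done | sort -t "$(printf '\t')" -k1,1nr
}
```

Usage: `az vm list --query "[].[name, hardwareProfile.vmSize]" --output tsv | rank_by_price`.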

💡 Architect Pro Tips

B-series is your dev/staging workhorse. B-series VMs accumulate CPU credits when idle and burst when needed. A B2ms at $61/month handles most dev workloads that would otherwise run on a D4s_v5 at $140/month. That's a 56% savings per VM.
D-series for steady production, E-series for memory-hungry workloads. D-series gives you 4 GB per vCPU (balanced). E-series gives you 8 GB per vCPU (memory-optimized). Running a database on D-series when it needs memory? You're paying for extra CPU you don't use. Switch to E-series: fewer vCPUs, more RAM, often cheaper for memory workloads.
F-series for compute-only work. CI/CD runners, batch processing, and compute-heavy tasks. F-series gives you a higher CPU-to-memory ratio at a lower price than D-series when memory isn't a concern.
AMD (Das/Eas) variants are 5-10% cheaper. The a in D2as_v5 means AMD processor. For most general-purpose workloads, AMD and Intel perform comparably, so the AMD variants are easy savings if you're not locked to Intel. Benchmark first if your workload depends on Intel-specific instruction sets or licensing.
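Resizing always means a restart. If the target size is available on the hardware cluster currently hosting the VM, which is common within the same family (D4s_v5 to D2s_v5), the resize is a quick reboot. If it is not, which is more likely for cross-family changes (D-series to B-series), you have to deallocate the VM, resize it, and start it again. Plan a maintenance window either way.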
Tag your overrides. The module above tags VMs with SizeOverride = true when someone overrides the recommended SKU. This makes it easy to audit which VMs are running larger than recommended and why.
Review quarterly. Workloads change. A VM that needed D4s_v5 six months ago might be running at 8% CPU today after an optimization. Make right-sizing a quarterly process, not a one-time event.
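For the resizing tip above, you can check ahead of time whether a target size is offered for a VM where it currently runs. This is a minimal sketch, assuming `az vm list-vm-resize-options` lists the sizes Azure currently offers for that VM. `can_resize_in_place` is a hypothetical helper name.

```shell
# can_resize_in_place RESOURCE_GROUP VM_NAME TARGET_SIZE
# Exits 0 when TARGET_SIZE appears in the VM's current resize options.
can_resize_in_place() {
  az vm list-vm-resize-options \
    --resource-group "$1" --name "$2" \
    --query "[].name" --output tsv | grep -qxF "$3"
}
```

Usage: `can_resize_in_place my-rg vm-api-dev-01 Standard_D2s_v5 && echo "available in place"`. If the check fails, expect to deallocate the VM before resizing. For VMs managed by the module, change the SKU through Terraform rather than the CLI to avoid drift.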

📊 TL;DR

Workload	Dev SKU	Prod SKU	Dev Monthly	Prod Monthly
Web server	B2s	D2s_v5	$31	$70
API server	B2ms	D4s_v5	$61	$140
Worker	B2s	D4s_v5	$31	$140
Database	B2ms	E4s_v5	$61	$184
CI runner	B2s	F4s_v2	$31	$123
Monitoring	B2s	D2s_v5	$31	$70

Before right-sizing (all D4s_v5 across 6 VMs per env):
Dev: 6 x $140 = $840/month | Prod: 6 x $140 = $840/month

After right-sizing (mapped SKUs):
Dev: $246/month | Prod: $727/month (E4s_v5 at $0.252/hr and F4s_v2 at $0.169/hr, 730 hours/month)

Dev savings: $594/month = $7,128/year from just 6 VMs. Prod stays appropriately sized for each workload type, with databases getting memory-optimized E-series instead of overpaying for CPU on D-series. 💰

Run the Advisor audit command. Find your VMs running at <15% CPU. Calculate the savings if you dropped them one tier. That number will get your manager's attention. 😀

This is Part 7 of the "Save on Azure with Terraform" series. Next up: Spot the Savings 🎯. Running non-critical workloads on Azure Spot VMs with up to 90% savings. 💬