Suhas Mallesh

Posted on Feb 19

Lights Out! 🌙 Your Dev VMs Run Full Power 24/7 But Your Devs Work 8 Hours - Scale Down and Stop Overpaying

#azure #devops #terraform #cloud

Dev and test VMs running at full capacity around the clock cost you more than twice what they should. Here's how to auto-scale Azure VMs to minimum during off hours with Terraform - saving 50-65% without blocking anyone.

A 10-person dev team. 10 VMs running Standard_D4s_v5 at full power. All 24/7. Monthly cost: $1,401. What if they auto-scaled to Standard_B2s during off hours? Monthly cost: $636. That's $765/month saved and late-night devs can still work 🌙

Here's the math:

Standard_D4s_v5 (4 vCPU, 16 GB) = $0.192/hour  (business hours SKU)
Standard_B2s    (2 vCPU, 4 GB)   = $0.042/hour  (off-hours minimum SKU)

Business hours (10hrs x 22 weekdays):  $0.192 x 220 = $42.24/VM/month
Off hours (remaining 510 hours):       $0.042 x 510 = $21.42/VM/month
Total per VM:                          $63.66/month

vs. 24/7 at full power:               $0.192 x 730 = $140.16/VM/month

10 VMs: $636 vs $1,401 = $765/month saved
Annual savings: $9,180 🤯
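
If you want to plug in your own rates and hours, the same arithmetic fits in a few lines of Python. This is only a sketch: the function name and the 730-hour month are assumptions that mirror the figures above, and the hourly rates are the ones quoted in this post, not live prices.

```python
HOURS_PER_MONTH = 730  # average month length used in the figures above


def monthly_cost_per_vm(full_rate, min_rate, business_hours):
    """Cost of one VM billed at full_rate for business_hours and min_rate for the rest."""
    off_hours = HOURS_PER_MONTH - business_hours
    return full_rate * business_hours + min_rate * off_hours


always_on = 0.192 * HOURS_PER_MONTH
scheduled = monthly_cost_per_vm(0.192, 0.042, business_hours=10 * 22)
saved = always_on - scheduled

print(f"24/7: ${always_on:.2f}  scheduled: ${scheduled:.2f}  saved: ${saved:.2f}")
print(f"10 VMs save about ${saved * 10:.0f}/month")
```

The schedules later in this post run from 8 AM to 7 PM, which is an 11-hour window. Passing business_hours=11 * 22 gives $66.96 per VM instead of $63.66, so the result barely moves.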

And the VMs never go offline. A developer working late or over a weekend still has access, just on a smaller instance. No tickets, no manual startups, no blocked work.

The key principle: never shut down to zero. Scale to minimum. Let's build it. ⚡

🎯 Why Scale Down Instead of Shut Down?

Full shutdown sounds great on paper. In practice, it creates problems:

❌ Full shutdown = Developer at 10 PM can't access their VM
❌ Full shutdown = Startup time of 2-5 minutes when VM restarts
❌ Full shutdown = Running processes, sessions, and state are lost
❌ Full shutdown = Teams in other timezones are blocked

✅ Scale to minimum = VM stays accessible 24/7
✅ Scale to minimum = No startup delay, it's already running
✅ Scale to minimum = VM comes back on its own after a quick reboot, no manual start needed (in-memory state is lost)
✅ Scale to minimum = Works for global teams across timezones

The trade-off? You still pay something during off hours. But a Standard_B2s at $0.042/hr is 78% cheaper than a Standard_D4s_v5 at $0.192/hr. For most dev/test workloads, that's the sweet spot.

🤖 Approach 1: Auto-Resize Individual VMs with Azure Automation

For standalone VMs (not in a Scale Set), use an Azure Automation Runbook that resizes VMs based on tags. The VM reboots briefly during resize, then comes back at the smaller size.

Step 1: Deploy the Automation Account

# schedules/vm-auto-resize/main.tf

terraform {
  required_providers {
    azurerm = {
      source  = "hashicorp/azurerm"
      version = "~> 4.0"
    }
  }
}

provider "azurerm" {
  features {}
}

data "azurerm_subscription" "current" {}

resource "azurerm_resource_group" "automation" {
  name     = "rg-vm-autoscale"
  location = "eastus"

  tags = {
    Environment = "shared"
    CostCenter  = "platform"
    Owner       = "team-platform"
    Project     = "cost-governance"
    ManagedBy   = "terraform"
  }
}

resource "azurerm_automation_account" "vm_scaler" {
  name                = "aa-vm-auto-resize"
  location            = azurerm_resource_group.automation.location
  resource_group_name = azurerm_resource_group.automation.name
  sku_name            = "Basic"

  identity {
    type = "SystemAssigned"
  }

  tags = azurerm_resource_group.automation.tags
}

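# Least-privilege: can read, resize, start, stop, and deallocate VMs
# Updating a VM can also need permissions on linked resources such as its NICs.
# If a resize fails with LinkedAuthorizationFailed, add the action named in the error.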
resource "azurerm_role_definition" "vm_resize_operator" {
  name        = "VM Resize Operator"
  scope       = data.azurerm_subscription.current.id
  description = "Can read, resize, start, stop, and deallocate VMs. Nothing else."

  permissions {
    actions = [
      "Microsoft.Compute/virtualMachines/read",
      "Microsoft.Compute/virtualMachines/write",
      "Microsoft.Compute/virtualMachines/start/action",
      "Microsoft.Compute/virtualMachines/powerOff/action",
      "Microsoft.Compute/virtualMachines/deallocate/action",
      "Microsoft.Compute/virtualMachines/instanceView/read",
      "Microsoft.Resources/subscriptions/resourceGroups/read",
    ]
    not_actions = []
  }

  assignable_scopes = [data.azurerm_subscription.current.id]
}

resource "azurerm_role_assignment" "automation_vm_resize" {
  scope              = data.azurerm_subscription.current.id
  role_definition_id = azurerm_role_definition.vm_resize_operator.role_definition_resource_id
  principal_id       = azurerm_automation_account.vm_scaler.identity[0].principal_id
}

Step 2: Deploy the Resize Runbook

This runbook reads two tags from each VM: ScaleUpSize (business hours SKU) and ScaleDownSize (off-hours minimum SKU). It only resizes VMs that carry all three tags: AutoSchedule, ScaleUpSize, and ScaleDownSize.

resource "azurerm_automation_runbook" "vm_resize" {
  name                    = "Resize-VMs-By-Tag"
  location                = azurerm_resource_group.automation.location
  resource_group_name     = azurerm_resource_group.automation.name
  automation_account_name = azurerm_automation_account.vm_scaler.name
  log_verbose             = false
  log_progress            = false
  runbook_type            = "PowerShell72"

  content = <<-POWERSHELL
    Param(
      [Parameter(Mandatory = $true)]
      [ValidateSet("ScaleUp", "ScaleDown")]
      [String] $Action
    )

    # Connect using Managed Identity
    Disable-AzContextAutosave -Scope Process
    $AzureContext = (Connect-AzAccount -Identity).context
    $AzureContext = Set-AzContext -SubscriptionName $AzureContext.Subscription -DefaultProfile $AzureContext

    # Find all VMs that carry all three opt-in tags (untagged VMs have no Tags object)
    $vms = Get-AzVM | Where-Object {
      $null -ne $_.Tags -and
      $_.Tags.ContainsKey("AutoSchedule") -and
      $_.Tags.ContainsKey("ScaleUpSize") -and
      $_.Tags.ContainsKey("ScaleDownSize")
    }

    Write-Output "Found $($vms.Count) VMs with AutoSchedule tags"

    foreach ($vm in $vms) {
      $vmName = $vm.Name
      $rgName = $vm.ResourceGroupName
      $currentSize = $vm.HardwareProfile.VmSize

      if ($Action -eq "ScaleDown") {
        $targetSize = $vm.Tags["ScaleDownSize"]
      } else {
        $targetSize = $vm.Tags["ScaleUpSize"]
      }

      if ($currentSize -eq $targetSize) {
        Write-Output "SKIP: $vmName is already $currentSize"
        continue
      }

      Write-Output "$Action : $vmName from $currentSize to $targetSize..."

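      try {
        # -ErrorAction Stop turns cmdlet errors into exceptions so the catch block runs
        $vm.HardwareProfile.VmSize = $targetSize
        Update-AzVM -ResourceGroupName $rgName -VM $vm -ErrorAction Stop
        Write-Output "SUCCESS: $vmName resized to $targetSize"
      }
      catch {
        Write-Output "WARN: In-place resize of $vmName failed - $($_.Exception.Message)"
        # The target size may not be available on the VM's current hardware,
        # so deallocate, resize, and start again
        try {
          Write-Output "Attempting stop-resize-start for $vmName..."
          Stop-AzVM -ResourceGroupName $rgName -Name $vmName -Force -ErrorAction Stop
          $vm.HardwareProfile.VmSize = $targetSize
          Update-AzVM -ResourceGroupName $rgName -VM $vm -ErrorAction Stop
          Start-AzVM -ResourceGroupName $rgName -Name $vmName -ErrorAction Stop
          Write-Output "SUCCESS: $vmName resized to $targetSize (with restart)"
        }
        catch {
          Write-Output "FAILED: Could not resize $vmName - $($_.Exception.Message)"
          # Best effort: make sure the VM is not left deallocated after a failed resize
          Start-AzVM -ResourceGroupName $rgName -Name $vmName -ErrorAction Continue
        }
      }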
    }

    Write-Output "Done. Processed $($vms.Count) VMs for $Action."
  POWERSHELL

  tags = azurerm_resource_group.automation.tags
}
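
The decision the runbook makes for each VM is easy to check in isolation. Here is a minimal Python sketch of that logic, assuming each VM is represented as a plain dict with "size" and "tags" keys (a hypothetical shape chosen for illustration, not the Az module's object model):

```python
REQUIRED_TAGS = ("AutoSchedule", "ScaleUpSize", "ScaleDownSize")


def plan_resize(vm, action):
    """Return the size a VM should move to, or None when nothing should change."""
    tags = vm.get("tags") or {}
    if not all(key in tags for key in REQUIRED_TAGS):
        return None  # VM has not opted in
    target = tags["ScaleDownSize"] if action == "ScaleDown" else tags["ScaleUpSize"]
    if vm["size"] == target:
        return None  # already at the target size, same as the SKIP branch
    return target


dev_api = {
    "size": "Standard_D4s_v5",
    "tags": {
        "AutoSchedule": "business-hours",
        "ScaleUpSize": "Standard_D4s_v5",
        "ScaleDownSize": "Standard_B2s",
    },
}
untagged = {"size": "Standard_D4s_v5", "tags": None}

print(plan_resize(dev_api, "ScaleDown"))
print(plan_resize(dev_api, "ScaleUp"))
print(plan_resize(untagged, "ScaleDown"))
```

The point of the sketch is that the runbook is idempotent: running ScaleDown twice in a row, or running it against a VM that never opted in, changes nothing.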

Step 3: Create the Schedules

# Scale UP at 8 AM on weekdays (full power for business hours)
resource "azurerm_automation_schedule" "scale_up" {
  name                    = "weekday-vm-scale-up"
  resource_group_name     = azurerm_resource_group.automation.name
  automation_account_name = azurerm_automation_account.vm_scaler.name
  frequency               = "Week"
  interval                = 1
  timezone                = "America/New_York"  # this resource expects IANA timezone names
  start_time              = "2026-02-19T08:00:00-05:00"  # must be in the future when you apply
  description             = "Scale up dev/test VMs to full power on weekday mornings"

  week_days = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday"]
}

# Scale DOWN at 7 PM on weekdays (minimum for off hours)
resource "azurerm_automation_schedule" "scale_down" {
  name                    = "weekday-vm-scale-down"
  resource_group_name     = azurerm_resource_group.automation.name
  automation_account_name = azurerm_automation_account.vm_scaler.name
  frequency               = "Week"
  interval                = 1
  timezone                = "America/New_York"
  start_time              = "2026-02-19T19:00:00-05:00"
  description             = "Scale down dev/test VMs to minimum for off hours"

  week_days = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday"]
}

# Scale DOWN on weekends too (stay at minimum all weekend)
resource "azurerm_automation_schedule" "weekend_scale_down" {
  name                    = "weekend-vm-scale-down"
  resource_group_name     = azurerm_resource_group.automation.name
  automation_account_name = azurerm_automation_account.vm_scaler.name
  frequency               = "Week"
  interval                = 1
  timezone                = "America/New_York"
  start_time              = "2026-02-21T08:00:00-05:00"
  description             = "Ensure VMs stay at minimum size on weekends"

  week_days = ["Saturday", "Sunday"]
}

# Link schedules to runbook
resource "azurerm_automation_job_schedule" "scale_up" {
  resource_group_name     = azurerm_resource_group.automation.name
  automation_account_name = azurerm_automation_account.vm_scaler.name
  schedule_name           = azurerm_automation_schedule.scale_up.name
  runbook_name            = azurerm_automation_runbook.vm_resize.name

  parameters = {
    action = "ScaleUp"
  }
}

resource "azurerm_automation_job_schedule" "scale_down" {
  resource_group_name     = azurerm_resource_group.automation.name
  automation_account_name = azurerm_automation_account.vm_scaler.name
  schedule_name           = azurerm_automation_schedule.scale_down.name
  runbook_name            = azurerm_automation_runbook.vm_resize.name

  parameters = {
    action = "ScaleDown"
  }
}

resource "azurerm_automation_job_schedule" "weekend_scale_down" {
  resource_group_name     = azurerm_resource_group.automation.name
  automation_account_name = azurerm_automation_account.vm_scaler.name
  schedule_name           = azurerm_automation_schedule.weekend_scale_down.name
  runbook_name            = azurerm_automation_runbook.vm_resize.name

  parameters = {
    action = "ScaleDown"
  }
}
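
Taken together, the three schedules mean a VM should be at full size on weekdays from 8 AM to 7 PM and at minimum size the rest of the time. The Python sketch below encodes that rule so you can sanity-check a timestamp against it. The function name is hypothetical, and it assumes the datetime you pass is already in the schedule's local time (Eastern in this post).

```python
from datetime import datetime

SCALE_UP_HOUR = 8     # weekday scale-up schedule
SCALE_DOWN_HOUR = 19  # weekday scale-down schedule


def expected_action(local_time):
    """Return the last action the schedules would have applied by local_time."""
    is_weekday = local_time.weekday() < 5  # Monday is 0, Sunday is 6
    if is_weekday and SCALE_UP_HOUR <= local_time.hour < SCALE_DOWN_HOUR:
        return "ScaleUp"
    return "ScaleDown"


print(expected_action(datetime(2026, 2, 19, 9, 0)))   # Thursday morning
print(expected_action(datetime(2026, 2, 19, 19, 0)))  # Thursday evening
print(expected_action(datetime(2026, 2, 21, 10, 0)))  # Saturday
```

A check like this is handy in a monitoring script: if a VM's current size disagrees with the expected action, a schedule probably failed to fire.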

Step 4: Tag Your VMs to Opt In

resource "azurerm_linux_virtual_machine" "dev_api" {
  name                = "vm-api-dev"
  resource_group_name = azurerm_resource_group.dev.name
  location            = azurerm_resource_group.dev.location
  size                = "Standard_D4s_v5"  # Daytime size
  # ... other config ...

  tags = {
    Environment   = "dev"
    CostCenter    = "CC-1042"
    Owner         = "team-backend"
    Project       = "api-platform"
    AutoSchedule  = "business-hours"
    ScaleUpSize   = "Standard_D4s_v5"   # Full power: 4 vCPU, 16 GB
    ScaleDownSize = "Standard_B2s"      # Minimum: 2 vCPU, 4 GB
    ManagedBy     = "terraform"
  }
}

At 7 PM: VM resizes from D4s_v5 to B2s (brief reboot, ~60 seconds). At 8 AM: VM resizes back to D4s_v5. A developer at midnight? Still has access on the B2s. 🌙

📈 Approach 2: VMSS Schedule-Based Autoscale Profiles

For workloads running on Virtual Machine Scale Sets, you don't resize individual VMs. Instead, you define autoscale profiles with different capacity settings for business hours vs off hours.

# schedules/vmss-autoscale/main.tf

resource "azurerm_monitor_autoscale_setting" "web_app" {
  name                = "autoscale-web-app"
  resource_group_name = azurerm_resource_group.app.name
  location            = azurerm_resource_group.app.location
  target_resource_id  = azurerm_linux_virtual_machine_scale_set.web_app.id
  enabled             = true

  # Business Hours Profile (weekdays 8 AM - 7 PM)
  profile {
    name = "business-hours"

    capacity {
      minimum = 3
      maximum = 10
      default = 3
    }

    recurrence {
      timezone = "Eastern Standard Time"
      days     = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday"]
      hours    = [8]
      minutes  = [0]
    }

    # Scale out when CPU > 70%
    rule {
      metric_trigger {
        metric_name        = "Percentage CPU"
        metric_resource_id = azurerm_linux_virtual_machine_scale_set.web_app.id
        time_grain         = "PT1M"
        statistic          = "Average"
        time_window        = "PT5M"
        time_aggregation   = "Average"
        operator           = "GreaterThan"
        threshold          = 70
      }
      scale_action {
        direction = "Increase"
        type      = "ChangeCount"
        value     = "1"
        cooldown  = "PT5M"
      }
    }

    # Scale in when CPU < 25%
    rule {
      metric_trigger {
        metric_name        = "Percentage CPU"
        metric_resource_id = azurerm_linux_virtual_machine_scale_set.web_app.id
        time_grain         = "PT1M"
        statistic          = "Average"
        time_window        = "PT5M"
        time_aggregation   = "Average"
        operator           = "LessThan"
        threshold          = 25
      }
      scale_action {
        direction = "Decrease"
        type      = "ChangeCount"
        value     = "1"
        cooldown  = "PT5M"
      }
    }
  }

  # Off Hours Profile (weekday evenings)
  profile {
    name = "off-hours-minimum"

    capacity {
      minimum = 1      # Never zero! At least 1 instance always running
      maximum = 3      # Cap maximum to prevent accidental scale-out
      default = 1      # Fallback count when metrics are unavailable; the scale-in rule does the actual reduction
    }

    recurrence {
      timezone = "Eastern Standard Time"
      days     = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday"]
      hours    = [19]
      minutes  = [0]
    }

    # Still allow scale-out if needed (late-night traffic spike)
    rule {
      metric_trigger {
        metric_name        = "Percentage CPU"
        metric_resource_id = azurerm_linux_virtual_machine_scale_set.web_app.id
        time_grain         = "PT1M"
        statistic          = "Average"
        time_window        = "PT5M"
        time_aggregation   = "Average"
        operator           = "GreaterThan"
        threshold          = 80
      }
      scale_action {
        direction = "Increase"
        type      = "ChangeCount"
        value     = "1"
        cooldown  = "PT10M"
      }
    }

    rule {
      metric_trigger {
        metric_name        = "Percentage CPU"
        metric_resource_id = azurerm_linux_virtual_machine_scale_set.web_app.id
        time_grain         = "PT1M"
        statistic          = "Average"
        time_window        = "PT5M"
        time_aggregation   = "Average"
        operator           = "LessThan"
        threshold          = 20
      }
      scale_action {
        direction = "Decrease"
        type      = "ChangeCount"
        value     = "1"
        cooldown  = "PT10M"
      }
    }
  }

  # Weekend Profile
  profile {
    name = "weekend-minimum"

    capacity {
      minimum = 1      # Never zero!
      maximum = 2
      default = 1
    }

    recurrence {
      timezone = "Eastern Standard Time"
      days     = ["Saturday", "Sunday"]
      hours    = [0]
      minutes  = [0]
    }
  }

  notification {
    email {
      send_to_subscription_administrator    = false
      send_to_subscription_co_administrator = false
      custom_emails                         = ["finops@company.com"]
    }
  }
}

Key design decisions:

Business hours: min 3, max 10 (full autoscaling)
Off hours: min 1, max 3 (reduced but never zero)
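Weekends: min 1, max 2 (skeleton crew). This profile has no scale rules, so add a scale-in rule if you want it to drop from 2 to 1 on its own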
Off-hours still allow scale-out if CPU spikes (for that late-night deployment) 🎯
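
One detail worth modelling is what happens to the instance count at the moment a profile takes over. As far as I understand autoscale, the count is pulled inside the new profile's min/max bounds, and any further change comes from that profile's rules. The Python sketch below illustrates only that bounding step. The dictionary and function are hypothetical and simply mirror the capacity blocks above.

```python
# (minimum, maximum) per profile, copied from the capacity blocks above
PROFILES = {
    "business-hours": (3, 10),
    "off-hours-minimum": (1, 3),
    "weekend-minimum": (1, 2),
}


def clamp_to_profile(current_count, profile_name):
    """Pull the current instance count inside the bounds of the given profile."""
    minimum, maximum = PROFILES[profile_name]
    return max(minimum, min(current_count, maximum))


# 8 instances at 7 PM on a weekday: capped at the off-hours maximum
print(clamp_to_profile(8, "off-hours-minimum"))
# 3 instances when the weekend starts: capped at the weekend maximum
print(clamp_to_profile(3, "weekend-minimum"))
# 1 instance on Monday at 8 AM: raised to the business-hours minimum
print(clamp_to_profile(1, "business-hours"))
```

Under this reading, getting from the off-hours maximum of 3 down to 1 depends on the CPU scale-in rule firing, which is why that profile keeps a Decrease rule.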

⚡ Quick Audit: Find VMs Running at Full Power Without Schedules

# Find ALL running non-prod VMs without AutoSchedule tag
az vm list --show-details \
  --query "[?powerState=='VM running' && tags.Environment!='prod' && tags.AutoSchedule==null].{
    Name:name,
    Size:hardwareProfile.vmSize,
    RG:resourceGroup,
    Environment:tags.Environment
  }" --output table

# Check which VMSS have autoscale configured
az monitor autoscale list \
  --query "[].{Name:name, Enabled:enabled, ResourceGroup:resourceGroup}" \
  --output table

Every non-prod VM in that first list running at full size 24/7 is wasting 50-65% of its compute budget. 🔥
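
If you would rather post-process the CLI output in a script (for a weekly report, say), the same filter can be applied to the JSON from az vm list --show-details --output json. This is a sketch under the assumption that each entry exposes name, powerState, tags, and hardwareProfile.vmSize, the same fields the query above uses. The sample data is made up for illustration.

```python
def unscheduled_non_prod(vms):
    """Return running non-prod VMs that have no AutoSchedule tag."""
    results = []
    for vm in vms:
        tags = vm.get("tags") or {}  # untagged VMs come back with tags set to null
        if vm.get("powerState") != "VM running":
            continue
        if tags.get("Environment") == "prod":
            continue
        if "AutoSchedule" in tags:
            continue
        results.append((vm["name"], vm["hardwareProfile"]["vmSize"]))
    return results


# Hypothetical sample shaped like the CLI's JSON output
sample = [
    {"name": "vm-api-dev", "powerState": "VM running",
     "tags": {"Environment": "dev", "AutoSchedule": "business-hours"},
     "hardwareProfile": {"vmSize": "Standard_D4s_v5"}},
    {"name": "vm-batch-test", "powerState": "VM running",
     "tags": {"Environment": "test"},
     "hardwareProfile": {"vmSize": "Standard_D4s_v5"}},
    {"name": "vm-web-prod", "powerState": "VM running",
     "tags": {"Environment": "prod"},
     "hardwareProfile": {"vmSize": "Standard_D8s_v5"}},
    {"name": "vm-scratch", "powerState": "VM deallocated", "tags": None,
     "hardwareProfile": {"vmSize": "Standard_D2s_v5"}},
]

for name, size in unscheduled_non_prod(sample):
    print(f"{name} ({size}) has no AutoSchedule tag")
```

In a real script you would replace the sample list with json.load over the CLI output.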

💡 Architect Pro Tips

Stay within the same VM family for resizing. Resizing a running VM always restarts it. Moving from Standard_D4s_v5 to Standard_B2s may also require deallocating the VM first, because the two families can live on different hardware. Resizing within the same family (D4s_v5 to D2s_v5) is more likely to succeed in place with just the restart. The runbook handles both cases.
B-series VMs are ideal for off-hours minimums. The B-series is "burstable," meaning they accumulate CPU credits when idle. A dev who logs in at midnight gets burst performance from accumulated credits. Perfect for occasional off-hours use.
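Never set VMSS minimum to 0. Even if you think nobody will use it, a minimum of 1 means there is always an instance ready to serve requests. Scaling from 0 to 1 leaves users waiting for minutes while a VM provisions. Scaling from 1 to 3 also takes time, but the running instance keeps serving in the meantime. That one instance is your insurance policy.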
Disk costs don't change with VM resize. Managed disks are billed regardless of VM size or state. This strategy saves on compute (CPU/RAM) costs, not storage. Combine with disk optimization for additional savings.
Azure Automation pricing is nearly free. You get 500 minutes of free runbook execution per month. The runbook resizes VMs one at a time, so job runtime grows with the number of VMs and a larger fleet can exceed the free allowance, but the per-minute charge beyond it is small next to the compute savings.
Test your resize path first. Before enabling schedules, manually resize one VM from your full-power SKU to your minimum SKU and back. Confirm your application handles the transition gracefully. Some apps may need a service restart script in the runbook.

📊 TL;DR

Action	Savings	Availability Impact
Resize VMs to B2s off hours (Approach 1)	~55% on non-prod compute	Brief reboot (~60s) during resize
VMSS schedule profiles (Approach 2)	~50-65% on VMSS compute	Zero downtime (gradual scale)
Weekend minimum (both approaches)	Additional ~20% savings	Always-on at minimum capacity

The savings math for a typical dev team:

Schedule Strategy	Monthly Cost (10 VMs)	vs. 24/7
24/7 full power (no schedule)	$1,401	Baseline
Scale to B2s off hours + weekends	$636	Save $765/mo
Scale within same family (D2s_v5, assuming $0.096/hr)	$912	Save $489/mo

Bottom line: Scaling to minimum during off hours captures 70-80% of the savings you'd get from full shutdown, with none of the availability problems. Your late-night developers, your weekend deployers, and your distributed teams across timezones all keep working. Deploy this alongside the tagging (Part 1) and budget alerts (Part 2) for a complete cost governance stack. 🌙
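
The 70-80% figure can be checked against the rates used throughout this post. A short Python sketch, assuming the same 730-hour month and 220 business hours as the math at the top:

```python
HOURS_PER_MONTH = 730
BUSINESS_HOURS = 10 * 22
OFF_HOURS = HOURS_PER_MONTH - BUSINESS_HOURS

FULL_RATE = 0.192  # Standard_D4s_v5, per hour
MIN_RATE = 0.042   # Standard_B2s, per hour

always_on = FULL_RATE * HOURS_PER_MONTH
full_shutdown = FULL_RATE * BUSINESS_HOURS           # pay nothing off hours
scale_to_min = full_shutdown + MIN_RATE * OFF_HOURS  # pay the B2s rate off hours

shutdown_savings = always_on - full_shutdown
scale_savings = always_on - scale_to_min
captured = scale_savings / shutdown_savings

print(f"Full shutdown saves ${shutdown_savings:.2f}/VM/month")
print(f"Scale to minimum saves ${scale_savings:.2f}/VM/month")
print(f"That is {captured:.0%} of the full-shutdown savings")
```

With these rates the share works out to about 78%. It will shift if your off-hours SKU is pricier relative to the daytime one.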

Run that audit command. Count your non-prod VMs without an AutoSchedule tag. Multiply each one by $75/month in potential savings. That's money you're leaving on the table tonight. 😏

This is Part 3 of the "Save on Azure with Terraform" series. Next up: Your Cloud Bill Has Ghosts 👻. Finding and destroying orphaned Azure resources that are quietly billing you every month. 💬
