Prince Raj

🚨 How We Rescued a Dead Azure Linux VM After SSH, Agent, and OS Disk All Broke (A Real Production War Story)

TL;DR:

Azure VM became completely inaccessible:

  • SSH hung forever
  • RunCommand was locked
  • VM agent was corrupted
  • OS disk had already been swapped once

The fix wasn’t another hack — it was a controlled surgical migration to a fresh VM identity using the same OS disk.

This post is the full battle log.


🧩 The Situation

We were running a monitoring server on Azure (Ubuntu).

One day:

  • SSH stopped responding (connection established → banner timeout)
  • Azure RunCommand extension locked itself permanently
  • VM agent entered a zombie state
  • Repair VM flow was used
  • OS disk was restored
  • And yet… the VM was still broken

At this point the machine was logically alive but operationally dead.

This is the worst kind of outage:

The VM exists, the disk exists, but the control plane is poisoned.


🧠 Why This Happens

Azure VMs have two halves:

Layer           What it contains
-----           ----------------
Guest OS        Your files, apps, data, users
Control Plane   VM identity, NIC metadata, agent channels, extensions

Disk repair only fixes the guest OS.

When control plane metadata breaks, no amount of in-guest repair will make the VM stable again.

This is what happened.
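
You can see the control plane's side of the story directly. A quick check (using the same resource names as the rest of this post) that asks the platform what it thinks of the agent and extensions:

# Instance view = the control plane's live report on the VM
az vm get-instance-view -g prod -n monitoring-server \
  --query "instanceView.{agent:vmAgent.statuses, extensions:extensions}"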


🧯 What We Tried (And Why It Still Failed)

We tried everything:

  • VM Repair workflow (see the sketch after this list)
  • OS disk swap
  • SSH daemon reset
  • Firewall rebuild
  • iptables flush
  • VM agent resets
  • RunCommand agent rebuild
  • Forced deallocation
  • Extension deletion
  • Boot diagnostics

Every symptom disappeared… except the real problem:

The VM identity itself was corrupted.
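
For reference, the repair workflow from the first bullet uses the vm-repair CLI extension. A sketch (the repair credentials here are placeholders):

# Creates a side-by-side repair VM with the broken VM's OS disk
# attached, so it can be fixed offline
az extension add --name vm-repair
az vm repair create -g prod -n monitoring-server \
  --repair-username repairadmin --repair-password '<placeholder>' --verbose

It can fix what lives on the disk. It cannot fix the VM object itself, which is exactly where our corruption lived.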


🛠 The Correct Solution: Lift-and-Shift the OS Disk

When Azure VM identity breaks, you do not fix the VM.

You replace it.

Here’s the exact migration procedure.


🧱 Step-By-Step Recovery


1️⃣ Deallocate the broken VM

az vm deallocate -g prod -n monitoring-server
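
Worth confirming the VM actually reached the deallocated state before touching the disk:

# Should print "VM deallocated" once the control plane releases it
az vm get-instance-view -g prod -n monitoring-server \
  --query "instanceView.statuses[?starts_with(code,'PowerState')].displayStatus" -o tsv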

2️⃣ Detach the OS Disk

# Capture the OS disk name before touching the VM
osDisk=$(az vm show -g prod -n monitoring-server --query "storageProfile.osDisk.name" -o tsv)

# Drop the OS disk from the broken VM's model. If the API refuses to
# remove the OS disk this way, deleting the deallocated VM instead also
# frees the disk: managed disks survive VM deletion by default.
az vm update -g prod -n monitoring-server --remove storageProfile.osDisk
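
Before creating the new VM, confirm the platform sees the disk as free:

# diskState must read "Unattached" before --attach-os-disk will succeed
az disk show -g prod -n $osDisk --query diskState -o tsv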

3️⃣ Create a Fresh VM Using the Same Disk

# NOTE: when attaching an existing OS disk, Azure rejects an osProfile,
# so --admin-username / --generate-ssh-keys must be omitted. You log in
# with the users and SSH keys already on the disk.
az vm create \
  -g prod \
  -n monitoring-server-v2 \
  --attach-os-disk $osDisk \
  --os-type Linux \
  --size Standard_D2s_v3 \
  --public-ip-sku Standard \
  --vnet-name prod-vnet \
  --subnet default \
  --nsg prod-nsg

This boots the entire original system on a clean VM identity. Because the OS disk carries its existing users and keys, you log in with the original credentials.
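
If you want proof that the identity really is fresh, check the unique VM ID; the new object gets a new one even though the disk is unchanged:

# vmId is the VM's unique control-plane identifier
az vm show -g prod -n monitoring-server-v2 --query vmId -o tsv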


4️⃣ SSH Into the New Server

ssh azureuser@<NEW_PUBLIC_IP>
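
If you don't have the new IP handy, the CLI will resolve it:

# -d / --show-details makes az resolve the instance's public IP
az vm show -d -g prod -n monitoring-server-v2 --query publicIps -o tsv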

💥 The server comes up instantly.
All services intact.
No corruption.
No agent problems.
No SSH issues.


5️⃣ Delete the Broken VM

After verifying everything:

az vm delete -g prod -n monitoring-server --yes
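
Note that az vm delete removes only the VM object by default; the old NIC and public IP linger and keep costing money. A cleanup sketch (these are the default names the Azure CLI generates, so verify yours first):

# List leftovers, then delete them once confirmed
az network nic list -g prod -o table
az network nic delete -g prod -n monitoring-serverVMNic
az network public-ip delete -g prod -n monitoring-serverPublicIP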

🧬 Why This Works

In this failure mode, the disk is not the problem.
The problem is VM metadata rot:

  • Corrupted agent channels
  • Broken extension registry
  • Stuck execution locks
  • Damaged NIC bindings
  • Poisoned provisioning state

By re-hydrating the OS disk on a new VM object, you bypass all of it.

This is effectively the move Azure's own VM repair tooling makes: mount the disk on a fresh VM object.


🧭 Lessons Learned

  1. Disk ≠ VM
  2. Repair fixes OS, not control plane
  3. SSH problems after disk swap are usually identity corruption
  4. Don’t waste days debugging a poisoned VM
  5. Rebuild the VM, not the disk

🧨 Final Thought

When your VM is half-alive and nothing makes sense anymore…

Stop fixing it.
Move the brain (OS disk) into a new body (VM).

This approach saved my production monitoring stack and cut downtime from days to minutes.

If this saved you — share it.
Someone else is debugging this exact nightmare right now.


Happy debugging 👨‍💻🔥
