TL;DR:
Azure VM became completely inaccessible:
- SSH hung forever
- RunCommand was locked
- VM agent was corrupted
- OS disk had already been swapped once
The fix wasn’t another hack — it was a controlled surgical migration to a fresh VM identity using the same OS disk.
This post is the full battle log.
🧩 The Situation
We were running a monitoring server on Azure (Ubuntu).
One day:
- SSH stopped responding (connection established → banner timeout)
- Azure RunCommand extension locked itself permanently
- VM agent entered a zombie state
- Repair VM flow was used
- OS disk was restored
- And yet… the VM was still broken
At this point the machine was logically alive but operationally dead.
This is the worst kind of outage:
The VM exists, the disk exists, but the control plane is poisoned.
🧠 Why This Happens
Azure VMs have two halves:
| Layer | What it Contains |
|---|---|
| Guest OS | Your files, apps, data, users |
| Control Plane | VM identity, NIC metadata, agent channels, extensions |
Disk repair only fixes the guest OS.
When control plane metadata breaks, the VM stays unstable no matter how healthy the disk is.
This is what happened.
🧯 What We Tried (And Why It Still Failed)
We tried everything:
- VM Repair workflow
- OS disk swap
- SSH daemon reset
- Firewall rebuild
- iptables flush
- VM agent resets
- RunCommand agent rebuild
- Forced deallocation
- Extension deletion
- Boot diagnostics
Each fix made a symptom disappear… but the real problem stayed:
The VM identity itself was corrupted.
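For anyone retracing this, these are roughly the checks we kept running to confirm that state. The commands are a sketch using the same resource group and VM name as the rest of the post; the exact query paths may need tweaking for your CLI version.

```bash
# Control-plane view of the guest agent: for us this stayed "Not Ready"
az vm get-instance-view -g prod -n monitoring-server \
  --query "instanceView.vmAgent.statuses" -o table

# Extension provisioning states: RunCommand sat in a stuck/failed state
az vm extension list -g prod --vm-name monitoring-server -o table

# Serial log via boot diagnostics: the OS itself kept booting fine
az vm boot-diagnostics get-boot-log -g prod -n monitoring-server
```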
🛠 The Correct Solution: Lift-and-Shift the OS Disk
When Azure VM identity breaks, you do not fix the VM.
You replace it.
Here’s the exact migration procedure.
🧱 Step-By-Step Recovery
1️⃣ Deallocate the broken VM
az vm deallocate -g prod -n monitoring-server
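Optional, but worth confirming the VM is actually deallocated (not just stopped) before touching the disk. This query is a sketch and assumes the standard instance-view shape:

```bash
# Should print "VM deallocated"
az vm get-instance-view -g prod -n monitoring-server \
  --query "instanceView.statuses[?starts_with(code, 'PowerState')].displayStatus" -o tsv
```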
2️⃣ Detach the OS Disk
osDisk=$(az vm show -g prod -n monitoring-server --query "storageProfile.osDisk.name" -o tsv)
az vm update -g prod -n monitoring-server --remove storageProfile.osDisk
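One caveat: depending on the API version, removing the OS disk from an existing VM can be rejected (OS disks historically can't be detached the way data disks can). If that happens, deleting the broken VM while keeping its managed disk gets you to the same place. This is a sketch, and it assumes the disk's delete option isn't set to Delete:

```bash
# Capture the full disk ID first, then drop the VM shell.
# By default the managed OS disk is NOT deleted along with the VM.
osDiskId=$(az vm show -g prod -n monitoring-server \
  --query "storageProfile.osDisk.managedDisk.id" -o tsv)
az vm delete -g prod -n monitoring-server --yes
# In step 3, pass $osDiskId to --attach-os-disk instead of the disk name
# (and step 5 below is then already done).
```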
3️⃣ Create a Fresh VM Using the Same Disk
az vm create \
-g prod \
-n monitoring-server-v2 \
--attach-os-disk $osDisk \
--os-type Linux \
--size Standard_D2s_v3 \
--public-ip-sku Standard \
--vnet-name prod-vnet \
--subnet default \
--nsg prod-nsg \
--admin-username azureuser \
--generate-ssh-keys
This boots the entire original system on a clean VM identity.
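To grab the <NEW_PUBLIC_IP> for the next step, this query against the standard output of az vm list-ip-addresses does the job:

```bash
# Public IP of the freshly created VM
az vm list-ip-addresses -g prod -n monitoring-server-v2 \
  --query "[0].virtualMachine.network.publicIpAddresses[0].ipAddress" -o tsv
```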
4️⃣ SSH Into the New Server
ssh azureuser@<NEW_PUBLIC_IP>
💥 The server comes up instantly.
All services intact.
No corruption.
No agent problems.
No SSH issues.
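If you want more than a gut check, this is the kind of verification pass worth running on the new box before touching the old VM. Service names are from our Ubuntu setup and may differ on yours:

```bash
# Nothing should be in a failed state
sudo systemctl list-units --failed

# The Azure Linux agent should be healthy again (walinuxagent on Ubuntu)
sudo systemctl status walinuxagent --no-pager

# Disk mounted where expected, box stable
df -h
uptime
```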
5️⃣ Delete the Broken VM
After verifying everything:
az vm delete -g prod -n monitoring-server --yes
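Note that az vm delete does not cascade by default (unless delete options were configured on the VM), so the old NIC and public IP stick around. The resource names below are hypothetical; list first, then delete what you recognize:

```bash
# See what the broken VM left behind
az network nic list -g prod -o table
az network public-ip list -g prod -o table

# Example cleanup (hypothetical resource names; verify before deleting)
az network nic delete -g prod -n monitoring-server-nic
az network public-ip delete -g prod -n monitoring-server-ip
```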
🧬 Why This Works
The disk is never the problem.
The problem is VM metadata rot:
- Corrupted agent channels
- Broken extension registry
- Stuck execution locks
- Damaged NIC bindings
- Poisoned provisioning state
By re-hydrating the OS disk on a new VM object, you bypass all of it.
This is the same internal process Azure support uses.
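If you want to see the "new identity" concretely, compare the platform-assigned vmId: it changes with the new VM object even though every byte on the OS disk is the same.

```bash
# Different value than the old VM had, despite the identical OS disk
az vm show -g prod -n monitoring-server-v2 --query "vmId" -o tsv
```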
🧭 Lessons Learned
- Disk ≠ VM
- Repair fixes OS, not control plane
- SSH problems after disk swap are usually identity corruption
- Don’t waste days debugging a poisoned VM
- Rebuild the VM, not the disk
🧨 Final Thought
When your VM is half-alive and nothing makes sense anymore…
Stop fixing it.
Move the brain (OS disk) into a new body (VM).
This approach saved my production monitoring stack and cut downtime from days to minutes.
If this saved you — share it.
Someone else is debugging this exact nightmare right now.