Prince Raj

🚨 How We Rescued a Dead Azure Linux VM After SSH, Agent, and OS Disk All Broke (A Real Production War Story)

TL;DR:

Azure VM became completely inaccessible:

  • SSH hung forever
  • RunCommand was locked
  • VM agent was corrupted
  • OS disk had already been swapped once

The fix wasn’t another hack — it was a controlled surgical migration to a fresh VM identity using the same OS disk.

This post is the full battle log.


🧩 The Situation

We were running a monitoring server on Azure (Ubuntu).

One day:

  • SSH stopped responding (connection established → banner timeout)
  • Azure RunCommand extension locked itself permanently
  • VM agent entered a zombie state
  • Repair VM flow was used
  • OS disk was restored
  • And yet… the VM was still broken

At this point the machine was logically alive but operationally dead.

This is the worst kind of outage:

The VM exists, the disk exists, but the control plane is poisoned.


🧠 Why This Happens

Azure VMs have two halves:

Layer           What it contains
-----           ----------------
Guest OS        Your files, apps, data, users
Control Plane   VM identity, NIC metadata, agent channels, extensions

Disk repair only fixes the guest OS.

When control plane metadata breaks, no amount of in-guest repair will make the VM stable again.

This is what happened.
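
You can see the control plane's side of the story directly. A quick check (using the same resource names as the rest of this post) that asks the platform what it thinks of the agent and extensions:

# Instance view = the control plane's live report on the VM
az vm get-instance-view -g prod -n monitoring-server \
  --query "instanceView.{agent:vmAgent.statuses, extensions:extensions}"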


🧯 What We Tried (And Why It Still Failed)

We tried everything:

  • VM Repair workflow (see the sketch after this list)
  • OS disk swap
  • SSH daemon reset
  • Firewall rebuild
  • iptables flush
  • VM agent resets
  • RunCommand agent rebuild
  • Forced deallocation
  • Extension deletion
  • Boot diagnostics

Every symptom disappeared… except the real problem:

The VM identity itself was corrupted.
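
For reference, the repair workflow from the first bullet uses the vm-repair CLI extension. A sketch (the repair credentials here are placeholders):

# Creates a side-by-side repair VM with the broken VM's OS disk
# attached, so it can be fixed offline
az extension add --name vm-repair
az vm repair create -g prod -n monitoring-server \
  --repair-username repairadmin --repair-password '<placeholder>' --verbose

It can fix what lives on the disk. It cannot fix the VM object itself, which is exactly where our corruption lived.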


🛠 The Correct Solution: Lift-and-Shift the OS Disk

When Azure VM identity breaks, you do not fix the VM.

You replace it.

Here’s the exact migration procedure.


🧱 Step-By-Step Recovery


1️⃣ Deallocate the broken VM

az vm deallocate -g prod -n monitoring-server
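
Worth confirming the VM actually reached the deallocated state before touching the disk:

# Should print "VM deallocated" once the control plane releases it
az vm get-instance-view -g prod -n monitoring-server \
  --query "instanceView.statuses[?starts_with(code,'PowerState')].displayStatus" -o tsv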

2️⃣ Detach the OS Disk

# Capture the OS disk name before touching the VM
osDisk=$(az vm show -g prod -n monitoring-server --query "storageProfile.osDisk.name" -o tsv)

# Drop the OS disk from the broken VM's model. If the API refuses to
# remove the OS disk this way, deleting the deallocated VM instead also
# frees the disk: managed disks survive VM deletion by default.
az vm update -g prod -n monitoring-server --remove storageProfile.osDisk
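
Before creating the new VM, confirm the platform sees the disk as free:

# diskState must read "Unattached" before --attach-os-disk will succeed
az disk show -g prod -n $osDisk --query diskState -o tsv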

3️⃣ Create a Fresh VM Using the Same Disk

# NOTE: when attaching an existing OS disk, Azure rejects an osProfile,
# so --admin-username / --generate-ssh-keys must be omitted. You log in
# with the users and SSH keys already on the disk.
az vm create \
  -g prod \
  -n monitoring-server-v2 \
  --attach-os-disk $osDisk \
  --os-type Linux \
  --size Standard_D2s_v3 \
  --public-ip-sku Standard \
  --vnet-name prod-vnet \
  --subnet default \
  --nsg prod-nsg

This boots the entire original system on a clean VM identity. Because the OS disk carries its existing users and keys, you log in with the original credentials.
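
If you want proof that the identity really is fresh, check the unique VM ID; the new object gets a new one even though the disk is unchanged:

# vmId is the VM's unique control-plane identifier
az vm show -g prod -n monitoring-server-v2 --query vmId -o tsv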


4️⃣ SSH Into the New Server

ssh azureuser@<NEW_PUBLIC_IP>
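
If you don't have the new IP handy, the CLI will resolve it:

# -d / --show-details makes az resolve the instance's public IP
az vm show -d -g prod -n monitoring-server-v2 --query publicIps -o tsv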

💥 The server comes up instantly.
All services intact.
No corruption.
No agent problems.
No SSH issues.


5️⃣ Delete the Broken VM

After verifying everything:

az vm delete -g prod -n monitoring-server --yes
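
Note that az vm delete removes only the VM object by default; the old NIC and public IP linger and keep costing money. A cleanup sketch (these are the default names the Azure CLI generates, so verify yours first):

# List leftovers, then delete them once confirmed
az network nic list -g prod -o table
az network nic delete -g prod -n monitoring-serverVMNic
az network public-ip delete -g prod -n monitoring-serverPublicIP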

🧬 Why This Works

In this failure mode, the disk is not the problem.
The problem is VM metadata rot:

  • Corrupted agent channels
  • Broken extension registry
  • Stuck execution locks
  • Damaged NIC bindings
  • Poisoned provisioning state

By re-hydrating the OS disk on a new VM object, you bypass all of it.

This is effectively the move Azure's own VM repair tooling makes: mount the disk on a fresh VM object.


🧭 Lessons Learned

  1. Disk ≠ VM
  2. Repair fixes OS, not control plane
  3. SSH problems after disk swap are usually identity corruption
  4. Don’t waste days debugging a poisoned VM
  5. Rebuild the VM, not the disk

🧨 Final Thought

When your VM is half-alive and nothing makes sense anymore…

Stop fixing it.
Move the brain (OS disk) into a new body (VM).

This approach saved my production monitoring stack and cut downtime from days to minutes.

If this saved you — share it.
Someone else is debugging this exact nightmare right now.


Happy debugging 👨‍💻🔥
