DEV Community

hardyweb


Incus Cluster Gone: When a Dead SD Card Wipes Your Infra

What happens when you run a two-node Incus cluster, one node fails catastrophically (for example, its SD card gets corrupted), and your .img backup doesn't help with recovery?

Step-by-step Breakdown
After realizing that the second node (with the dead SD card) couldn't boot anymore, I began recovery on the surviving node.

Check the cluster

incus cluster list

From the output, I identified the name of the broken node (e.g. broken-node) and noted which containers were located on it.
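Picking out the instances on the dead member can also be scripted rather than read off the table by eye. This is a minimal sketch: `instances_on_member` is a hypothetical helper name, and it assumes that `incus list -c nL --format csv` prints `name,location` rows on your Incus version.

```shell
# Hypothetical helper: reads "name,location" CSV rows on stdin and prints
# the names of the instances located on the given cluster member.
instances_on_member() {
    awk -F, -v member="$1" '$2 == member { print $1 }'
}

# Assumed usage on the surviving node (column n = name, L = location):
#   incus list -c nL --format csv | instances_on_member broken-node
```

Keeping the filter separate from the `incus` call means it can be checked against sample rows before pointing it at a degraded cluster.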

Try removing the broken node

incus cluster remove broken-node

But I got an error:

Error: Node still has the following instances: a1, a2, a3, b4, b5

Tried to delete those instances manually

incus delete a1

or

incus delete a1 --force

Still failed:

Error: Failed checking instance exists "local:a1": Missing event connection with target cluster member


Final Fix: Using SQL to Delete Orphaned Instances
At this point, the only option left was to remove the leftover metadata manually with the Incus admin SQL tool.

First, confirm that the row exists and which node it belongs to:

incus admin sql global "SELECT name, node_id FROM instances WHERE name='a1';"

Then delete it:

incus admin sql global "DELETE FROM instances WHERE name='a1';"
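With five orphaned instances (a1, a2, a3, b4, b5), typing each statement by hand is error-prone. The sketch below only generates the SQL so it can be reviewed before anything touches the database. `orphan_delete_sql` is a hypothetical helper, and the extra `AND node_id=...` condition is an added precaution (not something the original commands used) so that only rows belonging to the dead node can match.

```shell
# Hypothetical helper: prints one DELETE statement per instance name,
# restricted to the given node id.
# Usage: orphan_delete_sql NODE_ID NAME...
orphan_delete_sql() {
    [ "$#" -ge 1 ] || { echo "usage: orphan_delete_sql NODE_ID NAME..." >&2; return 1; }
    node_id=$1
    shift
    case "$node_id" in
        ''|*[!0-9]*) echo "node id must be numeric: $node_id" >&2; return 1 ;;
    esac
    for name in "$@"; do
        # Allow only letters, digits and hyphens so nothing unexpected
        # ends up inside the SQL string.
        case "$name" in
            ''|*[!A-Za-z0-9-]*) echo "skipping invalid name: $name" >&2; continue ;;
        esac
        printf "DELETE FROM instances WHERE name='%s' AND node_id=%s;\n" "$name" "$node_id"
    done
}

# Assumed usage: review the output first, then run each statement.
#   orphan_delete_sql 2 a1 a2 a3 b4 b5
#   incus admin sql global "DELETE FROM instances WHERE name='a1' AND node_id=2;"
```

The node id (2 in the usage comment) is an assumption; take the real value from the `node_id` column returned by the SELECT.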

Partially Successful So Far (KIV)

With the orphaned instance rows cleaned up, I retried removing the node:

incus cluster remove broken-node

This time the error was different:

Error: Delete "https://192.168.xxx.xxx:8443/1.0/storage-pools/local": Unable to connect to: 

The removal still tries to reach the dead node to delete its storage pool, so I looked up the node and storage pool IDs:

incus admin sql global "SELECT id, name FROM nodes;"
incus admin sql global "SELECT id, name FROM storage_pools;"

nodes:
1|node-alive
2|node-dead

storage_pools:
1|local

Then I removed the dead node's entry for the pool and checked which storage volumes still reference that node:

incus admin sql global "DELETE FROM storage_pools_nodes WHERE node_id=2 AND storage_pool_id=1;"

incus admin sql global "SELECT id, name, type, project_id FROM storage_volumes WHERE node_id=2;"
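The storage_pools_nodes cleanup uses hard-coded IDs (node 2, pool 1), which are easy to mix up. As a rough sketch, the same statement can be built from names, with subqueries resolving the IDs. `pool_node_delete_sql` is a hypothetical helper, and it assumes the `nodes`, `storage_pools` and `storage_pools_nodes` tables look the way the queries in this post suggest.

```shell
# Hypothetical helper: prints the storage_pools_nodes cleanup statement,
# resolving ids from names with subqueries instead of hard-coding them.
# Usage: pool_node_delete_sql MEMBER_NAME POOL_NAME
pool_node_delete_sql() {
    member=$1
    pool=$2
    for value in "$member" "$pool"; do
        # Allow only letters, digits, underscores and hyphens in names.
        case "$value" in
            ''|*[!A-Za-z0-9_-]*) echo "invalid name: $value" >&2; return 1 ;;
        esac
    done
    printf "DELETE FROM storage_pools_nodes WHERE node_id=(SELECT id FROM nodes WHERE name='%s') AND storage_pool_id=(SELECT id FROM storage_pools WHERE name='%s');\n" "$member" "$pool"
}

# Assumed usage on the surviving node, after reviewing the printed statement:
#   incus admin sql global "$(pool_node_delete_sql node-dead local)"
```

Printing the statement instead of running it directly leaves room for a final check, which matters because edits made through the admin SQL tool bypass the normal Incus safety checks.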

I haven't tried importing the backup from the MinIO server yet.
