Incus Cluster Gone: When a Dead SD Card Wipes Your Infra

What happens when you run a two-node Incus cluster, one node suffers a disaster such as SD card corruption, and your .img backup doesn't help with recovery?

Step-by-step Breakdown
After realizing that the second node (with the dead SD card) couldn't boot anymore, I began recovery on the surviving node.

Check the cluster

incus cluster list

From here, I identified the broken node's name (e.g. broken-node) and noted which containers were located there.
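Picking the affected instances out of a long listing by eye is error-prone, so a small filter can help. This is a minimal sketch, assuming `incus list -c nL -f csv` prints one `name,location` row per instance, where the `L` column is the cluster member. `instances_on_member` is a hypothetical helper name, not an Incus command.

```shell
# Hypothetical helper: read "name,location" CSV rows on stdin and print the
# names of the instances whose location matches the given cluster member.
instances_on_member() {
    awk -F, -v member="$1" '$2 == member { print $1 }'
}

# Assumed usage on the surviving node (not executed here):
#   incus list -c nL -f csv | instances_on_member broken-node
```

The helper only reads text, so it can be tried on saved output before pointing it at a live cluster.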

Try removing the broken node

incus cluster remove broken-node

But I got an error:

Error: Node still has the following instances: a1, a2, a3, b4, b5
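The error message itself lists every instance that blocks the removal, so it can be turned into a plain list for the later steps. This is a rough sketch that assumes the message keeps exactly the wording shown above. `names_from_error` is a hypothetical helper name.

```shell
# Hypothetical helper: pull the instance names out of an error line of the
# form shown above and print them one per line.
names_from_error() {
    sed -n 's/^Error: Node still has the following instances: //p' |
        tr ',' '\n' |
        sed 's/^ *//; s/ *$//'
}

# Assumed usage (not executed here); 2>&1 captures the message whether it is
# written to stdout or stderr:
#   incus cluster remove broken-node 2>&1 | names_from_error
```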

Try deleting those instances manually

incus delete a1

incus delete a1 --force

Still failed:

Error: Failed checking instance exists "local:a1": Missing event connection with target cluster member

Final Fix: Using SQL to Delete Orphaned Instances
At this point, the only way forward was to manually remove the leftover metadata with the Incus admin SQL tool.

incus admin sql global "SELECT name, node_id FROM instances WHERE name='a1';"

incus admin sql global "DELETE FROM instances WHERE name='a1';"
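The SQL cleanup has to be repeated for every orphaned instance (a1, a2, a3, b4, b5 here). Rather than typing each statement by hand, the statements can be generated and reviewed first. This is only a sketch: `delete_statements` is a hypothetical helper, and it assumes that a plain delete from the `instances` table by name is the intended operation. Instance names are unique only within a project, so checking `node_id` with the SELECT above before deleting is a sensible precaution, as is backing up the database.

```shell
# Hypothetical helper: print one DELETE statement per instance name so the
# SQL can be reviewed before anything is run. Names that are empty or contain
# characters other than letters, digits, and hyphens are skipped to keep the
# quoting safe.
delete_statements() {
    for name in "$@"; do
        case "$name" in
            ''|*[!A-Za-z0-9-]*)
                echo "skipping suspicious name: $name" >&2
                continue
                ;;
        esac
        printf "DELETE FROM instances WHERE name='%s';\n" "$name"
    done
}

# Assumed usage (not executed here): review the output, then pass each
# statement to `incus admin sql global "<statement>"`.
#   delete_statements a1 a2 a3 b4 b5
```

The helper prints statements instead of running them, because a mistake in the global database is hard to undo.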

====== This part was only partially successful [KIV] ======

incus cluster remove broken-node

Error: Delete "https://192.168.xxx.xxx:8443/1.0/storage-pools/local": Unable to connect to:

incus admin sql global "SELECT id, name FROM nodes;"
incus admin sql global "SELECT id, name FROM storage_pools;"

nodes:
1|node-alive
2|node-dead

storage_pools:
1|local

incus admin sql global "DELETE FROM storage_pools_nodes WHERE node_id=2 AND storage_pool_id=1;"

incus admin sql global "SELECT id, name, type, project_id FROM storage_volumes WHERE node_id=2;"
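The statements above hard-code `node_id=2` and `storage_pool_id=1`, read off the listings by eye. A small lookup reduces the chance of deleting rows for the wrong node. This sketch assumes the rows have been reduced to the `id|name` form shown in the listings above; the raw `incus admin sql` output may be formatted differently and would need trimming first. `id_for_name` is a hypothetical helper.

```shell
# Hypothetical helper: given "id|name" rows on stdin, print the id whose
# name matches the argument exactly.
id_for_name() {
    awk -F'|' -v name="$1" '$2 == name { print $1 }'
}

# Example with the listing from this post:
#   printf '1|node-alive\n2|node-dead\n' | id_for_name node-dead
```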

I haven't tried importing the backup from the MinIO server yet.

DEV Community
