Proxmox Cluster Quorum: How Many Nodes Do You Actually Need

#proxmox #quorum #highavailability #homelab

I woke up to a cluster that had effectively turned itself into a read-only museum. My VMs were running, but I couldn't start a new one, I couldn't migrate a workload, and the Proxmox GUI was throwing "Cluster not ready - no quorum" errors across the board. I had a two-node setup, one node had rebooted for a kernel update, and the remaining node decided that since it didn't have a majority, it no longer had the right to make decisions.

If you're building a Proxmox cluster, quorum is the one concept that will either be completely invisible or the primary reason your entire infrastructure freezes. Most people treat it as a checkbox during the cluster creation wizard, but in a home lab, the math of quorum often clashes with the reality of how many physical servers you can actually fit in your rack.

What I tried first

My initial instinct was that "Cluster" simply meant "nodes that can talk to each other." I assumed that as long as one node was alive, the cluster was alive. I set up two beefy nodes, linked them together, and felt confident.

Then I hit the quorum wall. Corosync requires a strict majority of votes: floor(n/2) + 1. For two nodes, that means both votes are needed. If one node goes down, the remaining node holds one vote out of two, which is not a majority. It loses quorum and enters a protective state: the cluster configuration filesystem (/etc/pve) turns read-only and configuration changes are blocked. That prevents a split-brain scenario where both nodes believe they are in charge and write conflicting data to shared storage, which is a great way to corrupt your VM disks.
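
To make the arithmetic concrete, here is a minimal sketch of the majority rule in plain shell. The function names quorum_needed and has_quorum are made up for illustration and are not part of any Proxmox tooling; the sketch also assumes every node carries exactly one vote.

```shell
# Votes needed for quorum: a strict majority of all expected votes.
quorum_needed() {
    echo $(( $1 / 2 + 1 ))   # shell integer division rounds down
}

# Usage: has_quorum <total_votes> <live_votes>
# Succeeds (exit 0) when the live votes reach the threshold.
has_quorum() {
    [ "$2" -ge "$(quorum_needed "$1")" ]
}

quorum_needed 2                                        # prints 2
has_quorum 2 1 || echo "2 nodes, 1 alive: no quorum"
has_quorum 3 2 && echo "3 nodes, 2 alive: quorum held"
```

The two-node case fails because the threshold equals the total: losing any single vote drops you below it.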

I tried to "fix" this by manually forcing quorum on the surviving node using pvecm expected 1. It worked for a few minutes, but it's a manual band-aid. Every time a node rebooted or a network cable acted up, I was back in the CLI fighting with the cluster manager. I realized I was fighting the fundamental design of Corosync, and the only way out was to change the voting math.

The actual solution

You have three real options depending on your hardware budget and your tolerance for manual intervention.

Option 1: The Three-Node Standard

The cleanest way to solve quorum is to just add a third node. With three nodes, quorum is two votes. If one node dies, two remain. You still have a majority, and HA (High Availability) actually works as intended.

Option 2: The QDevice (The "Cheap" Vote)

If you can't justify a third full-sized server, you use a Quorum Device (QDevice). A QDevice is a lightweight external voter. It doesn't run VMs; it just tells the cluster "Yes, I see Node A." You can run this on a Raspberry Pi, a tiny VM on a separate host, or even a cheap VPS.

To set up a QDevice on a separate Debian/Ubuntu machine:

# On the QDevice server (the voter)
apt update && apt install corosync-qnetd

# On all Proxmox nodes
apt update && apt install corosync-qdevice

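Once the packages are installed, you set up the device from one of the Proxmox nodes. The setup command connects to the QDevice host over SSH as root, so root login must be allowed there, at least while you run it: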

# Run this on one PVE node
pvecm qdevice setup <IP-OF-QDEVICE-SERVER>

This adds a third vote to the cluster without requiring a third Proxmox node. Now, if one PVE node fails, the other PVE node and the QDevice provide the two votes needed to maintain quorum.
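
After the setup it is worth confirming that the extra vote is actually being counted. The sketch below wraps the relevant status commands in two helper functions; the function names are hypothetical, while pvecm, corosync-qdevice-tool and corosync-qnetd-tool ship with the packages installed above.

```shell
# On a Proxmox node: vote summary plus the local qdevice daemon's view.
check_qdevice_from_node() {
    pvecm status                 # the membership list should include a Qdevice entry
    corosync-qdevice-tool -s     # status of the qdevice daemon on this node
}

# On the QDevice host: list the cluster nodes currently connected to qnetd.
check_qnetd_clients() {
    corosync-qnetd-tool -l
}
```

Run check_qdevice_from_node on each PVE node and check_qnetd_clients on the voter. If the expected votes in pvecm status still read 2 rather than 3, the QDevice is not participating yet.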

Option 3: Monitoring and API Integration

If you're running a larger setup, you shouldn't be checking quorum by clicking through the GUI. I integrated pve_exporter with Prometheus to get alerts the second a node loses its vote.

Since I'm using token-based authentication to avoid the security risks of root passwords in plain text (see my post on Proxmox API Tokens), the setup looks like this.

First, create a restricted user for the exporter:

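# Create the user (the realm is part of the user ID; token-only access needs no password)
pveum user add prometheus@pve --comment "Prometheus exporter"

# Grant the read-only PVEAuditor role on the whole resource tree
pveum acl modify / --users prometheus@pve --roles PVEAuditor

# Create an API token named "prometheus"; --privsep 0 lets it inherit the user's permissions.
# The token secret is printed only once, so copy it now.
pveum user token add prometheus@pve prometheus --privsep 0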

Then, configure the pve_exporter YAML:

default:
  user: prometheus@pve
  token_name: prometheus
  token_value: <TOKEN-SECRET>  # the secret printed when the token was created, not the token ID
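
It helps to see how the user, token name and secret fit together, since the three are easy to mix up. The sketch below builds the Authorization header that the Proxmox API expects for token authentication and uses it to query the cluster status endpoint directly. The function names and the PVE_HOST and PVE_TOKEN_SECRET variables are placeholders invented for this example, and -k skips TLS verification on the assumption that the node still uses its self-signed certificate.

```shell
# Build the HTTP header Proxmox expects for API-token authentication.
# Arguments: user@realm, token name, token secret (printed once at creation).
pve_auth_header() {
    printf 'Authorization: PVEAPIToken=%s!%s=%s\n' "$1" "$2" "$3"
}

# Ask one node for the cluster status; the entry of type "cluster"
# carries a "quorate" field.
cluster_status() {
    curl -s -k -H "$(pve_auth_header prometheus@pve prometheus "$PVE_TOKEN_SECRET")" \
        "https://${PVE_HOST}:8006/api2/json/cluster/status"
}
```

With PVE_HOST and PVE_TOKEN_SECRET exported, calling cluster_status is a quick way to check that the token and the PVEAuditor role work before pointing the exporter at them.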

And the Prometheus scrape config to target the nodes:

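- job_name: 'proxmox'
  metrics_path: /pve
  scrape_interval: 30s
  params:
    cluster: ['1']
    node: ['1']
  relabel_configs:
    # __address__ includes the port, so capture only the host part
    # and pass it to the exporter as the ?target= parameter.
    - source_labels: [__address__]
      regex: '^(10\.0\.0\.\d+):9221$'
      target_label: __param_target
      replacement: $1
  static_configs:
    # pve_exporter listening on each node
    - targets: ['10.0.0.x:9221']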

Why it works

Proxmox uses Corosync for cluster membership and quorum. Corosync is designed for absolute consistency over availability (the "C" in the CAP theorem). It assumes that if you can't reach a majority of your peers, you are the one who is isolated, not them.

In a two-node cluster, there is no way to distinguish between "Node B is dead" and "The network cable between Node A and Node B is unplugged." If Node A decided to stay "active" while Node B also stayed "active," and both tried to modify the same shared storage (like a Ceph pool or an NFS share), you'd end up with a corrupted filesystem.

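By adding a third vote (either a node or a QDevice), you break the tie. The node that can still talk to the QDevice knows it is part of the majority. The node that is isolated knows it's alone and steps back: it stops accepting changes, and if HA is enabled it fences itself with a watchdog reboot so its VMs can be safely restarted on the quorate side.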
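
The tie-break can be worked through as a small vote count per side of a partition. This is only an illustration of the reasoning, not how Corosync is implemented: partition_verdict is a made-up helper, and it assumes each voter it is given holds exactly one vote.

```shell
# Count the votes visible to one side of a partition and report its verdict.
# Usage: partition_verdict <total_votes> <voter>...
partition_verdict() {
    total=$1; shift
    needed=$(( total / 2 + 1 ))
    if [ "$#" -ge "$needed" ]; then
        echo "quorate ($# of $total votes: $*)"
    else
        echo "not quorate ($# of $total votes: $*)"
    fi
}

# Two nodes plus a QDevice = 3 votes; the link between the nodes is cut.
partition_verdict 3 nodeA qdevice   # prints: quorate (2 of 3 votes: nodeA qdevice)
partition_verdict 3 nodeB           # prints: not quorate (1 of 3 votes: nodeB)
# Without the QDevice each side holds 1 of 2 votes, so neither is quorate.
partition_verdict 2 nodeA           # prints: not quorate (1 of 2 votes: nodeA)
```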

Lessons learned

The biggest lesson here is that High Availability (HA) is a lie if you don't have a proper quorum strategy. I spent a week thinking I had "HA" because I had two nodes and shared storage. In reality, I had a system that would freeze the moment I tried to update a BIOS or swap a NIC.

If you're running a two-node cluster, do not rely on pvecm expected 1. It's a temporary fix for recovery, not a configuration. Get a QDevice. Even a $35 Raspberry Pi is better than a cluster that goes read-only during a midnight update.

I also found that hardware stability plays a huge role in quorum health. If you're seeing random "Node lost" messages in your logs but the server is still pingable, check your kernel settings. I've dealt with AMD Ryzen C-State freezes that looked like network failures but were actually the CPU dropping into a sleep state so deep that the NIC stopped responding long enough to exceed Corosync's token timeout.
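
When chasing this kind of intermittent loss, I'd start with what Corosync itself reports before blaming the network. A minimal sketch, with hypothetical function names and an arbitrary one-hour window; corosync-cfgtool and journalctl are already present on a Proxmox node.

```shell
# Link state as Corosync sees it right now (run on a Proxmox node).
corosync_links() {
    corosync-cfgtool -s
}

# Recent Corosync log lines, useful for matching dropouts to timestamps.
corosync_recent_log() {
    journalctl -u corosync --since "1 hour ago" --no-pager
}
```

If the dropouts in the log line up with idle periods rather than with traffic spikes, a power-management issue becomes a more likely suspect than the cabling.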

A few final caveats:

QDevice Placement: Don't run your QDevice as a VM on the same cluster it's voting for. That's circular logic. If the cluster loses quorum and the VM stops, the QDevice disappears, and you're stuck. Put it on a separate physical box or a different hypervisor.
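Network Latency: Corosync traffic between the Proxmox nodes is extremely sensitive to latency, so keep the cluster network wired and on a low-latency LAN. The QDevice link is more forgiving because it runs over a plain TCP connection, which is why a VPS can work. A flaky path such as slow Wi-Fi can still make the third vote come and go, and with one node down that means the cluster "flaps" in and out of quorum.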
The "Expected" Trap: When you manually change pvecm expected, you are telling the cluster to ignore the safety rules. Only do this when you are performing maintenance on a known-down node and need to regain control of the surviving one.
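
One way to keep the manual override from becoming a habit is to put a speed bump in front of it. The wrapper below is a sketch of that idea: the function name and the "peer-is-down" confirmation word are invented for this example, and only pvecm status and pvecm expected 1 are real commands.

```shell
# Wrapper around `pvecm expected 1` that refuses to run unless the operator
# explicitly confirms the other node is really powered off.
force_single_node_quorum() {
    if [ "${1:-}" != "peer-is-down" ]; then
        echo "refusing: verify the other node is off, then pass 'peer-is-down'" >&2
        return 1
    fi
    pvecm status        # show the state before touching anything
    pvecm expected 1    # lower expected votes so this node is quorate alone
}
```

Calling force_single_node_quorum with no argument does nothing except print the refusal, which is the point: the override only runs after a deliberate check.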

If you're scaling this into a production-grade environment, this is where the gap between a "homelab" and "infrastructure" becomes clear. For those needing professional help architecting these systems for zero-downtime, I provide infrastructure consulting to handle the messy parts of bare-metal orchestration.