
Chikara Inohara


Deep Dive: How Proxmox Actually Keeps Your Cluster in Sync (Corosync & pmxcfs Internals)

⚠️ Fair warning: I'm still learning this stuff, so some details might not be 100% perfect. Take it as a fellow homelab explorer's notes, not official docs!

In my last post, we talked about the outside view of a Proxmox cluster — quorum, split-brain, and how Corosync's strict timeouts decide when a node is declared dead. We looked at token-passing and fencing from a bird's eye view.

This time, let's crack open the hood and look inside.

Proxmox VE's cluster features are incredibly powerful, but for a lot of us, it feels like a black box. How does it actually stay in sync? What happens byte-by-byte when you change a VM config?

I went down a research rabbit hole diving into the source code of Corosync and pmxcfs, and here's what I found.


🏗️ Architecture Overview: Two Key Components

Everything in a Proxmox cluster boils down to two layers working in tight coordination:

Architecture diagram showing two Proxmox nodes. Each has a UI layer (pveproxy + pvedaemon), a cluster management layer (pmxcfs in RAM + Corosync), a VM layer, and a disk layer storing config.db. An arrow between nodes shows pmxcfs syncing via Corosync, with FUSE mounting the in-memory DB to the filesystem.

📝 Note: This diagram is reused from my original Japanese article on Qiita — too lazy to redraw it in English, sorry! The key things to spot: pmxcfs lives in RAM on each node, syncs between nodes via Corosync, and the SQLite DB is persisted to disk (shown as USB here — more on why that matters later!).

Corosync (Totem Protocol)

Job: Cluster membership management + message ordering guarantee

It provides something called Virtual Synchrony — every node receives messages in the exact same order. This is achieved through the token-passing mechanism we covered last time.

pmxcfs (Proxmox Cluster File System)

Job: Manages all the config files you see under /etc/pve

Here's the fun part — it's actually a SQLite database living in memory on each node. It just looks like a regular filesystem thanks to FUSE mounting. Wild, right?
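To make that concrete, here's a tiny Python sketch of the same idea — an in-memory SQLite database that gets persisted to a file on disk. The real pmxcfs is written in C and its backing file lives at /var/lib/pve-cluster/config.db; the table name and paths below are just my own illustration.

```python
import sqlite3

# In-memory database, analogous to the copy pmxcfs keeps in RAM.
mem = sqlite3.connect(":memory:")
mem.execute("CREATE TABLE tree (path TEXT PRIMARY KEY, data TEXT)")
mem.execute(
    "INSERT INTO tree VALUES (?, ?)",
    ("qemu-server/100.conf", "memory: 2048\ncores: 2\n"),
)
mem.commit()

# Persist the whole in-memory DB to a file on disk, loosely analogous
# to pmxcfs writing its backing config.db.
disk = sqlite3.connect("/tmp/config-sketch.db")
mem.backup(disk)

row = disk.execute(
    "SELECT data FROM tree WHERE path = ?", ("qemu-server/100.conf",)
).fetchone()
print(row[0])
disk.close()
mem.close()
```

The FUSE layer is the missing piece here: it's what makes rows like that appear as ordinary files under /etc/pve.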


🔄 Corosync / Totem Protocol: The Details

The heart of Corosync is the Totem Single-Ring Protocol. Regardless of your physical network topology, it creates a logical ring of nodes and circulates a special packet called a token around that ring to control who can send messages.

Token Passing

Only the node currently holding the token is allowed to broadcast (multicast) a message. This elegantly prevents write conflicts — no two nodes can write simultaneously. Everything is serialized.

Node A → [Token] → Node B → [Token] → Node C → [Token] → back to Node A

Two diagrams of a 4-node ring. Left: node 4 holds the TOKEN and multicasts Message1 out to nodes 1, 2, and 3 simultaneously. Right: the token has moved to node 1, which now multicasts Message2 to all other nodes.

📝 Another one from the Japanese version! Left: node 4 holds the token and multicasts Message1 to all nodes. Right: the token has passed to node 1, which now multicasts Message2. Only the token holder gets to send — everyone else just listens.
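You can see why this guarantees identical ordering with a toy simulation. This is not Totem's actual code — just a sketch where only the token holder may multicast, and every node logs messages in arrival order:

```python
from collections import deque

nodes = ["A", "B", "C"]
# Messages each node wants to send (e.g. pmxcfs config changes).
pending = {"A": deque(["msg-1"]), "B": deque(), "C": deque(["msg-2", "msg-3"])}
delivered = {n: [] for n in nodes}  # per-node receive log

# Circulate the token twice around the ring. Only the current
# token holder may multicast.
for _ in range(2):
    for holder in nodes:            # token moves A -> B -> C -> A ...
        while pending[holder]:
            msg = pending[holder].popleft()
            for n in nodes:         # multicast reaches everyone, sender included
                delivered[n].append(msg)

# Virtual Synchrony: every node saw the same messages in the same order.
print(delivered["A"])  # ['msg-1', 'msg-2', 'msg-3']
assert delivered["A"] == delivered["B"] == delivered["C"]
```

Since sends are serialized by token possession, no interleaving is possible — the logs can't disagree.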

ARU (All Received Up to)

The token carries a sequence number called aru — short for "All Received Up to." Think of it as a receipt: "Everyone in the ring has confirmed they got messages up to this point."

When the token completes a full loop and comes back with an updated ARU, the original sender knows with certainty: everyone got it.
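The bookkeeping is simple: as the token travels the ring, each node lowers the token's aru to the highest sequence number it has received without gaps. A toy version (variable names are mine, not Totem's):

```python
# Highest sequence number each node has received with no gaps.
highest_contiguous = {"A": 7, "B": 7, "C": 5}  # C missed message 6

def update_aru(token_aru, node):
    # A node can only vouch for what it has actually received.
    return min(token_aru, highest_contiguous[node])

aru = 7  # sender's optimistic view when it releases the token
for node in ["B", "C", "A"]:  # token travels the ring back to the sender
    aru = update_aru(aru, node)

print(aru)  # 5 -> the cluster is only confirmed up to message 5
```

When the returning aru is lower than expected, the sender knows someone is missing messages — which feeds directly into the retransmission step below.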

What Actually Happens When a Token Arrives (totemsrp.c)

Based on the Corosync source code (exec/totemsrp.c), here's the processing order:

  1. Receive token from the previous node
  2. Retransmit check — did I miss any messages? If so, request retransmission
  3. Multicast send — flush any pending messages (like pmxcfs config changes) out to the network
  4. Update & pass — increment the sequence number, hand the token to the next node
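The steps above can be condensed into a schematic handler. The real logic is C in exec/totemsrp.c and far more involved; this Python sketch (with a hypothetical broadcast helper) just shows the shape of it:

```python
ring_log = []  # what actually went out "on the wire", in order

def broadcast(msg, seq):
    # Hypothetical stand-in for a Totem multicast.
    ring_log.append((seq, msg))

def on_token(node, token):
    # 1. Token received from the previous node (we're called with it).
    # 2. Retransmit check: anything between what we have and the
    #    ring-wide sequence number carried by the token?
    token["retransmit"] = list(range(node["have"] + 1, token["seq"] + 1))

    # 3. Multicast send: flush pending messages, each consuming the
    #    next sequence number.
    while node["pending"]:
        token["seq"] += 1
        broadcast(node["pending"].pop(0), token["seq"])
        node["have"] = token["seq"]

    # 4. Update aru and hand the token to the next node in the ring.
    token["aru"] = min(token["aru"], node["have"])
    return token

token = {"seq": 3, "aru": 3, "retransmit": []}
node_b = {"have": 3, "pending": ["config change for 100.conf"]}
token = on_token(node_b, token)
print(token)  # seq has advanced to 4; the message went out at seq 4
```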

💾 How pmxcfs Syncs Data: The Journey of a Write

Okay, here's where it gets really interesting. What actually happens when you edit a VM config in the Proxmox UI?

Step 1: Write Request from Application

A process like pvedaemon writes to /etc/pve/qemu-server/100.conf. This gets intercepted by FUSE and handed off to the pmxcfs process.

Step 2: CPG Broadcast via Corosync

pmxcfs bundles the change as a transaction and sends it through Corosync's CPG (Closed Process Group) API — essentially asking Corosync to deliver this to every node in the cluster.

The data sits in Corosync's send buffer, waiting for the token to come around.

Step 3: Receive and Immediately Persist ← This is the critical part

When each node's pmxcfs receives the transaction (including the original sender!), it does two things:

1. Update in-memory SQLite DB

The change is applied to the node's in-memory database instantly.

2. fsync() to disk

This is the big one. pmxcfs immediately calls fsync() on the backing SQLite file to flush it to the physical disk.

fsync() blocks until the OS confirms the data has been physically written to storage. No faking it.
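You can feel what that blocking means with a few lines of Python — this isn't pmxcfs code, just the same os.fsync() primitive it ultimately relies on:

```python
import os
import time

path = "/tmp/fsync-demo.conf"
fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC)
os.write(fd, b"memory: 2048\n")   # sits in the OS page cache so far

start = time.monotonic()
os.fsync(fd)                      # blocks until the kernel reports the
elapsed = time.monotonic() - start  # data is on stable storage
os.close(fd)

print(f"fsync took {elapsed * 1000:.2f} ms")
```

On a fast SSD that call returns in well under a millisecond; on a cheap USB stick it can take tens of milliseconds — and pmxcfs pays that cost on every config change, on every node.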

Step 4: Transaction Committed

Once every node's fsync() completes and the token comes back with an updated ARU, that transaction is officially committed cluster-wide. Consistency guaranteed. ✅


⚡ Why Your System Disk I/O Matters More Than You Think

Now the architectural picture should make the consequences clear: Proxmox's config sync waits for every node to finish writing to disk.

The Domino Effect of Slow I/O

Slow fsync on Node C
  → pmxcfs on Node C is blocked
    → Corosync process stalls
      → Token circulation delayed
        → Timeout triggered
          → Node declared dead 💀

All because of a slow disk write. That's how tightly coupled these components are.

⚠️ Homelab Warning: Watch Your System Disk!

This is particularly nasty for common homelab setups:

  • Cheap USB sticks as boot media
  • Old spinning HDDs for the OS
  • Network-attached storage running the system

VM/container storage being slow? Usually fine. But Proxmox's own system disk being slow? That can destabilize your entire cluster.

💡 You can measure this yourself! Proxmox ships with a built-in benchmark tool called pveperf. Run it and check the fsync/s number. In my own testing: a USB stick scored 30–50 fsync/s, while an SSD hit 3,000+. That's nearly a 100x difference!
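If you want a rough idea of that number without pveperf, here's a crude Python approximation — it just counts how many small write+fsync cycles the disk completes in one second. It's not the same benchmark pveperf runs, so expect the absolute numbers to differ:

```python
import os
import time

# Rough stand-in for pveperf's fsync/s figure: count write+fsync
# cycles completed in one second on the disk holding this file.
path = "/tmp/fsync-bench.dat"
fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC)

count = 0
deadline = time.monotonic() + 1.0
while time.monotonic() < deadline:
    os.write(fd, b"x" * 512)
    os.fsync(fd)
    count += 1

os.close(fd)
os.unlink(path)
print(f"{count} fsync/s")
```

Run it once against your boot disk and once against your VM storage — the gap between the two is often eye-opening.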

What to Actually Use

| Storage Type | Homelab OK? | Notes |
| --- | --- | --- |
| NVMe / SATA SSD | ✅ Great | Ideal for the system disk |
| Enterprise SSD (with PLP) | ✅ Best | Power-loss protection = extra safety |
| 2.5" HDD | ⚠️ Okay-ish | Watch for latency spikes |
| USB stick | ❌ Avoid | Way too slow for fsync |
| SD card | ❌ Avoid | Same problem, often worse |

📋 Summary

Here's the full picture of what we covered:

| Component | Role |
| --- | --- |
| Corosync / Totem | Token-passing ring, message ordering, membership |
| ARU | Confirms all nodes received each message |
| pmxcfs | In-memory SQLite DB, FUSE-mounted as /etc/pve |
| fsync() | Blocks until data hits physical disk on every node |
| System disk I/O | Directly impacts cluster stability |

The Practical Takeaway

When choosing hardware for a Proxmox cluster, most people think: CPU → RAM → Network → Storage. But for cluster stability, you should actually be thinking about fsync latency early in your planning.

Even in a homelab, using a fast SSD for the system disk (not just VM storage) will make your cluster dramatically more stable.

Pair this knowledge with the timeout tuning from the last post, and you'll have a much more resilient setup!


📚 References

The Totem Single-Ring Ordering and Membership Protocol (paper)

Corosync Source Code (exec/totemsrp.c, lib/cpg.c)

Proxmox VE Docs — Cluster Network


If you found this useful, drop a ❤️! And if you spot anything I got wrong, please call it out in the comments — I'm still learning and corrections are very welcome 🙏
