DEV Community

Konrad Kądzielawa


Post-Mortem: Diagnosing and Fixing Ceph Quorum Breakdown (Time Drift) in a Hybrid Proxmox VE Cluster

Key Takeaways & Conclusions:

  • Root Cause: A Time Drift between the physical Proxmox host and nested/external virtual machines exceeded Ceph's strict 0.05s tolerance limit.
  • Impact: The time mismatch caused a continuous authorization loop (electing/probing) in the local Ceph Monitor (MON), corrupting its local RocksDB database and triggering a cascading failure of the dependent OSD.0 daemon.
  • Resolution: Enforced rigorous time synchronization via Chrony and QEMU Guest Agent, followed by a destructive purge and rebuild of the corrupted MON daemon using the surviving quorum.
  • Verification: Chaos Engineering tests (rados bench) confirmed that with a min_size = 2 replication policy, the cluster successfully survives a hard node power-off without dropping client I/O.

1. The Architecture and The Incident

Building a Ceph cluster typically requires dedicated bare-metal servers. My home lab topology, however, is a hybrid 3-node setup designed to simulate a full cluster using limited hardware:

  • admin (Physical Proxmox host, IP: 192.168.0.250)
  • admin-02 (Nested VM 103 on the main host, IP: 192.168.0.251)
  • admin-03 (External VM running on a separate Ubuntu desktop, IP: 192.168.0.252)

The incident began when the cluster state dropped to HEALTH_WARN. Despite the underlying Corosync cluster and network layer (ICMP) functioning perfectly across the 192.168.0.0/24 subnet, the main physical node (admin) dropped out of the Ceph quorum:

root@admin:~# ceph -s
  cluster:
    id:     25f2d0a2-d64b-42a2-b93d-99a560571e0e
    health: HEALTH_WARN
            1/3 mons down, quorum admin-02,admin-03
            1 osds down
            1 host (1 osds) down

2. Debugging & Log Analysis

To find the root cause, I inspected the systemd journals for the failed Object Storage Daemon (osd.0) located on the physical admin host.

root@admin:~# journalctl -u ceph-osd@0.service -n 50 --no-pager
Mar 25 21:08:21 admin ceph-osd[3147]: failed to fetch mon config (--no-mon-config to skip)
Mar 25 21:08:21 admin systemd[1]: ceph-osd@0.service: Main process exited, code=exited, status=1/FAILURE

The OSD drive itself was physically healthy, but it crashed on startup because it could not retrieve the Cluster Map from the local Monitor (ceph-mon).

Next, I checked the local Monitor service on the admin node:

root@admin:~# journalctl -u ceph-mon@admin.service -n 30 --no-pager
Mar 25 21:49:55 admin ceph-mon[1377]: 2026-03-25T21:49:55.526+0100 7c00b3b756c0 -1 mon.admin@0(probing) e3 get_health_metrics reporting 1 slow ops, oldest is auth(proto 0 30 bytes epoch 0)
Mar 25 21:50:50 admin ceph-mon[1377]: 2026-03-25T21:50:50.528+0100 7c00b3b756c0 -1 mon.admin@0(electing) e3 get_health_metrics reporting 1 slow ops, oldest is auth(proto 0 30 bytes epoch 0)

The logs revealed that the local MON was stuck in an endless electing/probing loop, with authentication requests stalling as slow ops. According to Ceph's clock synchronization documentation, monitor clocks must agree within 0.05 seconds (the `mon_clock_drift_allowed` default). Because admin-02 and admin-03 are virtual machines, their virtualized clocks had drifted away from the physical host's hardware clock, locking the admin node out of the quorum and, over time, corrupting the local monitor's RocksDB store.
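Before digging into daemon internals, the skew itself can be measured directly in the shell. A minimal sketch that compares two epoch timestamps against the 0.05 s default (the ssh line is illustrative and commented out; on a healthy cluster `ceph time-sync-status` reports per-monitor skew):

```shell
#!/bin/sh
# Compare two epoch timestamps (seconds, possibly fractional) against
# Ceph's default mon_clock_drift_allowed of 0.05 s.
check_drift() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        d = a - b; if (d < 0) d = -d
        if (d > 0.05) { printf "DRIFT %.3f s\n", d; exit 1 }
        printf "OK %.3f s\n", d
    }'
}

# On a live cluster, one would compare the local clock with a peer, e.g.:
#   check_drift "$(date +%s.%N)" "$(ssh admin-02 date +%s.%N)"
check_drift 1742938101.480 1742938101.500   # 20 ms apart: within tolerance
```

The function exits non-zero when the drift exceeds the tolerance, so it slots into monitoring scripts.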

3. The Action Plan: Fixing Time Drift & Rebuilding

To stabilize the cluster, I had to enforce strict timekeeping on the virtual nodes and reconstruct the corrupted Monitor.

Table: Remediation Steps per Node

| Node | Ceph Role | Root Cause | Implemented Solution |
|---|---|---|---|
| admin (physical) | MON, OSD.0 | Corrupted RocksDB | Destroyed and recreated the local MON daemon |
| admin-02 (VM 103) | MON, OSD.1 | Hypervisor time drift | Enabled the QEMU Guest Agent for host time sync |
| admin-03 (ext. VM) | MON, OSD.2 | OS-level time drift | Deployed chrony to enforce strict NTP sync |

Step 1: Enforcing Strict NTP on the External VM (admin-03)

Standard systemd-timesyncd only slews the clock gradually and offers no easy way to force an immediate correction, which is too sluggish for Ceph's 0.05 s tolerance. I replaced it with chrony and forced an immediate clock step on the external Ubuntu VM:

root@admin-03:~# sudo apt-get update && sudo apt-get install chrony -y
root@admin-03:~# sudo chronyc makestep
root@admin-03:~# chronyc tracking
Reference ID    : C292FB72 (host-194-146-251-114.virtuaoperator.pl)
Stratum         : 2
System time     : 0.000000000 seconds slow of NTP time

The output confirmed a 0.000000000 seconds deviation.
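A `chronyc makestep` on the command line is a one-off correction. To keep the behaviour persistent across reboots, the same policy can be pinned in `/etc/chrony/chrony.conf` — a minimal fragment (the pool line is illustrative; Ubuntu ships its own defaults):

```
# /etc/chrony/chrony.conf (fragment)
pool ntp.ubuntu.com iburst maxsources 4

# Step the clock instead of slewing if the offset exceeds 1 s,
# but only during the first 3 clock updates after startup.
makestep 1.0 3
```

After editing, restart the service (`systemctl restart chrony`) and re-check with `chronyc tracking`.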

Step 2: Enabling QEMU Guest Agent (admin-02)

For the nested Proxmox node, time had to be injected directly from the physical hypervisor. I enabled the QEMU Guest Agent flag, restarted VM 103 (a full shutdown and start, so the agent device gets attached), and verified the communication channel:

# Executed on the physical host (admin)
root@admin:~# qm set 103 -agent 1
root@admin:~# qm shutdown 103
root@admin:~# qm start 103
root@admin:~# qm agent 103 ping
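The agent also lets the host query the guest clock directly: `qm agent 103 get-time` should return the guest's time in nanoseconds since the epoch, so the host/guest offset reduces to a small helper (a sketch; the sample values at the bottom are illustrative, not captured from the cluster):

```shell
#!/bin/sh
# ns_offset GUEST_NS HOST_NS -> absolute host/guest offset in seconds
ns_offset() {
    awk -v g="$1" -v h="$2" 'BEGIN {
        d = (g - h) / 1e9; if (d < 0) d = -d
        printf "%.3f\n", d
    }'
}

# On the Proxmox host this would be:
#   ns_offset "$(qm agent 103 get-time)" "$(date +%s%N)"
ns_offset 1742938101500000000 1742938101450000000   # inputs 50 ms apart
```

Anything printed above 0.050 means the guest would be flagged by Ceph's clock-skew health check.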

Step 3: Reconstructing the Corrupted Monitor (admin)

With the time drift neutralized, I had to fix the broken ceph-mon daemon on the physical host. Attempting to repair a broken RocksDB is often futile. Since the other two nodes maintained a healthy quorum and an intact Cluster Map, I simply purged the broken node and recreated it:

root@admin:~# pveceph mon destroy admin
root@admin:~# pveceph mon create
root@admin:~# systemctl restart ceph-osd@0.service

Immediately after the restart, osd.0 successfully authenticated with the fresh Monitor, and the cluster began the peering process, eventually returning to HEALTH_OK.
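A destructive `mon destroy` is only safe while the remaining monitors hold quorum, so it's worth confirming quorum membership both before and after such a rebuild. `ceph quorum_status --format json` emits the member list; a jq-free sketch that counts quorum members from a captured sample (the JSON below is a trimmed illustration, not full output):

```shell
#!/bin/sh
# On a live cluster: status=$(ceph quorum_status --format json)
status='{"quorum_names":["admin-02","admin-03"],"quorum_leader_name":"admin-02"}'

# Extract the quorum_names array and count its members.
names=$(printf '%s' "$status" \
    | sed 's/.*"quorum_names":\[\([^]]*\)\].*/\1/')
count=$(printf '%s\n' "$names" | tr ',' '\n' | grep -c '"')
echo "mons in quorum: $count ($names)"
```

With only two of three mons in quorum, the cluster can still elect a leader; dropping to one would freeze it, which is exactly why the rebuild had to happen from the surviving pair.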

4. Verification: Chaos Engineering

You can't trust a storage cluster until you break it while it's under load. To prove the fix worked, I initiated a continuous benchmark write on the ceph-storage pool:

root@admin:~# rados bench -p ceph-storage 120 write --no-cleanup

While the benchmark was running at ~55 MB/s, I abruptly powered off the external admin-03 node with sudo systemctl poweroff.

Here is what happened in the benchmark logs:

  38       16       419       403   42.4168         0            -     1.12533
  39       16       419       403   41.3292         0            -     1.12533
2026-03-25T22:31:25.378550+0100 min lat: 0.334379 max lat: 3.14551 avg lat: 1.12533
  40       16       419       403   40.2959         0            -     1.12533
  41       16       419       403   39.3131         0            -     1.12533
  42       16       439       423   40.2816         8     0.399663      1.5659
  43       16       470       454   42.2282       124     0.593991      1.5049
  44       16       489       473   42.9956        76     0.755278     1.47177

The Result:

  1. Seconds 38-41: The moment the node died, the cluster momentarily paused client I/O. Throughput dropped to 0 MB/s, and latency spiked.
  2. Second 42+: The CRUSH map recalculated the topology. Because the pool was configured with min_size = 2, the two surviving nodes authorized the writes. The cluster automatically resumed operations, quickly catching up and hitting speeds of 124 MB/s.
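The recovery behaviour follows directly from the replication policy: with size = 3 and min_size = 2, a placement group keeps accepting I/O as long as at least two replicas are up. A toy predicate capturing that rule (assumes this pool's settings; `ceph osd pool get ceph-storage min_size` shows the live value):

```shell
#!/bin/sh
# writes_ok UP_REPLICAS -> does a PG still accept I/O? (size=3, min_size=2)
writes_ok() {
    if [ "$1" -ge 2 ]; then echo "yes"; else echo "no (I/O pauses)"; fi
}

writes_ok 3   # all nodes healthy
writes_ok 2   # one node down: the chaos-test case, writes continue
writes_ok 1   # below min_size: PGs go inactive until recovery
```

Had the pool been set to min_size = 1, writes would have continued on a single replica, at the cost of risking data loss during a second failure.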

Once the test concluded, I cleaned up the benchmark data:

root@admin:~# rados -p ceph-storage cleanup

Conclusion: Time synchronization is the lifeblood of Ceph. Never trust default virtual machine clocks in a production or simulated lab environment. With chrony and QEMU guest-agent time sync in place, the cluster now survives single-node failures without dropping client I/O.
