Darian Vance

Originally published at wp.me

Solved: Pacemaker/DRBD: Auto-failback kills active DRBD Sync Primary to Secondary. How to prevent this?

🚀 Executive Summary

TL;DR: Pacemaker’s default auto-failback behavior can disrupt an active DRBD Primary by attempting premature promotion on a recovering node, leading to service outages and potential data risks. This can be prevented by configuring high resource stickiness, implementing manual failback, or carefully setting up graceful, delayed promotion backed by robust STONITH.

🎯 Key Takeaways

  • Setting a high resource-stickiness value (e.g., 10000 or INFINITY) on the DRBD promotable (master/slave) clone resource reliably prevents automatic failback, ensuring resources remain on the current primary until manually moved.
  • Manual failback strategies, such as placing a recovering node into standby or using location constraints to assign negative scores for the Promoted role, provide complete administrative control over when DRBD resources are promoted.
  • Achieving graceful, delayed promotion requires robust STONITH, a cluster-delay that reflects real network latency, and generous demote/stop operation timeouts to ensure the old primary safely demotes before a new one promotes.

Pacemaker/DRBD clusters, while providing high availability, can sometimes exhibit problematic auto-failback behavior where a recovering node attempts to re-assert its primary role, leading to resource conflicts and service disruption. Learn how to prevent these “kill” scenarios and ensure graceful failovers.

Introduction: The Peril of Premature Failback

High-availability clusters built with Pacemaker and DRBD are critical components in modern infrastructure, ensuring services remain operational even during node failures. DRBD provides block-device replication, while Pacemaker orchestrates resources, including DRBD, across nodes. A common challenge arises when a failed node recovers: Pacemaker’s default behavior often prioritizes resource locality, attempting to “failback” resources to their preferred node.

In a DRBD context, this can be disastrous. If the recovering node tries to promote its DRBD resource to Primary while another node is already actively serving as Primary/UpToDate, it creates a split-brain scenario or, more commonly, a forceful demotion/kill of the active primary, leading to service outages, data corruption risks, and general cluster instability. This post details why this happens and provides robust solutions to prevent it.

Symptoms: What Does Uncontrolled Auto-Failback Look Like?

When Pacemaker attempts a premature or uncontrolled failback, several symptoms can indicate the issue:

  • Service Outages: Applications running on the DRBD resource unexpectedly stop or become unresponsive on the currently active primary node.
  • DRBD Status Changes: You might observe the active DRBD Primary resource suddenly transitioning to Secondary, Unknown, or a connection state indicating a conflict (e.g., WFConnection, StandAlone).
  • Pacemaker Log Entries: The Pacemaker logs (e.g., /var/log/pacemaker/pacemaker.log or system journal) will show attempts to promote the DRBD resource on the recovering node, often followed by demotion attempts on the currently active node or fencing actions. Look for messages related to drbd_promote, drbd_demote, or conflicts.
# Example Pacemaker log snippet indicating a problem
Sep 20 10:35:01 node-a pacemakerd[12345]: info: Status: Requesting promote of drbd_res on node-a
Sep 20 10:35:01 node-a pacemakerd[12345]: crit: Result: promote_drbd_res_on_node-a: CIB_R_ERR_OP_FAILED
Sep 20 10:35:01 node-b pacemakerd[12345]: info: Status: Requesting demote of drbd_res on node-b
Sep 20 10:35:01 node-b pacemakerd[12345]: info: drbd_demote: stdout [drbd_demote: Attempting to demote resource 'r0']
Sep 20 10:35:02 node-b pacemakerd[12345]: warn: drbd_demote: stderr [drbd_demote: Cannot demote 'r0', it is still in use.]
Sep 20 10:35:02 node-b pacemakerd[12345]: crit: Result: demote_drbd_res_on_node-b: CIB_R_ERR_OP_FAILED
  • drbd-overview Output: Running drbd-overview (DRBD 8.4; on DRBD 9, drbdadm status gives the equivalent view) will show the DRBD resource status. During an issue, you might see unexpected roles or connection states.
# Example drbd-overview output during conflict
0:r0   Connected Primary/Primary UpToDate/UpToDate
       [WARNING: This indicates split-brain in a two-node cluster, which Pacemaker should prevent]
       [More likely, you'll see a quick flip or errors.]
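To see the conflict from both sides, compare what Pacemaker believes with what DRBD itself reports on each node; a minimal check, using this article’s example names:

# Pacemaker's view: which node currently holds the Promoted (Primary) instance?
pcs status --full | grep -A 3 drbd_r0

# DRBD's own view, run on each node:
drbdadm role r0      # Primary or Secondary
drbdadm cstate r0    # Connected, StandAlone, WFConnection, ...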

Ideally, Pacemaker, especially with fencing (STONITH) enabled, should prevent true split-brain where both nodes are Primary. However, the aggressive failback can lead to a race condition where the recovering node attempts promotion before the active node can be safely demoted, or before the cluster has a clear picture of the state, causing the active primary to be forcefully taken down or experience severe I/O issues.

Root Cause Analysis: Why Auto-Failback Kills DRBD Primary

The core of the problem lies in Pacemaker’s default resource management behavior and its interaction with DRBD’s stateful nature:

  1. Resource Locality Preference: Pacemaker often tries to keep resources on their preferred nodes. When a node recovers, Pacemaker sees it as a suitable candidate for hosting resources again.
  2. DRBD Primary Requirement: For most applications, a DRBD resource must be in the “Primary” role to be mounted and serve data. Only one node can be Primary at a time in a two-node synchronous DRBD setup (Protocol C).
  3. Premature Promotion Attempt: Upon node recovery, Pacemaker evaluates resource placement. If the recovering node is its ‘preferred’ location (e.g., due to configuration default or historical reasons), Pacemaker might attempt to promote the DRBD resource to Primary *immediately*.
  4. Conflict with Active Primary: If another node is currently acting as the DRBD Primary, this immediate promotion attempt by the recovering node will either:
    • Fail (if the DRBD resource agent is robust enough to detect another primary).
    • Lead to a race condition where both nodes briefly believe they should be primary.
    • Trigger DRBD’s internal mechanisms to resolve the conflict (e.g., automatic demotion of one, or fencing if configured), which can be disruptive.
    • Most dangerously, in a poorly configured cluster, it can cause I/O disruption on the existing primary, leading to application failure. The “kill” happens when the active node is forced out of its primary role due to this conflict, often leading to ungraceful shutdown of services.
  5. Lack of Graceful Demotion: Pacemaker might not have enough time or a clear mandate to gracefully demote the currently active primary *before* the recovering node tries to assert its primary role. This is exacerbated if fencing (STONITH) is not robustly configured or is too slow, and if DRBD’s own resource-level fencing handlers are not in place (see the sketch after this list).
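Those DRBD-level fencing handlers are a useful complement to everything that follows: the crm-fence-peer scripts shipped with drbd-utils add a temporary -INFINITY promotion constraint for a peer whose data is outdated and remove it again after resync. The snippet below is a minimal sketch in DRBD 8.4 syntax (on DRBD 9 the fencing option moves to the net section and the scripts are named crm-fence-peer.9.sh / crm-unfence-peer.9.sh); check the drbd.conf man page for your version before copying it.

# /etc/drbd.d/r0.res  (DRBD 8.4-style sketch, not a drop-in config)
resource r0 {
  disk {
    fencing resource-only;                        # or resource-and-stonith
  }
  handlers {
    # Places a constraint forbidding promotion of the outdated peer:
    fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
    # Removes that constraint once the peer has resynced:
    after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
  }
}

With these handlers in place, DRBD itself refuses to let Pacemaker promote a node whose data is not UpToDate, independently of the score tuning discussed below.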

Solution 1: Preventing Automatic Failback with Resource Stickiness and Location Constraints

This is arguably the most common and robust solution. It tells Pacemaker to avoid moving resources back to a node once they’ve failed away from it, effectively disabling automatic failback for DRBD resources.

Mechanism

resource-stickiness is a bonus score awarded to the node currently running a resource. By setting a high resource-stickiness value on the DRBD promotable clone, you make it “sticky” to its current location; a very large positive value (or INFINITY) ensures it stays put even when its originally preferred node returns. You can further reinforce this with a location constraint that prefers the currently active node or forbids promotion on the recovering node.

Configuration Example

First, define your DRBD master/slave resource. Let’s assume your resource is named drbd_r0 and your filesystem/application resource is fs_data.

# Define your DRBD resource as a promotable (master/slave) clone (example)
# Two monitor intervals let Pacemaker watch both roles; on Pacemaker/pcs older
# than 2.1/0.11, use role="Master"/"Slave" and master-max/master-node-max.
pcs resource create drbd_r0 ocf:linbit:drbd \
    drbd_resource=r0 \
    op monitor interval="29s" role="Promoted" \
    op monitor interval="31s" role="Unpromoted" \
    op start timeout="240s" op stop timeout="100s" \
    op promote timeout="90s" op demote timeout="90s" \
    promotable promoted-max=1 promoted-node-max=1 clone-max=2 clone-node-max=1 notify=true

# Add a high resource-stickiness to prevent automatic failback.
# This tells Pacemaker: "Don't move this resource back unless explicitly told to."
pcs resource meta drbd_r0-clone resource-stickiness=10000

# Create a filesystem resource that depends on drbd_r0 being primary
pcs resource create fs_data ocf:heartbeat:Filesystem \
    device="/dev/drbd/by-res/r0" directory="/mnt/data" fstype="ext4" \
    op monitor interval="30s"

# Ensure fs_data runs only on the node where drbd_r0 is promoted
# (on older pcs releases, use "with master" instead of "with Promoted")
pcs constraint colocation add fs_data with Promoted drbd_r0-clone INFINITY

# Ensure fs_data starts after drbd_r0 is promoted
pcs constraint order promote drbd_r0-clone then start fs_data

The key here is pcs resource meta drbd_r0-clone resource-stickiness=10000. This high stickiness score outweighs any location preference for the original node, so if the resource fails over to node-b it will stay on node-b even after node-a recovers, unless it is manually moved.
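When you do decide to fail back, the move is explicit. A minimal sketch, assuming node-a has fully resynced (the role flag differs between pcs releases, as noted in the comments):

# Move the Promoted role back to node-a once it is healthy and UpToDate:
pcs resource move drbd_r0-clone node-a --promoted   # '--master' on older pcs releases

# Some pcs versions implement 'move' by adding a temporary location constraint;
# clear it afterwards so it does not pin the resource forever:
pcs resource clear drbd_r0-clone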

Pros and Cons

Pros:
  • Highly predictable and reliable.
  • Prevents split-brain scenarios caused by aggressive auto-failback.
  • Simplifies troubleshooting by eliminating one potential source of resource flapping.

Cons:
  • Requires manual intervention (pcs resource move) to fail the resource back to the original primary node once it has recovered.
  • Increased downtime if manual intervention is slow after a node recovers and you want to return to the preferred node.
  • Might leave resources on less-preferred nodes for extended periods.

Solution 2: Implementing Manual Failback with Administrative Confirmation

This solution ensures that a returning node never automatically promotes its DRBD resource without explicit administrative approval. It effectively puts the recovering node in a “waiting room” until deemed safe to promote.

Mechanism

This approach keeps the recovering node from running, or at least from promoting, the DRBD resource until an administrator decides otherwise. The two practical tools are putting the node into standby (so Pacemaker schedules nothing on it) and role-specific location constraints that forbid the Promoted role on that node.

Configuration Example

Assuming the previous DRBD clone setup:

  1. Place the recovering node into standby: When a node comes back online, Pacemaker will detect it. You can immediately put it into standby before it has a chance to take over resources.
   # On the administrative workstation or another node
   pcs node standby <recovering_node_name>

This will prevent Pacemaker from trying to run any resources on <recovering_node_name>. Once you’ve verified the node’s health and are ready to consider a failback (which would still be manual via pcs resource move), you would bring it out of standby:

   pcs node unstandby <recovering_node_name>
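The “verify the node’s health” step above might look like this in practice (a minimal sketch; r0 is this article’s example resource, and on DRBD 9 the same information is available from drbdadm status):

   # Cluster view: is the node online, and are any resource failures recorded?
   pcs status --full

   # DRBD view on the recovering node: data in sync and peer connected?
   drbdadm role r0       # Secondary is expected while it waits to be promoted
   drbdadm dstate r0     # UpToDate/UpToDate once resync has finished
   drbdadm cstate r0     # Connected, not StandAlone or WFConnection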
  2. Using a location constraint to prevent promotion on recovery: While Solution 1 uses resource-stickiness, you can also create role-specific location rules that assign a very low score to the recovering node for the Promoted role of your DRBD resource.
   # Assume node-a is preferred for drbd_r0.
   # If node-a fails and drbd_r0 moves to node-b, when node-a recovers,
   # we want to prevent it from automatically promoting drbd_r0.

   # Role-specific location rules: give the Promoted role a positive score on the
   # node currently acting as primary and -INFINITY on the recovering node.
   # (On Pacemaker/pcs older than 2.1/0.11, use role=Master instead of role=Promoted.)
   pcs constraint location drbd_r0-clone rule role=Promoted score=100 \#uname eq <other_node_name>
   pcs constraint location drbd_r0-clone rule role=Promoted score=-INFINITY \#uname eq <recovering_node_name>

   # More simply, combine with Solution 1's stickiness:
   # pcs resource meta drbd_r0-clone resource-stickiness=10000
   # This, together with the current state, keeps the resource on the current primary.

When the previously failed node comes back up, Pacemaker will start the DRBD clone instance on it in the Secondary role, but it won’t promote it to Primary because of the -INFINITY score for that role. An administrator then explicitly lifts the restriction or moves the resource, as sketched below.
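For the constraint-based variant, “explicitly moving” usually means removing or overriding the promotion ban once the node is trusted again. A sketch, assuming the rules created above (constraint IDs differ per cluster, so list them first):

# List constraints together with their IDs:
pcs constraint --full

# Remove the -INFINITY promotion ban for the recovered node (ID is a placeholder):
pcs constraint delete <constraint_id>

# Or move the Promoted role explicitly, as in Solution 1:
# pcs resource move drbd_r0-clone <recovering_node_name> --promoted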

Pros and Cons

Pros:
  • Provides complete administrative control over failback.
  • Minimizes the risk of unintentional primary conflicts.
  • Guarantees verification of node health before resources are promoted.

Cons:
  • Requires continuous monitoring and manual intervention after a node recovers.
  • Potentially longer downtime for failback operations, as human interaction is needed.
  • Less “automatic” in a high-availability context.

Solution 3: Configuring Pacemaker for Graceful & Delayed Promotion

Instead of completely preventing failback, this approach focuses on making the failback process inherently safer by giving Pacemaker ample time and clear instructions to demote the old primary *before* promoting a new one, thereby preventing the “kill” scenario.

Mechanism

This solution leverages Pacemaker cluster properties and per-operation settings to ensure a sequential, controlled transition of the DRBD primary role. Key elements include robust fencing (STONITH), a cluster-delay that reflects real network latency, and carefully configured timeouts for the demote and stop actions.

Configuration Example

  1. Ensure Robust Fencing (STONITH): This is paramount. If Pacemaker cannot reliably fence a failed node, no failback strategy is truly safe.
   pcs property set stonith-enabled=true
   pcs property set no-quorum-policy=stop # Or 'freeze' depending on requirements
   # Ensure you have a working STONITH device configured (IPMI/BMC fencing example)
   pcs stonith create fence_ipmi_node1 fence_ipmilan ip=192.168.1.10 lanplus=1 pcmk_host_list=node-a \
       username=admin password=password op monitor interval=60s
   pcs stonith create fence_ipmi_node2 fence_ipmilan ip=192.168.1.11 lanplus=1 pcmk_host_list=node-b \
       username=admin password=password op monitor interval=60s
  2. Tune cluster-delay: This property tells Pacemaker how much network round-trip delay to allow for when waiting on action results from other nodes, so slow messaging is not mistaken for a failure.
   pcs property set cluster-delay=60s

cluster-delay is the estimated maximum round-trip delay over the network (excluding the time actions take to execute), and 60s is in fact the default. Increase it only if your network genuinely needs it, and be aware that larger values make Pacemaker wait longer for remote action results, which lengthens failover times.

  3. Configure generous demote and stop timeouts for the DRBD resource:

The demote operation’s timeout is the maximum time Pacemaker will wait for the old primary to step down. A generous value gives the old primary time to demote cleanly before anything else is attempted.

If the demote or stop fails outright, the per-operation on-fail attribute governs recovery. With STONITH enabled, a failed stop defaults to fencing the node, which lets the surviving node promote safely instead of leaving the resource in limbo; that default is worth keeping.

   # Operation timeouts belong to the DRBD primitive, not the clone wrapper:
   pcs resource update drbd_r0 \
       op demote interval="0s" timeout="120s" \
       op stop interval="0s" timeout="120s"

Note: when Pacemaker needs to stop a currently promoted clone instance, it demotes it first, so both the demote and stop timeouts come into play. If you need different handling for a transient demote failure, set on-fail on the demote operation itself and test the behaviour carefully rather than weakening the stop defaults.
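To confirm what the cluster actually ended up with, dump the effective configuration (pcs resource config exists on pcs 0.10 and later; older releases use pcs resource show --full):

   # Show the DRBD primitive, its operations, and the clone meta-attributes:
   pcs resource config drbd_r0-clone

   # Cluster-wide properties such as stonith-enabled and cluster-delay:
   pcs property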

  4. Balance resource-stickiness against a location preference: If you want *some* level of automatic failback to a preferred node, give that node a location score for the Promoted role that is higher than the resource-stickiness value; the resource then returns once the preferred node rejoins and its data is fit to be promoted. This still relies heavily on STONITH and the timeouts above to ensure the *current* primary is demoted *before* the recovering node is promoted. A sketch with assumed scores is shown below.
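A minimal sketch with assumed values (a stickiness of 100 and a node-a preference of 200; how stickiness interacts with the DRBD agent’s own promotion scores is subtle, so treat this as a starting point and test it in a lab):

   # Moderate stickiness: resources resist moving, but strong preferences can still win.
   pcs resource meta drbd_r0-clone resource-stickiness=100

   # A stronger, role-specific preference pulls the Promoted role back to node-a
   # once node-a is online and its DRBD data is UpToDate:
   pcs constraint location drbd_r0-clone rule role=Promoted score=200 \#uname eq node-a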

Pros and Cons

Pros:
  • Enables a more “automatic” failback while mitigating the “kill” scenario.
  • Optimizes for shorter downtime than manual failback, if successful.
  • Leverages Pacemaker’s native recovery mechanisms more fully.

Cons:
  • Requires extremely robust and well-tested STONITH; without it, this approach is dangerous.
  • Can lead to longer failover times due to generous operation timeouts (and a large cluster-delay, if you raise it).
  • Complex to configure and troubleshoot; misconfigurations can still cause outages.

Conclusion: Choosing the Right Strategy

Preventing Pacemaker’s auto-failback from “killing” your active DRBD primary is crucial for cluster stability. The best solution depends on your operational requirements and risk tolerance:

  • If predictability and absolute prevention of resource flapping are paramount, and you’re comfortable with manual intervention, Solution 1 (High Resource Stickiness) is your safest bet.
  • If you need granular control and human oversight before resources return to a recovering node, Solution 2 (Manual Failback) provides that assurance.
  • If you desire a more automated failback but need to ensure it’s handled gracefully with minimal disruption, Solution 3 (Graceful & Delayed Promotion) can work, but it demands meticulous STONITH configuration and extensive testing to be truly reliable.

Regardless of the chosen solution, always ensure your Pacemaker cluster has a properly configured and tested STONITH (fencing) mechanism. Fencing is the last line of defense against data corruption and split-brain scenarios, making any failover strategy significantly safer.
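A quick way to validate fencing before relying on any of these strategies (this really power-cycles the target, so run it in a maintenance window; node names follow the examples above):

# Fence devices appear alongside other resources in the status output:
pcs status

# Deliberately fence the standby node and confirm it is power-cycled and rejoins:
pcs stonith fence node-b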



👉 Read the original article on TechResolve.blog
