Michael

Posted on May 20 • Originally published at gbase.cn

Building a Rock‑Solid HA Strategy for GBase 8a: From Cluster Architecture and Failover to Consistency Repair

#gbase #database #数据库 #consistency

What separates a reliable GBase 8a deployment from a fragile one isn't how fast a single query runs — it's whether the system keeps serving after a failure, how quickly it recovers, and whether the data stays consistent afterward. This is where high availability (HA) design earns its keep.

GBase 8a, as a distributed analytical database, structures its HA into three layers: cluster‑level, node‑level, and process‑level. Cluster‑level HA relies on data sync tools and mirror clusters; node‑level HA revolves around Gcluster, Gnode, and Gcware nodes; process‑level HA depends on real‑time monitoring and auto‑recovery of core services.

1. The Three HA Layers

Layer	Primary Goal	Core Capability	Best For
Cluster‑level	Survive entire‑cluster failure	Inter‑cluster sync, mirror clusters	Remote DR, intra‑city active‑active, read/write split
Node‑level	Survive single‑node failure	Gcluster Failover, Gnode multi‑replica, Gcware Raft	Server crashes, partial node anomalies
Process‑level	Survive service‑process crash	Process monitoring, auto‑recovery	Transient faults, self‑healing

This layering directly determines your strategy. If your concern is "a machine goes down but business must continue," focus on node‑level HA. If it's "a whole data center fails and we must switch to another," that's cluster‑level HA.

2. Cluster‑Level HA: Disaster Recovery vs. Active‑Active

GBase 8a offers two paths at the cluster level: inter‑cluster sync and mirror clusters.

Approach	Sync Mode	Typical Scenario	Characteristic
Inter‑cluster Sync	Incremental	Remote DR, T+1 reporting, cascading sync	Async, DR‑oriented
Mirror Cluster	Real‑time	Intra‑city active‑active, failover, read/write split	Real‑time, business‑continuity‑oriented

The data sync tool performs incremental sync between two homogeneous GBase 8a clusters at the data‑block level rather than through traditional log replay, which is far more efficient for massive data volumes. Mirror clusters synchronize data in real time: once the primary cluster accepts a write, the data flows to the backup cluster immediately and transparently to applications, and read/write splitting can be layered on top.

How to choose: If the primary writes, the standby mainly reads, some sync delay is acceptable, and the focus is remote DR, go with inter‑cluster sync. If the standby must be readable almost immediately after writes, you want to offload read traffic, and smooth intra‑city failover is critical, mirror clusters are the better fit.

3. Node‑Level HA: The Insurance That Fires Most Often

GBase 8a has three node types — Gcluster (scheduling), Gnode (storage & compute), Gcware (management) — and their HA logic differs.

3.1 Gcluster: Don't Let the Entry Point Become a Single Point

Gcluster handles access, authentication, SQL parsing, and scheduling. Gcluster nodes are independent and support Failover: when one node fails, the others take over its in‑flight tasks. As long as one healthy Gcluster node remains, the cluster stays online. The real risk is usually not Gcluster itself but applications that always connect to a single address.
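One way to avoid pinning an application to a single entry point is to probe a list of Gcluster addresses and use the first one that responds. The sketch below is illustrative only: the function names are hypothetical, the addresses are placeholders, and it assumes that `nc` (netcat) is installed and that the cluster listens on port 5258. Adjust both to your environment.

```shell
# Probe one Gcluster entry point; succeeds if the TCP port accepts connections.
# Assumptions: nc (netcat) is available, and 5258 is the listening port.
probe_host() {
  nc -z -w 2 "$1" "${GBASE_PORT:-5258}" >/dev/null 2>&1
}

# Print the first reachable host from the argument list; return 1 if none respond.
first_reachable_host() {
  for host in "$@"; do
    if probe_host "$host"; then
      echo "$host"
      return 0
    fi
  done
  return 1
}

# Example (placeholder addresses):
# entry=$(first_reachable_host 203.0.113.41 203.0.113.42 203.0.113.43) || exit 1
```

A TCP probe only shows that the port is open, not that the node is healthy, so a load balancer or a driver‑level host list is usually the sturdier choice where available.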

3.2 Gnode: Replica Count Determines Fault Tolerance

Gnode stores data and runs computations. Its HA relies on multi‑replica mechanisms. With 3 replicas, each piece of data has three copies on different Gnode nodes; even if two nodes become unavailable, the remaining replica still provides access.

Replica Count	Availability	Risk Profile
1	Almost no node‑level fault tolerance	Node failure = data unavailable
2	Some redundancy	Recovery and consistency pressure higher in edge cases
3	Production‑grade	Stronger node‑level fault tolerance
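The replica counts above can be read as a simple rule: a piece of data stays readable as long as at least one node holding a replica of it is still up. The following sketch checks that rule for one data slice. The function name, the slice‑to‑node placement format, and the node names are all hypothetical. GBase 8a does not expose placement in this form.

```shell
# slice_available "<comma-separated replica nodes>" "<space-separated failed nodes>"
# Succeeds if at least one replica lives on a node that has not failed.
slice_available() {
  replicas=$1
  failed=" $2 "
  old_ifs=$IFS
  IFS=,
  for node in $replicas; do
    case "$failed" in
      *" $node "*) ;;                 # this replica's node is down, keep looking
      *) IFS=$old_ifs; return 0 ;;    # found a surviving replica
    esac
  done
  IFS=$old_ifs
  return 1
}

# Same number of failures, different outcome depending on placement:
slice_available "node1,node2,node3" "node1 node2 node4" && echo "slice A: available"
slice_available "node1,node2,node3" "node1 node2 node3" || echo "slice A: unavailable"
```

Losing three nodes is survivable in the first call and fatal in the second, purely because of where the replicas sit.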

"Three replicas are safer" doesn't mean "the cluster can lose any two machines casually." Actual availability also depends on replica placement, hot‑spot data, Gcware state, and node topology.

3.3 Gcware: The Arbitration and Consistency Core

Gcware maintains cluster metadata consistency and data consistency using the Raft protocol. As long as the surviving Gcware nodes still form Raft's minimum quorum, that is, a majority of the configured nodes, the Gcware cluster continues to function.

Gcware Nodes	Recommendation	Notes
1	Not for production	Obvious single point
2	Generally discouraged	Quorum is 2 of 2, so losing either node stops Gcware
3	Commonly recommended	Good balance of cost and availability
5	For higher availability requirements	More stable, higher cost
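
The node counts in the table follow from Raft's majority rule. Here is a minimal sketch of that arithmetic, assuming Gcware uses the standard majority quorum. The function names are illustrative and not part of any GBase tool.

```shell
# Majority quorum for an n-node Raft group: floor(n/2) + 1.
raft_quorum() {
  echo $(( $1 / 2 + 1 ))
}

# Number of nodes that can fail while a majority still survives.
raft_fault_tolerance() {
  echo $(( $1 - ($1 / 2 + 1) ))
}

for n in 1 2 3 5; do
  echo "nodes=$n quorum=$(raft_quorum "$n") tolerated_failures=$(raft_fault_tolerance "$n")"
done
# nodes=1 quorum=1 tolerated_failures=0
# nodes=2 quorum=2 tolerated_failures=0
# nodes=3 quorum=2 tolerated_failures=1
# nodes=5 quorum=3 tolerated_failures=2
```

An even count adds cost without adding tolerance: 4 nodes need a quorum of 3 and still survive only one failure, which is why odd counts are the usual recommendation.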

4. Process‑Level HA: Small Faults That Shouldn't Escalate

Core GBase 8a processes (Gnode, Gcluster, Gcware, etc.) are continuously monitored and can be recovered automatically after a failure. A practical daily check:

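# Core processes (grep -E replaces the deprecated egrep; drop the grep command itself)
ps -ef | grep -E 'gcware|gcluster|gnode' | grep -v grep
# Cluster status
gcadmin
# Most recent Gcluster and Gcware log entries
tail -n 100 /opt/gbase/gcluster/log/system.log
tail -n 100 /opt/gbase/gcware/log/gcware.log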
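
For unattended checks, the process listing can be turned into a pass/fail signal. The sketch below reads a process listing on standard input and prints whichever of the three patterns it cannot find. The function name is hypothetical, and the patterns are simply the ones used in the check above. Confirm the real process names on your installation before relying on it.

```shell
# Read a process listing on stdin and print the expected services that are missing.
# Empty output means all three patterns were found.
missing_services() {
  listing=$(cat)
  missing=""
  for svc in gcware gcluster gnode; do
    if ! printf '%s\n' "$listing" | grep -q "$svc"; then
      missing="$missing $svc"
    fi
  done
  # Strip the leading space left by the concatenation above.
  echo "${missing# }"
}

# Usage: ps -ef | missing_services
```

A non‑empty result is a natural trigger for an alert, with the log tails as the first place to look.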

5. Primary‑Replica Inconsistency: The Hardest HA Problem

A node crash is usually detected quickly, but replica inconsistency can lurk while the cluster still appears operational, producing drifting query results and subtle anomalies. Common causes include inconsistent local parameters, power loss or kernel panics, RAID controller or driver anomalies, abnormal VM exits, and manual mistakes (e.g., deleting events during a node outage).

GBase 8a uses direct I/O for writes and treats a write as successful only when the I/O call returns success. But if the underlying environment misbehaves, a write reported as successful may never have reached the physical disk. The lesson: don't blame every consistency issue on database logic. Hardware, virtualization, and host stability are integral parts of HA.

6. Key Parameter: gcluster_suffix_consistency_resolve

GBase 8a provides the gcluster_suffix_consistency_resolve parameter to handle primary‑replica inconsistency:

Value	Behavior
0 (default)	Does not attempt automatic resolution
1	Tries to automatically resolve consistency issues

This parameter supports both session and global scope. When set to 1, it can automatically detect and repair scenarios such as row‑count mismatches, schema differences, and SCN inconsistencies across replicas. Before enabling it in production, verify that your version supports it, confirm the cluster has at least 3 host nodes, and validate the behavior in a test environment.

SET GLOBAL gcluster_suffix_consistency_resolve = 1;

7. Parameter Consistency: The Most Overlooked Foundation

Many teams obsess over replica counts, active‑active setups, and failover while neglecting the most basic layer: parameter consistency across nodes. Community documentation explicitly lists "parameter differences" as a common cause of replica inconsistency.

Parameter Category	Recommendation	Reason
Consistency‑related	Uniform across all nodes	Prevent replica behavior drift
Resource limits	Uniform across all nodes	Avoid weak‑link nodes
Log levels	Adjustable, but keep records	Troubleshooting convenience
Experimental params	Test environment first	Reduce production drift

A quick baseline check:

for host in 203.0.113.41 203.0.113.42 203.0.113.43
do
  echo "===== $host ====="
  # Show the parameter's setting on this node; no output means it was not found under this path
  ssh "$host" "grep gcluster_suffix_consistency_resolve /opt/gbase/conf/* 2>/dev/null"
done
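
The loop prints each node's setting but leaves the comparison to the reader. As a sketch of the next step, the function below takes "host value" pairs on standard input and fails if any host differs from the first one. The function name and the input format are assumptions. How to extract the value from each node's configuration depends on your file layout and is not shown.

```shell
# Read "host value" pairs on stdin; succeed only if every host reports the same value.
params_consistent() {
  awk '
    NR == 1 { first = $2 }
    $2 != first { drift = 1; print "drift: " $1 " has " $2 ", expected " first }
    END { exit (drift ? 1 : 0) }
  '
}

# Example with placeholder data: the third node disagrees with the first.
printf '203.0.113.41 1\n203.0.113.42 1\n203.0.113.43 0\n' | params_consistent \
  || echo "parameter drift detected"
```

Using the first host as the reference keeps the sketch short. Comparing against a recorded baseline file would also catch the case where every node drifts together.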

8. Read/Write Split Is Also Part of HA

GBase 8a supports multiple read/write split approaches. Their value isn't just performance — they keep the standby side actively serving reads, so the standby isn't idle, and the switchover cost is lower when needed.

Method	Granularity	Primary Orientation
Replicated Table	Table‑level	Write‑once‑read‑many within a node
Mirror Cluster	Cluster‑level	Real‑time read/write split
Inter‑cluster Sync	Cluster‑level	Scheduled sync, DR‑style read/write split

9. Recommended Rollout Sequence

Solidify node‑level HA first: Multi‑entry Gcluster access, no single‑replica core data on Gnode, odd‑numbered Gcware nodes for quorum.
Establish parameter and configuration baselines: Track all config changes, periodically compare key parameters across nodes, ensure temporary tweaks can be rolled back.
Then choose the cluster‑level path: Remote DR with acceptable sync delay → inter‑cluster sync. Intra‑city active‑active with real‑time read/write split → mirror cluster.
Finally, run failover drills: At minimum, cover Gcluster single‑point failure, Gnode replica node anomaly, Gcware node loss, and primary‑replica inconsistency detection & repair.

10. Daily Inspection Template

# Cluster status
gcadmin

# Key processes
ps -ef | grep -E 'gcware|gcluster|gnode' | grep -v grep

# Key logs
tail -n 100 /opt/gbase/gcluster/log/system.log
tail -n 100 /opt/gbase/gcware/log/gcware.log

# Parameter consistency across nodes
for host in 203.0.113.41 203.0.113.42 203.0.113.43
do
  echo "===== $host ====="
  ssh "$host" "grep gcluster_suffix_consistency_resolve /opt/gbase/conf/* 2>/dev/null"
done

Closing

GBase 8a HA isn't a single feature but a layered system. At the top sit inter‑cluster sync, mirror clusters, and read/write split. In the middle is fault tolerance across the three node types: Gcluster, Gnode, and Gcware. At the base are process‑level self‑healing and parameter consistency control. The most stable designs aren't the most complex ones. They are the ones that solidify the node‑level foundation first, then layer on cluster‑level capabilities as the scenario demands.

A well‑architected GBase 8a HA strategy keeps your data available and consistent through failures both small and large, and that is what production maturity really looks like.
