
Michael

Posted on • Originally published at gbase.cn

How GBase 8a Detects and Responds to a GNode Process Crash

When a data node (gnode) process exits unexpectedly, GBase 8a's multi‑layer monitoring responds automatically: the cluster tries to restart the service, isolates the failed node, redirects traffic to healthy nodes, and later resynchronizes the data. In most cases this happens without human intervention.

How Faults Are Detected

  • Process‑level monitoring (GCMonit) runs on every node, watching core processes like gbased and syncserver. The moment a process dies, GCMonit attempts an automatic restart according to its configuration.
  • Cluster‑level heartbeat (GCware) tracks every GNode's heartbeat. If heartbeats time out or stop, GCware marks the node's service state as CLOSE.
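The heartbeat check above can be pictured with a small model. This is a minimal sketch, not GCware's actual implementation: the class name `HeartbeatTracker`, the 10-second timeout, and the `OPEN` label for a healthy node are all assumptions made for illustration. Only the `CLOSE` state comes from the description above.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Hypothetical threshold; the real timeout depends on GCware configuration.
HEARTBEAT_TIMEOUT_S = 10.0


@dataclass
class HeartbeatTracker:
    """Toy model of cluster-level heartbeat tracking (illustrative only)."""
    timeout_s: float = HEARTBEAT_TIMEOUT_S
    last_seen: Dict[str, float] = field(default_factory=dict)  # node -> last heartbeat time
    state: Dict[str, str] = field(default_factory=dict)        # node -> "OPEN" or "CLOSE"

    def record(self, node: str, now: float) -> None:
        # A heartbeat arrived. Simplification: the node is treated as healthy
        # again right away, ignoring the resync a real node would still need.
        self.last_seen[node] = now
        self.state[node] = "OPEN"

    def sweep(self, now: float) -> List[str]:
        # Mark every node whose last heartbeat is older than the timeout as
        # CLOSE and return the nodes that changed state during this sweep.
        newly_closed = []
        for node, seen in self.last_seen.items():
            if self.state[node] == "OPEN" and now - seen > self.timeout_s:
                self.state[node] = "CLOSE"
                newly_closed.append(node)
        return newly_closed
```

Passing the current time in as an argument, rather than reading a clock inside the class, keeps the sketch deterministic and easy to reason about.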

Automated Response Sequence

  1. Automatic restart — GCMonit tries to relaunch the gbased process immediately. This first line of defense recovers from transient issues like memory spikes.
  2. Service isolation and traffic redirection — If the restart fails, or GCware declares the node dead first, GCware sets the node status to CLOSE and notifies the GCluster coordinator. New queries are then routed exclusively to healthy nodes that hold replicas of the failed node's data. This switch is transparent to applications.
  3. Data consistency repair — Once the node comes back (either via auto‑restart or manual recovery), GCware logs inconsistency events (DML_EVENT / DDL_EVENT). The GCrecover process picks up these events and triggers SyncServer to copy fresh data from healthy replicas to the recovered node until it is fully consistent.
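The three steps above can be sketched as a single control flow. This is a rough illustration under stated assumptions, not the product's code: `respond_to_crash`, the `try_restart` callback, and the limit of three restart attempts are hypothetical, since the article only says GCMonit restarts the process according to its configuration.

```python
from typing import Callable, List


def respond_to_crash(try_restart: Callable[[], bool], max_attempts: int = 3) -> List[str]:
    """Return the ordered actions taken after a gnode crash (illustrative only).

    try_restart is a hypothetical callback that returns True when the
    gbased process came back up. max_attempts is an assumed limit.
    """
    steps: List[str] = []
    restarted = False

    # Step 1: automatic restart, the first line of defense.
    for attempt in range(1, max_attempts + 1):
        steps.append(f"restart attempt {attempt}")
        if try_restart():
            restarted = True
            break

    # Step 2: if the restart keeps failing, isolate the node and redirect traffic.
    if not restarted:
        steps.append("mark node CLOSE")
        steps.append("route queries to replicas")
        steps.append("wait for manual recovery")

    # Step 3: once the node is back, replay the logged events and resync it.
    steps.append("process DML_EVENT/DDL_EVENT and resync from replicas")
    return steps
```

For example, `respond_to_crash(lambda: True)` models a transient failure: one restart attempt followed directly by the resync step, with no isolation in between.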

When Manual Intervention Is Needed

  • The monitoring process itself (GCMonit or gcware_monit) crashes, breaking the auto‑restart chain.
  • The gbased process fails to start repeatedly due to misconfiguration, disk full, or memory exhaustion.
  • A majority of GCware nodes fail, locking the cluster and disabling automated recovery.
  • Routine checks reveal a node stuck in CLOSE or a process permanently DOWN.
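The majority condition in the third item is simple arithmetic, shown here as a small sketch. The function name `gcware_majority_failed` is hypothetical, and the sketch follows the wording above literally: the cluster is treated as locked only when more than half of the GCware nodes are down. How the cluster behaves when exactly half of an even-sized group fails is not covered by this article, so that case is an assumption here.

```python
def gcware_majority_failed(total_nodes: int, failed_nodes: int) -> bool:
    """Return True when more than half of the GCware nodes have failed.

    Illustrative only: the exactly-half case (even node count) is treated
    as "not a majority", which is an assumption of this sketch.
    """
    if total_nodes <= 0 or not 0 <= failed_nodes <= total_nodes:
        raise ValueError("node counts are out of range")
    # Integer comparison avoids floating-point division.
    return failed_nodes * 2 > total_nodes
```

With three GCware nodes, losing one leaves automated recovery available, while losing two meets the majority condition and calls for manual intervention.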

Common manual commands:

# Check cluster and node status
gcadmin showcluster vc <vc_name>

# Restart all services on the failed node
gcluster_services all restart

# Inspect the error log
tail -n 100 -f /opt/gbase/gnode/log/gbase/system.log

GBase 8a’s multi‑layered automation means that a single gnode crash rarely disrupts service. In a typical GBase 8a deployment, the DBA’s role shifts from firefighting to watching for the edge cases where the automation needs help.
