Node crashes, connection storms, shard anomalies, and resource overloads are the most frequent disruptions in a GBase database cluster. This guide walks through real-world cases and provides ready-to-run commands for diagnosis, emergency response, and root-cause repair.
1. Failure Categories and Severity
| Type | Typical Symptoms | Impact | Severity |
|---|---|---|---|
| Node failure | Process exit, node offline, related operations fail | Affects specific shards; multi‑node failure may bring down the cluster | High |
| Connection fault | Connection refused, timeouts, maxed‑out connections | Business cannot reach the database | High |
| Shard anomaly | Shard missing, inconsistent, migration failed | Tables become partially unreadable/unwritable | High |
| Resource overload | CPU/Memory/IO >80%, sluggish operations | Whole cluster degrades, may trigger node crashes | Medium |
2. Core Diagnostic Tools
- Cluster management: gbase_ctl, gs_om, gs_check for status, shard operations, and health checks.
- System views: pg_stat_activity, pg_stat_database for sessions and database stats.
- OS commands: top, free, iostat for resource usage; netstat for network.
- Logs: $GBase_HOME/log/gbase-xxxx.log; search for error, failed, timeout.
Standard triage flow: Classify severity → confirm symptoms → isolate scope → diagnose with tools → verify root cause.
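The log search step can be sketched as a pair of small shell helpers. This is a minimal sketch, not a GBase utility: the function names and the LOG_KEYWORDS variable are made up for illustration, and the log path is whatever file your installation writes.

```shell
#!/usr/bin/env bash
# Keywords taken from the log-search advice above.
# LOG_KEYWORDS is an illustrative variable, not a GBase setting.
LOG_KEYWORDS='error|failed|timeout'

# scan_log FILE [N]: print the last N (default 20) matching lines,
# each prefixed with its line number in the file. Case-insensitive.
scan_log() {
    local file="$1" limit="${2:-20}"
    grep -inE "$LOG_KEYWORDS" "$file" | tail -n "$limit"
}

# count_log_hits FILE: print how many lines match any keyword.
count_log_hits() {
    grep -icE "$LOG_KEYWORDS" "$1"
}
```

Running `count_log_hits` first gives a quick sense of scale; `scan_log` then shows the most recent matches, which is usually where the triggering event sits.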
3. Real‑World Cases with Step‑by‑Step Recovery
3.1 Node Crash (DN3 OOM)
Diagnosis:
- gs_om -t status showed DN3 offline.
- Logged into DN3; gbase_ctl status indicated the process was dead.
- Checked dn-xxxx.log and found out of memory.
- top confirmed 98% memory usage.
Emergency response:
# 1. Kill unnecessary background processes on DN3 to free memory
# 2. Restart the DN3 node
gbase_ctl start -D /data/gbase8c/dn3
# 3. Verify node sync status
gs_om -t status # should show Normal
gs_sync_check # confirm shard data is in sync
# 4. Notify business to test; keep monitoring memory usage
Root fix: Upgraded DN3 memory to 32 GB, set memory alerts (>80%), and scheduled weekly log cleanup.
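The memory alert mentioned in the root fix could be prototyped as below. This is a sketch under assumptions: it expects `free -m` output in the procps-ng layout, where the seventh field of the Mem: line is "available", and the function names are hypothetical. A real deployment would more likely express the same rule in the monitoring system.

```shell
#!/usr/bin/env bash
# mem_used_pct: read `free -m` output on stdin and print the used-memory
# percentage, computed as (total - available) / total, truncated to an integer.
# Assumes the procps-ng column layout (field 7 of "Mem:" is "available").
mem_used_pct() {
    awk '/^Mem:/ { printf "%d\n", ($2 - $7) * 100 / $2 }'
}

# check_mem [THRESHOLD]: read `free -m` output on stdin; print a status line
# and return 1 when usage exceeds THRESHOLD (default 80, as in the root fix).
check_mem() {
    local threshold="${1:-80}" pct
    pct=$(mem_used_pct)
    if [ "$pct" -gt "$threshold" ]; then
        echo "ALERT: memory at ${pct}% (threshold ${threshold}%)"
        return 1
    fi
    echo "OK: memory at ${pct}%"
}
```

A typical call would be `free -m | check_mem 80` from cron on each DN. Taking the data on stdin keeps the parsing separate from the collection, so the same function can be checked against saved output.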
3.2 Connection Storm (Firewall Block + Max Connections)
Diagnosis:
- gs_om -t status: all nodes healthy. Local connection on CN succeeded.
- Firewall on CN was enabled without port 5432 allowed.
- netstat -an | grep 5432 showed no external listener.
- After opening firewall, some connections still failed; pg_stat_activity showed max connections reached.
Emergency response:
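# 1. Open the port (shell, on the CN). Keep firewalld running: firewall-cmd
#    cannot apply rules to a stopped service.
firewall-cmd --add-port=5432/tcp --permanent
firewall-cmd --reload
-- 2. Kill idle sessions (SQL), excluding the current session
SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE state = 'idle' AND pid <> pg_backend_pid();
-- 3. Raise the connection limit (SQL). In PostgreSQL-derived engines,
--    max_connections normally takes effect only after a node restart;
--    pg_reload_conf() alone does not apply it.
ALTER SYSTEM SET max_connections = 2000;
SELECT pg_reload_conf();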
Root fix: Permanently open the necessary ports, set max_connections to 2000, deploy a HikariCP connection pool, and regularly purge idle sessions.
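To see how close a CN is to its connection limit before sessions start failing, the per-state counts from pg_stat_activity can be summarised in the shell. The sketch below assumes the counts come from a query such as `SELECT state, count(*) FROM pg_stat_activity GROUP BY state`, exported in unaligned, tuples-only form (one `state|count` pair per line). The function name is hypothetical, and the limit passed in must be greater than zero.

```shell
#!/usr/bin/env bash
# summarise_sessions MAX: read "state|count" lines on stdin and print the
# total session count, the idle count, and the share of MAX in use.
# MAX is the configured connection limit and must be greater than zero.
summarise_sessions() {
    local max="$1"
    awk -F'|' -v max="$max" '
        NF == 2      { total += $2 }   # every state counts toward the limit
        $1 == "idle" { idle  += $2 }   # idle sessions are candidates for purging
        END {
            printf "total=%d idle=%d max=%d used_pct=%d\n",
                   total, idle, max, total * 100 / max
        }'
}
```

A high used_pct with a large idle share points to a missing or misconfigured connection pool rather than to genuine load, which is the pattern seen in this incident.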
3.3 Shard Anomaly (Failed Shard Migration)
Diagnosis:
- gs_om -t status --detail revealed 3 shards abnormal between DN1 and DN4.
- Cluster log contained shard migration failed due to network timeout.
- gs_shard_status confirmed metadata inconsistency.
Emergency response:
# 1. Abort the failed migration
gs_om -t stop_shard_migration --shard_id=xxx,xxx,xxx
# 2. Recover the shards back to the original node DN1
gs_om -t recover_shard --shard_id=xxx,xxx,xxx --target_node=DN1
# 3. Verify shard status is normal
gs_om -t status --detail | grep "shard"
# 4. Synchronise shard data
gs_sync_check --shard_id=xxx,xxx,xxx
gs_shard_sync --shard_id=xxx,xxx,xxx # if inconsistencies remain
Root fix: Verify network before migration, set timeouts, schedule migrations during off‑peak hours; weekly shard status checks and metadata backups.
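The "verify network before migration" step could be enforced with a small pre-flight gate that refuses to proceed unless every node passes a bounded check. This is a sketch, not part of the GBase tooling: `preflight`, `CHECK_CMD`, and `CHECK_TIMEOUT` are hypothetical names, and the default check is a single ping, which may not be allowed on your network.

```shell
#!/usr/bin/env bash
# CHECK_CMD is a hypothetical hook: the command run once per node, with the
# node name appended as its last argument. Defaults to a single ping.
CHECK_CMD=${CHECK_CMD:-"ping -c 1 -W 2"}
# Upper bound in seconds for each check, enforced by coreutils `timeout`.
CHECK_TIMEOUT=${CHECK_TIMEOUT:-5}

# preflight NODE...: run the check against every node, print one result line
# per node, and return 1 if any node failed or timed out.
preflight() {
    local node failed=0
    for node in "$@"; do
        # $CHECK_CMD is intentionally unquoted so it splits into command + args.
        if timeout "$CHECK_TIMEOUT" $CHECK_CMD "$node" >/dev/null 2>&1; then
            echo "ok   $node"
        else
            echo "FAIL $node"
            failed=1
        fi
    done
    return "$failed"
}
```

A migration script would call `preflight DN1 DN4 || exit 1` before starting. Because every node is checked even after one fails, the output lists all unreachable nodes in one pass.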
3.4 Resource Overload (Cluster‑Wide CPU/Memory/IO Spikes)
Diagnosis:
- top / free / iostat showed all DNs CPU >90%, memory >85%.
- pg_stat_activity revealed many long-running slow queries plus a batch write job.
- Logs contained CPU usage high warnings.
Emergency response:
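-- 1. Terminate queries that have been running for more than 30 seconds
--    (active sessions only, excluding the current session)
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE state = 'active'
  AND pid <> pg_backend_pid()
  AND now() - query_start > interval '30 seconds';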
-- 2. Suspend non‑critical batch jobs (kill the script or stop the scheduler)
-- 3. Prioritise core business operations
--    (confirm that both parameters exist in your GBase release before applying)
ALTER SYSTEM SET resource_manager = 'priority';
ALTER SYSTEM SET core_business_priority = 'high';
SELECT pg_reload_conf();
Monitor with top -d 5 and iostat -x 5 until the load drops.
Root fix: Optimise slow queries, reschedule batch jobs to off‑peak hours, consider cluster expansion, set load alerts (>75%).
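The load alert in the root fix can be read as a two-level rule: warn above 75%, and treat anything above the 90% seen in this incident as critical. The sketch below assumes utilisation figures have already been collected as integer percentages. The function and variable names are hypothetical, and the 90% critical level is an assumption drawn from the incident, not a GBase default.

```shell
#!/usr/bin/env bash
# Thresholds: 75 comes from the alert level in the root fix; 90 is an
# assumed critical level based on the CPU figure observed in the incident.
WARN_PCT=${WARN_PCT:-75}
CRIT_PCT=${CRIT_PCT:-90}

# classify_load PCT: print ok, warn, or critical for an integer percentage.
classify_load() {
    local pct="$1"
    if   [ "$pct" -gt "$CRIT_PCT" ]; then echo "critical"
    elif [ "$pct" -gt "$WARN_PCT" ]; then echo "warn"
    else echo "ok"
    fi
}

# report_load: read "metric percent" pairs on stdin, one per line, and print
# each metric with its value and level.
report_load() {
    local metric pct
    while read -r metric pct; do
        [ -n "$metric" ] || continue   # skip blank lines
        printf '%s=%s%% %s\n' "$metric" "$pct" "$(classify_load "$pct")"
    done
}
```

Keeping the thresholds in variables lets the same script serve different node classes, for example `WARN_PCT=60 report_load` on a CN that should stay lightly loaded.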
4. Post‑Incident Review and Long‑Term Safeguards
- Incident review: Within 24 hours, document the cause, response, and preventive actions.
- Continuous monitoring: Prometheus+Grafana with multi‑level alerts.
- Standardised procedures: SOPs for shard migration, node expansion, configuration changes.
- Weekly inspection: Node health, resource usage, logs, shard status.
- Quarterly drills: Simulate node crash, shard anomaly, and connection flood scenarios.
The commands and workflows presented here come from production GBase database environments. Verify them against your GBase version, then apply them to keep your clusters stable and your recovery times short.