When a GBase 8c node goes offline or behaves erratically, narrowing down the root cause quickly is essential. This guide walks through the common failure scenarios for GBase 8c, GBASE's domestically developed Chinese database, and the commands you’ll use to diagnose them.
1. Operating System Issues
If all instances on a node are down, suspect the OS first.
- Cannot SSH – Ping the host. No response likely means a network outage, kernel panic, or reboot. A panic‑induced reboot can take up to 20 minutes; retry every 5 minutes. If it’s still dead after 20 minutes, a site visit is needed.
- Ping works, but SSH freezes – Usually CPU or I/O exhaustion. Retry a few times; if no luck within 5 minutes, physical intervention is required.
- System responds but is sluggish – Collect OS metrics with:
who
cat /etc/openEuler-release
uname -a
sysctl -a
cat /etc/sysctl.conf
cat /proc/cpuinfo
cat /proc/meminfo
top -H
iostat -x 1 3
vmstat 1 3
# Inspect logs: /var/log/messages or dmesg
The watchdog timer (default 60s) will reset the system if it hangs.
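The reachability rules above can be sketched as a small shell helper. This is only an illustration, assuming the iputils `ping` and the OpenSSH client are available; the function names `classify_host` and `probe_host` are hypothetical and not part of any GBase tooling.

```shell
#!/bin/sh
# Hypothetical helper: map probe results to the triage cases described above.
# Arguments are the exit codes (0 = success) of a ping probe and an SSH probe.
classify_host() {
    ping_rc=$1
    ssh_rc=$2
    if [ "$ping_rc" -ne 0 ]; then
        echo "unreachable: suspect network outage, kernel panic, or reboot"
    elif [ "$ssh_rc" -ne 0 ]; then
        echo "ssh-hung: suspect CPU or I/O exhaustion"
    else
        echo "reachable: collect OS metrics"
    fi
}

# Probe a host once and classify it (assumes iputils ping and OpenSSH).
probe_host() {
    host=$1
    if ping -c 1 -W 2 "$host" >/dev/null 2>&1; then
        ping_rc=0
    else
        ping_rc=1
    fi
    ssh_rc=1
    if [ "$ping_rc" -eq 0 ]; then
        # BatchMode avoids hanging on a password prompt.
        if ssh -o BatchMode=yes -o ConnectTimeout=5 "$host" true >/dev/null 2>&1; then
            ssh_rc=0
        fi
    fi
    classify_host "$ping_rc" "$ssh_rc"
}
```

For example, `probe_host node1` (with `node1` standing in for a real hostname) could be repeated every 5 minutes until the time limits given above are reached.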
2. Network Problems
Network faults often cause startup failures, instances stuck in the Unknown state, or sudden disconnections.
Startup Failure Due to Network Errors
- Port conflict:
netstat -anop | grep 15400
# Kill the conflicting process or change the DB port.
- Primary‑standby link not established – check and stop the firewall:
systemctl status firewalld.service
systemctl stop firewalld.service
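A plain `grep 15400` also matches lines where 15400 appears as a remote port or inside a PID. As a rough sketch, the filter below narrows `netstat -anop` output to sockets that are actually listening on the port. The function name `listeners_on_port` is hypothetical, and the column positions assume the usual net-tools layout for TCP lines.

```shell
#!/bin/sh
# Hypothetical helper: read `netstat -anop` output on stdin and print the
# PID/program of every TCP socket in LISTEN state on the given local port.
listeners_on_port() {
    port=$1
    awk -v port="$port" '
        $6 == "LISTEN" {
            # $4 is the local address, e.g. 0.0.0.0:15400 or :::15400
            n = split($4, parts, ":")
            if (parts[n] == port) print $7   # PID/Program name column
        }'
}
```

Usage: `netstat -anop 2>/dev/null | listeners_on_port 15400`. An empty result suggests that nothing is listening on the port, so the conflict lies elsewhere.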
Cluster State Abnormal
- All instances Unknown, mass failover, or frequent "Connection reset by peer" errors.
- If ping works but SSH doesn't, resources may be exhausted.
- Check NIC errors:
ifconfig enp125s0f0 # watch 'dropped' and 'errors'
- Verify kernel parameters:
net.ipv4.tcp_retries1 = 3
net.ipv4.tcp_retries2 = 15
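One way to verify these values without reading the full `sysctl -a` dump by eye is a small comparison helper. The sketch below is illustrative only: `check_sysctl` is a hypothetical name, and it assumes the `key = value` line format that `sysctl -a` prints.

```shell
#!/bin/sh
# Hypothetical helper: read "key = value" lines (the `sysctl -a` format) on
# stdin and compare one key against an expected value.
# Returns 0 on match, 1 on mismatch, 2 if the key is absent.
check_sysctl() {
    key=$1
    expected=$2
    actual=$(awk -F' = ' -v k="$key" '$1 == k && !found { print $2; found = 1 }')
    if [ -z "$actual" ]; then
        echo "$key: not found"
        return 2
    elif [ "$actual" = "$expected" ]; then
        echo "$key: ok ($actual)"
    else
        echo "$key: expected $expected, got $actual"
        return 1
    fi
}
```

Usage: `sysctl -a 2>/dev/null | check_sysctl net.ipv4.tcp_retries1 3`, and likewise for `net.ipv4.tcp_retries2` with the value 15.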
Connection Establishment Failures
- "Connection refused": verify port match, listening status, and process presence.
- Failed to get connection descriptors – logs may show:
can not accept connection in pending mode
the database system is starting up
can not accept connection in standby mode
- Use gs_om -t status --detail to check role state and reset if needed.
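To see which of the three log messages above dominates, a quick tally can help. The following is a minimal sketch: `count_conn_errors` is a hypothetical helper, and the log file path is left to the caller because the log location depends on the installation.

```shell
#!/bin/sh
# Hypothetical helper: count how many lines of a log file contain each of
# the connection-rejection messages quoted above.
count_conn_errors() {
    logfile=$1
    for msg in \
        "can not accept connection in pending mode" \
        "the database system is starting up" \
        "can not accept connection in standby mode"
    do
        # grep -c prints the number of matching lines; -F matches literally.
        # grep exits non-zero when the count is 0, hence the `|| true`.
        n=$(grep -c -F -- "$msg" "$logfile" || true)
        echo "$n  $msg"
    done
}
```

A high count for the standby message, for instance, would suggest that clients are connecting to a standby instance, which the role state reported by `gs_om -t status --detail` can confirm.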
SQL Interrupted by Network
- Errors like "Connection reset by peer" or "Connection timed out".
- Run gs_check to audit network config, and look for core dumps or recent role changes.
3. Disk Failures
Typical disk issues: insufficient space, bad blocks, unmounted filesystems.
Failures that corrupt the filesystem (instance stays Unknown):
- Logs contain "data path disc writable test failed".
- Unmounted disk – directory permissions appear abnormal.
- Bad blocks:
badblocks /dev/sdb1 -s -v
Failures that don’t corrupt the filesystem but kill the process:
- Logs show "No space left on device" or "invalid page header".
- Check with:
df -h
(In the example output, /dev/sdb1 is at 9% usage, which is normal.)
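On hosts with many mount points, the space check can be reduced to the filesystems that are nearly full. The sketch below assumes the POSIX output format of `df -P`. The name `fs_over_threshold` is hypothetical, and the 90% threshold in the usage line is an arbitrary assumption rather than a GBase recommendation.

```shell
#!/bin/sh
# Hypothetical helper: read `df -P` output on stdin and print every
# filesystem whose use percentage is at or above the given threshold.
fs_over_threshold() {
    threshold=$1
    awk -v limit="$threshold" '
        NR > 1 {                          # skip the header line
            use = $5
            sub(/%$/, "", use)            # "95%" -> "95"
            if (use + 0 >= limit + 0) print $1, $6, $5
        }'
}
```

Usage: `df -P | fs_over_threshold 90`. Mount points that contain spaces are not handled by this sketch.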
4. Database Troubleshooting
Logs
The server log provides the most direct clues about startup, runtime, and shutdown issues.
Key Views
- pg_stat_activity – current session details.
- pg_thread_wait_status – per‑thread wait events.
- pg_locks – lock information.
Core Files
When a backend crashes, the core file is invaluable for debugging. Core dumps can impact performance and consume disk space, so address them promptly.
Check the current core file pattern:
cat /proc/sys/kernel/core_pattern
# e.g. /data/core/core-%e-%p-%t
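Because core files consume disk space, it helps to know which directory the pattern points at. The sketch below derives that directory from a pattern string. `core_dir_of` is a hypothetical helper; it relies on standard Linux behaviour, where a pattern beginning with `|` is piped to a program and a pattern without a leading `/` is resolved against the crashing process's working directory.

```shell
#!/bin/sh
# Hypothetical helper: given a core_pattern value, print where core files
# end up.
core_dir_of() {
    pattern=$1
    case "$pattern" in
        "|"*) echo "piped to a program: ${pattern#|}" ;;   # no file is written by the kernel
        /*)   dirname "$pattern" ;;                        # absolute path
        *)    echo "relative to the crashing process's working directory" ;;
    esac
}
```

With the example pattern above, `core_dir_of '/data/core/core-%e-%p-%t'` prints `/data/core`, and that directory can then be passed to `df -h` to check the remaining space.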
By methodically checking these four layers — OS, network, disk, and database internals — you can swiftly pinpoint the culprit behind most GBase 8c outages and restore service with confidence.