Michael

Posted on May 8 • Originally published at gbase.cn

Troubleshooting GBase 8c: OS, Network, Disk, and Database

#gbase #database #数据库

When a GBase 8c node goes offline or behaves erratically, narrowing down the root cause quickly is essential. This guide walks through the common failure scenarios for GBase 8c, a database developed in China by GBASE, and the commands you’ll use to diagnose them.

1. Operating System Issues

If all instances on a node are down, suspect the OS first.

Cannot SSH – Ping the host. No response likely means a network outage, kernel panic, or reboot. A panic‑induced reboot can take up to 20 minutes; retry every 5 minutes. If it’s still dead after 20 minutes, a site visit is needed.
Ping works, but SSH freezes – Usually CPU or I/O exhaustion. Retry a few times; if no luck within 5 minutes, physical intervention is required.
System responds but is sluggish – Collect OS metrics with:

who
cat /etc/openEuler-release
uname -a
sysctl -a
cat /etc/sysctl.conf
cat /proc/cpuinfo
cat /proc/meminfo
top -H
iostat -x 1 3
vmstat 1 3
# Inspect logs: /var/log/messages or dmesg

The watchdog timer (default 60s) will reset the system if it hangs.
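The retry schedule for an unreachable host (probe, wait 5 minutes, give up at the 20-minute mark) can be scripted rather than timed by hand. The following is only a sketch: retry_probe is a hypothetical helper name, and the address in the usage comment is a placeholder.

```shell
# retry_probe ATTEMPTS INTERVAL_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or ATTEMPTS is exhausted.
retry_probe() {
    attempts=$1
    interval=$2
    shift 2
    i=1
    while [ "$i" -le "$attempts" ]; do
        if "$@" >/dev/null 2>&1; then
            echo "reachable on attempt $i"
            return 0
        fi
        # Sleep only between attempts, not after the last one.
        if [ "$i" -lt "$attempts" ]; then
            sleep "$interval"
        fi
        i=$((i + 1))
    done
    echo "unreachable after $attempts attempts"
    return 1
}

# Usage (placeholder address): probe at 0, 5, 10, 15 and 20 minutes.
# retry_probe 5 300 ping -c 1 -W 5 192.0.2.10
```

Passing the probe as a command keeps the helper usable for SSH checks as well, for example with ssh -o ConnectTimeout=10 in place of ping.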

2. Network Problems

Network faults often cause startup failures, UnKnown instance states, or sudden disconnections.

Startup Failure Due to Network Errors

Port conflict:

netstat -anop | grep 15400
# Kill the conflicting process or change the DB port.
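A plain grep for the port number also matches PIDs and remote ports that happen to contain the same digits. As a rough sketch, assuming the usual Linux netstat -anop column layout (local address in column 4, state in column 6, PID/program in column 7), a stricter filter could look like this. The function name listeners_on_port is made up for illustration.

```shell
# listeners_on_port PORT
# Reads `netstat -anop` output on stdin and prints only sockets that are
# LISTENing on PORT, as "local_address pid/program".
listeners_on_port() {
    awk -v port="$1" '
        $6 == "LISTEN" && $4 ~ (":" port "$") { print $4, $7 }
    '
}

# Usage:
# netstat -anop 2>/dev/null | listeners_on_port 15400
```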

Primary‑standby link not established – check and stop the firewall:

systemctl status firewalld.service
systemctl stop firewalld.service

Cluster State Abnormal

All instances UnKnown, mass failover, or frequent Connection reset by peer.
If ping works but SSH doesn't, resources may be exhausted.
Check NIC errors:

ifconfig enp125s0f0   # watch 'dropped' and 'errors'
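On hosts where ifconfig is not installed, the same counters are normally exposed under /sys/class/net/<interface>/statistics. Here is a minimal sketch that reads them. The helper name nic_error_counters and its optional second argument (an alternate sysfs root, handy for dry runs) are assumptions, not part of any GBase tooling.

```shell
# nic_error_counters IFACE [SYSFS_NET_ROOT]
# Prints the error and drop counters for one interface, one per line.
nic_error_counters() {
    iface=$1
    root=${2:-/sys/class/net}
    stats_dir="$root/$iface/statistics"
    for counter in rx_errors rx_dropped tx_errors tx_dropped; do
        if [ -r "$stats_dir/$counter" ]; then
            printf '%s=%s\n' "$counter" "$(cat "$stats_dir/$counter")"
        else
            printf '%s=unavailable\n' "$counter"
        fi
    done
}

# Usage:
# nic_error_counters enp125s0f0
```

Running it twice a few seconds apart shows whether the counters are still climbing, which matters more than their absolute values.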

Verify kernel parameters:

net.ipv4.tcp_retries1 = 3
net.ipv4.tcp_retries2 = 15
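To compare the running values against the ones listed above, something along these lines may help. It is a sketch only: check_sysctl is a hypothetical helper, and it reads /proc/sys directly instead of calling sysctl so that it can be pointed at a different root when testing.

```shell
# check_sysctl NAME EXPECTED [PROC_SYS_ROOT]
# Compares a kernel parameter with the expected value.
# Returns 0 on match, 1 on mismatch, 2 if the parameter cannot be read.
check_sysctl() {
    name=$1
    expected=$2
    root=${3:-/proc/sys}
    # net.ipv4.tcp_retries1 -> net/ipv4/tcp_retries1
    path="$root/$(printf '%s' "$name" | tr . /)"
    if [ ! -r "$path" ]; then
        echo "UNREADABLE $name"
        return 2
    fi
    actual=$(cat "$path")
    if [ "$actual" = "$expected" ]; then
        echo "OK $name = $actual"
    else
        echo "MISMATCH $name = $actual (expected $expected)"
        return 1
    fi
}

# Usage:
# check_sysctl net.ipv4.tcp_retries1 3
# check_sysctl net.ipv4.tcp_retries2 15
```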

Connection Establishment Failures

Connection refused: verify that the client is using the port the server listens on, that the port is in the LISTEN state, and that the database process is running.
Failed to get connection descriptors – logs may show:
- can not accept connection in pending mode
- the database system is starting up
- can not accept connection in standby mode
Use gs_om -t status --detail to check role state and reset if needed.
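When scanning a large server log for the three messages above, a quick tally shows which state the instance was stuck in. This is a sketch under assumptions: count_conn_errors is a made-up helper name, and the log path in the usage comment is a placeholder that depends on your installation.

```shell
# count_conn_errors LOG_FILE
# Prints "<count><TAB><message>" for each known connection-rejection message.
count_conn_errors() {
    log_file=$1
    for msg in \
        "can not accept connection in pending mode" \
        "the database system is starting up" \
        "can not accept connection in standby mode"; do
        # grep -c prints 0 but exits non-zero when nothing matches,
        # so the exit status is deliberately ignored.
        count=$(grep -c -F -- "$msg" "$log_file" || true)
        printf '%s\t%s\n' "$count" "$msg"
    done
}

# Usage (placeholder path):
# count_conn_errors /path/to/server.log
```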

SQL Interrupted by Network

Errors like Connection reset by peer or Connection timed out.
Run gs_check to audit network config, and look for core dumps or recent role changes.

3. Disk Failures

Typical disk issues: insufficient space, bad blocks, unmounted filesystems.

Failures that corrupt the filesystem (instance stays Unknown):

Logs contain data path disc writable test failed.
Unmounted disk – the data directory shows the empty mount point instead of the real filesystem, so its ownership and permissions appear abnormal.
Bad blocks:

# Read-only scan (the default mode); -s shows progress, -v prints verbose output
badblocks -s -v /dev/sdb1

Failures that don’t corrupt the filesystem but kill the process:

Logs show No space left on device or invalid page header.
Check with:

df -h

(Example output shows /dev/sdb1 at 9% usage — normal.)
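To flag nearly full filesystems without reading the df output by eye, a small filter over df -P (the POSIX output format, one filesystem per line) is one option. This is a sketch: filesystems_over is a hypothetical name, the 90% threshold in the usage comment is an arbitrary example, and mount points containing spaces are not handled.

```shell
# filesystems_over PERCENT
# Reads `df -P` output on stdin and prints "mount_point usage" for every
# filesystem whose usage is at or above PERCENT.
filesystems_over() {
    awk -v limit="$1" '
        NR > 1 {
            use = $5
            sub(/%/, "", use)          # "95%" -> "95"
            if (use + 0 >= limit + 0) print $6, $5
        }
    '
}

# Usage:
# df -P | filesystems_over 90
```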

4. Database Troubleshooting

Logs

The server log provides the most direct clues about startup, runtime, and shutdown issues.

Key Views

pg_stat_activity – current session details.
pg_thread_wait_status – per‑thread wait events.
pg_locks – lock information.

Core Files

When a backend crashes, the core file is invaluable for debugging. Core dumps can impact performance and consume disk space, so address them promptly.

Check the current core file pattern:

cat /proc/sys/kernel/core_pattern
# e.g. /data/core/core-%e-%p-%t
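Before relying on core files, it is worth confirming where the pattern actually sends them. The sketch below interprets a core_pattern value using standard Linux semantics: a leading | pipes the dump to a program, an absolute path writes into that directory, and anything else lands in the crashing process's working directory. The function name core_pattern_target is an assumption made for this example.

```shell
# core_pattern_target PATTERN
# Prints the directory a core_pattern writes to, "piped" if the dump is
# handed to a helper program, or "cwd" for a relative pattern.
core_pattern_target() {
    pattern=$1
    case "$pattern" in
        "|"*) echo "piped" ;;
        /*)   dirname "$pattern" ;;
        *)    echo "cwd" ;;
    esac
}

# Usage:
# core_pattern_target "$(cat /proc/sys/kernel/core_pattern)"
# ulimit -c   # a soft limit of 0 suppresses core files written to disk
```

If the result is a directory, check that it exists, is writable by the database OS user, and has enough free space for a full memory image.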

By methodically checking these four layers — OS, network, disk, and database internals — you can swiftly pinpoint the culprit behind most GBase 8c outages and restore service with confidence.
