Michael

Posted on • Originally published at gbase.cn

Troubleshooting GBase 8c: OS, Network, Disk, and Database

When a GBase 8c node goes offline or behaves erratically, you need to narrow down the root cause quickly. This guide walks through common failure scenarios for GBase 8c, GBASE's database developed in China, and the commands you’ll use to diagnose them.

1. Operating System Issues

If all instances on a node are down, suspect the OS first.

  • Cannot SSH – Ping the host. No response likely means a network outage, kernel panic, or reboot. A panic‑induced reboot can take up to 20 minutes; retry every 5 minutes. If it’s still dead after 20 minutes, a site visit is needed.
  • Ping works, but SSH freezes – Usually CPU or I/O exhaustion. Retry a few times; if no luck within 5 minutes, physical intervention is required.
  • System responds but is sluggish – Collect OS metrics with:
who
cat /etc/openEuler-release
uname -a
sysctl -a
cat /etc/sysctl.conf
cat /proc/cpuinfo
cat /proc/meminfo
top -H
iostat -x 1 3
vmstat 1 3
# Inspect logs: /var/log/messages or dmesg

The watchdog timer (default 60s) will reset the system if it hangs.
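
The retry schedule from the first bullet (probe every 5 minutes, give up after 20) can be scripted so nobody has to watch the clock. This is a minimal bash sketch: `wait_for_host` is a hypothetical helper name, the `ping` flags assume the Linux iputils version, and the elapsed time counts only the sleep intervals, so it is approximate.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not a GBase tool.
# Usage: wait_for_host HOST [INTERVAL_SECONDS] [MAX_WAIT_SECONDS]
# Returns 0 as soon as one ping succeeds, 1 once MAX_WAIT_SECONDS has elapsed.
wait_for_host() {
    local host=$1
    local interval=${2:-300}    # 5 minutes between attempts
    local max_wait=${3:-1200}   # give up after 20 minutes
    local elapsed=0

    while :; do
        # -c 1: send a single echo request; -W 2: wait at most 2 s for the reply
        if ping -c 1 -W 2 "$host" >/dev/null 2>&1; then
            echo "$host reachable after ${elapsed}s"
            return 0
        fi
        if [ "$elapsed" -ge "$max_wait" ]; then
            echo "$host still unreachable after ${elapsed}s; escalate to on-site support"
            return 1
        fi
        sleep "$interval"
        elapsed=$((elapsed + interval))
    done
}
```

With the defaults, `wait_for_host 192.0.2.10` probes at 0, 5, 10, 15, and 20 minutes before giving up (the address is a documentation placeholder).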

2. Network Problems

Network faults often cause startup failures, UnKnown instance states, or sudden disconnections.

Startup Failure Due to Network Errors

  • Port conflict:
netstat -anop | grep 15400
# Kill the conflicting process or change the DB port.
  • Primary‑standby link not established – check and stop the firewall:
systemctl status firewalld.service
systemctl stop firewalld.service
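
A plain `grep 15400` also matches established connections, PIDs, and longer port numbers that merely contain those digits. The sketch below narrows the port-conflict check to sockets that are actually listening on the port. `listeners_on_port` is a hypothetical helper name; it assumes the usual `netstat` or `ss` column layout, where the local address ends in `:PORT`.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads `netstat -anop` or `ss -ltnp` output on stdin and
# prints only the LISTEN lines that have an address field ending in :PORT.
# Usage: netstat -anop | listeners_on_port 15400
listeners_on_port() {
    awk -v port="$1" '
        /LISTEN/ {
            for (i = 1; i <= NF; i++) {
                # Split each field on ":" and compare the last piece with the port.
                # This handles 0.0.0.0:15400 as well as IPv6 forms such as :::15400.
                n = split($i, parts, ":")
                if (n > 1 && parts[n] == port) { print; next }
            }
        }'
}
```

If the function prints a line, the last columns of that line identify the process holding the port (run the listing as root to see processes owned by other users).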

Cluster State Abnormal

  • All instances UnKnown, mass failover, or frequent Connection reset by peer.
  • If ping works but SSH doesn't, resources may be exhausted.
  • Check NIC errors:
ifconfig enp125s0f0   # watch 'dropped' and 'errors'
  • Verify kernel parameters:
net.ipv4.tcp_retries1 = 3
net.ipv4.tcp_retries2 = 15
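
The NIC and kernel-parameter checks above can also be read straight from `/sys` and `/proc`, which is handy when `ifconfig` is not installed. This is a sketch under assumptions: both function names are hypothetical, and the optional last argument exists only so the helpers can be pointed at a different root directory for testing.

```shell
#!/usr/bin/env bash
# Hypothetical helpers, not GBase tools.

# Usage: check_kernel_param NAME EXPECTED [PROC_ROOT]
# Compares a sysctl value (read from /proc/sys) with the expected value.
check_kernel_param() {
    local name=$1 expected=$2 root=${3:-/proc/sys}
    local path actual
    # net.ipv4.tcp_retries1 -> /proc/sys/net/ipv4/tcp_retries1
    path="$root/$(printf '%s' "$name" | tr '.' '/')"
    if [ ! -r "$path" ]; then
        echo "$name: cannot read $path"
        return 2
    fi
    actual=$(cat "$path")
    if [ "$actual" = "$expected" ]; then
        echo "$name = $actual (ok)"
        return 0
    fi
    echo "$name = $actual (expected $expected)"
    return 1
}

# Usage: nic_error_counters IFACE [SYSFS_ROOT]
# Prints the kernel's error and drop counters for one interface.
nic_error_counters() {
    local iface=$1 root=${2:-/sys/class/net}
    local counter file
    if [ ! -d "$root/$iface/statistics" ]; then
        echo "$iface: no such interface"
        return 1
    fi
    for counter in rx_errors tx_errors rx_dropped tx_dropped; do
        file="$root/$iface/statistics/$counter"
        if [ -r "$file" ]; then
            echo "$counter=$(cat "$file")"
        fi
    done
}
```

For example, `check_kernel_param net.ipv4.tcp_retries2 15` and `nic_error_counters enp125s0f0`. The counters are cumulative since boot, so run the second command twice a few seconds apart and compare; a rising value matters more than a non-zero one.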

Connection Establishment Failures

  • Connection refused: verify port match, listening status, and process presence.
  • Failed to get connection descriptors – logs may show:
    • can not accept connection in pending mode
    • the database system is starting up
    • can not accept connection in standby mode
  • Use gs_om -t status --detail to check role state and reset if needed.
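
To see which of the three messages dominates a log, you can count them in a single pass. This is a rough sketch: `classify_conn_errors` is a hypothetical helper name, and it assumes only that each message appears verbatim somewhere on a log line. It makes no assumption about the rest of the log format.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads log lines on stdin and counts the three
# connection-state messages listed above.
# Usage: classify_conn_errors < /path/to/server.log   (path is a placeholder)
classify_conn_errors() {
    awk '
        # index() does a plain substring search, so no regex escaping is needed.
        index($0, "can not accept connection in pending mode") { pending++ }
        index($0, "the database system is starting up")        { starting++ }
        index($0, "can not accept connection in standby mode") { standby++ }
        END {
            printf "pending mode: %d\n", pending
            printf "starting up: %d\n", starting
            printf "standby mode: %d\n", standby
        }'
}
```

A high count for one message tells you which role state to look at first in the `gs_om -t status --detail` output.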

SQL Interrupted by Network

  • Errors like Connection reset by peer or Connection timed out.
  • Run gs_check to audit network config, and look for core dumps or recent role changes.

3. Disk Failures

Typical disk issues: insufficient space, bad blocks, unmounted filesystems.

Failures that corrupt the filesystem (instance stays Unknown):

  • Logs contain data path disc writable test failed.
  • Unmounted disk – directory permissions appear abnormal.
  • Bad blocks:
badblocks /dev/sdb1 -s -v
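
Before running a full `badblocks` scan, a quick manual write probe can confirm whether the data path accepts writes at all. The sketch below is an illustration only: `path_writable` is a hypothetical helper name, and this is not the test GBase 8c itself performs when it logs the message above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: creates, writes, and removes a small probe file in DIR.
# Usage: path_writable DIR
path_writable() {
    local dir=$1 probe
    if [ ! -d "$dir" ]; then
        echo "$dir: not a directory"
        return 1
    fi
    # mktemp fails if the filesystem is read-only, full, or not writable by this user.
    if ! probe=$(mktemp "$dir/.write_probe.XXXXXX" 2>/dev/null); then
        echo "$dir: cannot create a file"
        return 1
    fi
    if echo "probe" > "$probe" 2>/dev/null; then
        rm -f "$probe"
        echo "$dir: writable"
        return 0
    fi
    rm -f "$probe"
    echo "$dir: file created but write failed"
    return 1
}
```

Run it as the database OS user against the instance's data directory. If the directory looks empty or its ownership has changed, `mountpoint -q DIR` (from util-linux) reports whether anything is mounted there.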

Failures that don’t corrupt the filesystem but kill the process:

  • Logs show No space left on device or invalid page header.
  • Check with:
df -h

(In the original example output, /dev/sdb1 was at 9% usage, which is normal.)
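
On a host with many mounts it is easier to print only the filesystems that are close to full. This is a small sketch that filters `df -P` output: `full_filesystems` is a hypothetical helper name, the 90% default threshold is an assumption, and mount points containing spaces are not handled.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads `df -P` output on stdin and prints the mount
# points whose usage is at or above THRESHOLD percent (default 90).
# Usage: df -P | full_filesystems 90
full_filesystems() {
    awk -v limit="${1:-90}" '
        NR > 1 {                      # skip the header line
            use = $5                  # "Capacity" column, e.g. 95%
            sub(/%$/, "", use)
            if (use + 0 >= limit + 0) print $6 " " use "%"
        }'
}
```

No space left on device can also appear when a filesystem runs out of inodes while blocks are still free, so `df -i` is worth a look if `df -h` shows plenty of space.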

4. Database Troubleshooting

Logs

The server log provides the most direct clues about startup, runtime, and shutdown issues.

Key Views

  • pg_stat_activity – current session details.
  • pg_thread_wait_status – per‑thread wait events.
  • pg_locks – lock information.

Core Files

When a backend crashes, the core file is invaluable for debugging. Core dumps can impact performance and consume disk space, so address them promptly.

Check the current core file pattern:

cat /proc/sys/kernel/core_pattern
# e.g. /data/core/core-%e-%p-%t
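
The pattern determines where core files end up, and that is not always a directory. The sketch below interprets a pattern string using the Linux rules: a leading `|` pipes the dump to a program, an absolute path names a directory, and anything else is written relative to the crashing process's working directory. `core_dump_location` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: explains where a given core_pattern sends core dumps.
# Usage: core_dump_location "$(cat /proc/sys/kernel/core_pattern)"
core_dump_location() {
    local pattern=$1
    case $pattern in
        '|'*)
            # The kernel pipes the dump to this program instead of writing a file.
            echo "piped to handler: ${pattern#'|'}"
            ;;
        /*)
            # Absolute path: core files land in the pattern's directory.
            echo "directory: $(dirname "$pattern")"
            ;;
        *)
            echo "relative: written to the crashing process's working directory"
            ;;
    esac
}
```

For the example pattern above, the function reports `directory: /data/core`, which is the directory to watch for disk usage (for instance with `du -sh /data/core`).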

By methodically checking these four layers — OS, network, disk, and database internals — you can swiftly pinpoint the culprit behind most GBase 8c outages and restore service with confidence.
