Cong Li

Emergency Handling for GBase Database Failures (2)

1. Abnormal Resource Usage

1.1 Increased Swap Usage

Description

A significant number of nodes in the cluster exhibit high swap usage.

Analysis

This issue may be caused by a GBase software anomaly or by abnormal SQL that leads to memory overflow. If not addressed promptly, the growing memory usage can fill the swap space and cause a system crash.
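
To confirm the symptom before escalating, swap usage can be read directly from /proc/meminfo on each Linux node. Below is a minimal sketch; the 80% alert threshold is an illustrative assumption, not a GBase recommendation.

```python
#!/usr/bin/env python3
"""Minimal sketch: report swap usage on a Linux node via /proc/meminfo."""

def meminfo_kb(field):
    # Lines in /proc/meminfo look like "SwapTotal:  16777212 kB".
    with open("/proc/meminfo") as f:
        for line in f:
            if line.startswith(field + ":"):
                return int(line.split()[1])
    raise KeyError(field)

swap_total = meminfo_kb("SwapTotal")
swap_free = meminfo_kb("SwapFree")
used_pct = 100.0 * (swap_total - swap_free) / swap_total if swap_total else 0.0
print(f"swap used: {used_pct:.1f}%")
if used_pct > 80.0:  # illustrative threshold, tune to your environment
    print("WARNING: swap usage is high; investigate GBase memory consumption")
```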

Emergency Handling Procedure

This anomaly is usually caused by a GBase software defect or by abnormal SQL. Notify the relevant application team to assist in diagnosing the root cause.

1) The operations team contacts the open platform for assistance and notifies GBase on-site support to help diagnose the issue.
2) The operations team and GBase on-site support analyze the abnormal SQL running in the system.
3) The operations team stops the problematic SQL.
4) The open platform cleans up operating-system memory to reduce swap usage (see the sketch after this list).
5) GBase on-site support helps developers optimize the abnormal SQL.
6) The operations team ensures that untested SQL does not run in the production environment.
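
Step 4 commonly maps to the standard Linux page-cache drop mechanism. A minimal sketch, assuming root access on the node; note that drop_caches releases only clean caches, so pages already swapped out may additionally require a controlled swapoff/swapon once enough RAM is free.

```python
#!/usr/bin/env python3
"""Minimal sketch of step 4: flush dirty pages, then drop clean caches (requires root)."""
import os

os.sync()  # flush dirty pages to disk first, as recommended before drop_caches
with open("/proc/sys/vm/drop_caches", "w") as f:
    f.write("3\n")  # 3 = free page cache plus dentries and inodes
print("page cache, dentries, and inodes dropped")
```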

1.2 Increased CPU Usage

Description

A significant number of nodes in the cluster have high CPU usage, and I/O is nearing saturation.

Analysis

Most CPU time is spent on context switching, caused by high concurrency in GBase combined with several long-running tasks (over two hours).
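
The context-switch load described above can be verified with vmstat, whose cs column reports context switches per second. A minimal sketch, assuming the standard procps vmstat output layout:

```python
#!/usr/bin/env python3
"""Minimal sketch: sample the context-switch rate with vmstat."""
import subprocess

# "vmstat 1 2": the second sample reflects activity over the last second.
out = subprocess.run(["vmstat", "1", "2"], capture_output=True, text=True, check=True)
lines = out.stdout.strip().splitlines()
header = lines[1].split()   # column names, e.g. ... in cs us sy id wa ...
sample = lines[-1].split()  # most recent one-second sample
cs = int(sample[header.index("cs")])
sy = int(sample[header.index("sy")])
print(f"context switches/s: {cs}, system CPU%: {sy}")
```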

Emergency Handling Procedure

This anomaly is often caused by excessive business scheduling concurrency, which reduces the overall task processing speed.

1) The operations team contacts the open platform for assistance and notifies GBase on-site support to help diagnose the issue.
2) The operations team and GBase on-site support analyze the number of concurrent tasks running in the system.
3) If concurrency is too high, the operations team reduces it. If there are long-running SQL tasks, decide whether to kill them to avoid degrading overall performance (see the sketch after this list).
4) GBase on-site support helps developers optimize abnormal SQL and ensures that untested SQL does not run in the production environment.
5) The operations team avoids manually starting jobs outside the unified scheduling system.
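
For steps 2 and 3, long-running statements can be listed through GBase's MySQL-compatible interface. A minimal sketch; the host, account, and port are placeholders, and pymysql is only an illustrative driver choice:

```python
#!/usr/bin/env python3
"""Minimal sketch: list SQL running longer than 2 hours via the MySQL-compatible interface."""
import pymysql  # illustrative driver; any MySQL-protocol client should work

# Placeholder connection parameters: replace with your coordinator node and account.
conn = pymysql.connect(host="gbase-coordinator", user="monitor", password="***", port=5258)
try:
    with conn.cursor() as cur:
        cur.execute(
            "SELECT ID, USER, TIME, INFO FROM information_schema.PROCESSLIST "
            "WHERE COMMAND <> 'Sleep' AND TIME > 7200 ORDER BY TIME DESC"
        )
        for pid, user, seconds, sql in cur.fetchall():
            print(f"id={pid} user={user} running={seconds}s sql={(sql or '')[:80]}")
            # After confirming with the application team, a task can be stopped:
            # cur.execute(f"KILL {pid}")
finally:
    conn.close()
```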

1.3 Abnormally Busy Disk I/O

Description

A single node or multiple nodes in the cluster exhibit abnormally busy I/O.

Analysis

This issue may be caused by hardware failures, such as a faulty hard disk, backplane, or RAID card, which reduce the overall task processing speed.
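
A device heading toward saturation or failure typically shows sustained %util near 100 in iostat. A minimal sketch, assuming the sysstat iostat with its standard extended columns:

```python
#!/usr/bin/env python3
"""Minimal sketch: flag disks near I/O saturation via iostat -x."""
import subprocess

# "iostat -x 1 2": the second report covers the last one-second interval.
out = subprocess.run(["iostat", "-x", "1", "2"], capture_output=True, text=True, check=True)
report = out.stdout.strip().splitlines()
# Find the last device-table header; %util is its final column.
header_idx = max(i for i, l in enumerate(report) if l.startswith("Device"))
for line in report[header_idx + 1:]:
    fields = line.split()
    if not fields:
        continue
    device, util = fields[0], float(fields[-1])
    if util > 90.0:  # illustrative threshold for "abnormally busy"
        print(f"{device}: %util={util:.1f}, possible hardware bottleneck")
```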

Emergency Handling Procedure

This anomaly is usually caused by hardware failures.

1) The operations team contacts the open platform to confirm the issue and arrange follow-up actions.
2) The hardware vendor captures hardware operation logs and analyzes them.
3) While the fault is being diagnosed, the operations team reduces concurrency if it is too high and decides whether to kill long-running SQL tasks to avoid degrading overall performance.
4) Once the vendor analyzes the logs and identifies the faulty hardware, they replace it. If replacing the hardware requires stopping the operating system, the cluster services must be stopped.
5) GBase on-site support restores the service.

1.4 Disk Space Full or Exceeding Threshold

Description

One or more nodes in the cluster have disk usage that is full or above the 80% threshold.

Analysis

Nodes in the cluster have disk usage exceeding 80%. Because GBase data nodes must reserve 20%-30% of disk space as temporary space, once the disk fills up completely, some SQL operations will fail and GBase service processes may crash.
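
The 80% threshold can be watched with Python's standard library alone. A minimal sketch; the data directory path is a placeholder for wherever GBase stores its data on your nodes:

```python
#!/usr/bin/env python3
"""Minimal sketch: warn when a filesystem crosses the 80% usage threshold."""
import shutil

PATHS = ["/opt/gbase"]  # placeholder: substitute the actual GBase data directories

for path in PATHS:
    total, used, free = shutil.disk_usage(path)
    pct = 100.0 * used / total
    print(f"{path}: {pct:.1f}% used")
    if pct > 80.0:  # GBase data nodes need 20%-30% headroom for temporary space
        print(f"WARNING: {path} has less headroom than GBase temporary space requires")
```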

Emergency Handling Procedure

A sudden increase in disk usage is usually caused by Cartesian product SQL or GBase execution plan bugs.

1) The operations team analyzes the temporary space usage in GBase.
2) The operations team analyzes the running SQL to identify the problematic SQL.
3) Kill the problematic SQL and observe whether the space is released (see the sketch after this list).
4) If it is a Cartesian product issue, notify the development team for processing. If it is a GBase execution plan issue, report it to the database vendor and request a short-term solution and a long-term fix.
5) Restore the service.
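
For step 3, killing the suspect statement and watching whether temporary space is released can be combined in one loop. A minimal sketch under the same assumptions as before (MySQL-compatible interface; the process id, path, and connection details are placeholders):

```python
#!/usr/bin/env python3
"""Minimal sketch of step 3: kill a suspect SQL task, then watch for space release."""
import shutil
import time
import pymysql  # illustrative driver choice

SUSPECT_ID = 12345        # placeholder: process id found in PROCESSLIST
DATA_PATH = "/opt/gbase"  # placeholder: GBase data directory on this node

conn = pymysql.connect(host="gbase-coordinator", user="monitor", password="***", port=5258)
try:
    with conn.cursor() as cur:
        cur.execute(f"KILL {SUSPECT_ID}")
finally:
    conn.close()

# Poll disk usage for a few minutes to see whether temporary space is released.
for _ in range(10):
    total, used, _free = shutil.disk_usage(DATA_PATH)
    print(f"{DATA_PATH}: {100.0 * used / total:.1f}% used")
    time.sleep(30)
```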
