DEV Community

Cong Li

Emergency Handling for GBase Database Failures

1. Hardware-Level Failures

1.1 GBase Data Node Hardware-Level Failure

Symptom Description

  • A single GBase node crashes or hangs.

Analysis

  • The GBase data node host may crash due to power module failures, motherboard issues, local disk failures, network interruptions, or RAID card failures that disrupt all channels, rendering the system unable to provide normal services.

Emergency Procedure

  • The GBase cluster can keep running temporarily with one machine out of the cluster, but it cannot operate that way long-term. Stop business jobs as soon as possible so that the faulty hardware can be repaired. Recommended steps:
    1. The operations department contacts the open platform to confirm the problem and proceed with follow-up actions.
    2. The open platform notifies the equipment maintenance vendor for on-site support to repair the faulty hardware (10 minutes).
    3. The operations department stops jobs on the faulty cluster (1 to 4 hours, depending on task size).
    4. The hardware vendor repairs the faulty machine (4 to 8 hours).
    5. GBase on-site support starts the database service, checks data synchronization status, and repairs data if anomalies are found (30 minutes).
    6. The operations department starts the cluster jobs.
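
As a rough planning aid, the durations quoted in steps 2 to 5 can be added up to estimate the overall repair window. The sketch below is only arithmetic on the figures above. It assumes the steps run strictly one after another, which may not hold in practice, and it leaves out steps 1 and 6 because no duration is given for them. The names `STEPS`, `total_window`, and `fmt` are illustrative.

```python
# Durations in minutes as (label, low, high), taken from steps 2-5 above.
# Steps 1 and 6 have no stated duration and are not included.
STEPS = [
    ("notify maintenance vendor", 10, 10),
    ("stop jobs on the faulty cluster", 60, 240),
    ("repair the faulty machine", 240, 480),
    ("start GBase and check data synchronization", 30, 30),
]


def total_window(steps):
    """Return (low, high) total minutes, assuming the steps run sequentially."""
    low = sum(lo for _, lo, _ in steps)
    high = sum(hi for _, _, hi in steps)
    return low, high


def fmt(minutes):
    """Format a number of minutes as e.g. '5h40m'."""
    hours, mins = divmod(minutes, 60)
    return f"{hours}h{mins:02d}m"


low, high = total_window(STEPS)
print(f"Estimated repair window: {fmt(low)} to {fmt(high)}")
# prints: Estimated repair window: 5h40m to 12h40m
```

Under these assumptions the cluster would run with one node out for roughly 6 to 13 hours, most of it spent on the hardware repair itself.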

1.2 Loader Hardware-Level Failure

Symptom Description

  • The loader crashes or hangs.

Analysis

  • The loader host may crash due to power module failures, motherboard issues, local disk failures, network interruptions, or RAID card failures that disrupt all channels, rendering the system unable to provide normal services.

Emergency Procedure

  • The big data platform architecture achieves high availability for loaders. Failure of one or more (not all) of the 8 loaders will not interrupt applications. Recommended steps:
    1. The operations department contacts the open platform to confirm the problem and proceed with follow-up actions.
    2. The open platform notifies the equipment maintenance vendor for on-site support to repair the faulty hardware (10 minutes).
    3. The hardware vendor repairs the faulty machine (4 to 8 hours).
    4. GBase on-site support or the operations department administrator starts the loading service, application services, and any other required services on the loader (30 minutes).
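
The availability rule above (applications are interrupted only if all loaders fail) can be expressed as a small check. This is a minimal sketch, not part of any GBase tooling: the host names `loader01` to `loader08` are hypothetical, and treating "a TCP port accepts connections" as "the loader is up" is an assumption made only for illustration.

```python
import socket


def is_reachable(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def loading_interrupted(statuses):
    """Applications are interrupted only when every loader is down."""
    return not any(statuses.values())


def summarize(statuses):
    """List the loaders that are down and whether loading is interrupted."""
    down = sorted(name for name, up in statuses.items() if not up)
    return {"down": down, "interrupted": loading_interrupted(statuses)}


# Hypothetical status map for 8 loaders; in practice it could be built with
# {host: is_reachable(host, port) for host in hosts} for a port of your choice.
statuses = {f"loader{i:02d}": True for i in range(1, 9)}
statuses["loader03"] = False
statuses["loader07"] = False
print(summarize(statuses))
```

With two of the eight loaders down, `interrupted` is still `False`, which matches the statement that applications keep running as long as at least one loader survives.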

2. Operating System-Level Failures

2.1 GBase Data Node Operating System-Level Failure

2.1.1 Operating System Corruption

Symptom Description
  • Single node operating system corruption.
Analysis
  • A RAID card or operating system failure on the GBase data node prevents the system from providing services, and the OS must be reinstalled.
Emergency Procedure
  • The GBase cluster can keep running temporarily with one machine out of the cluster, but it cannot operate that way long-term. Stop business jobs as soon as possible so that the faulty machine can be repaired. Using a pre-prepared backup machine avoids the OS installation time and shortens the repair process. Recommended steps:
    1. The operations department contacts the open platform to confirm the problem and proceed with follow-up actions.
    2. Set up the backup machine, ready to join the cluster (10 minutes).
    3. The operations department stops jobs on the faulty cluster (1 to 4 hours, depending on task size).
    4. GBase on-site support stops the faulty machine, configures the backup machine IP, and synchronizes GBase data (12 to 24 hours, depending on data size).
    5. GBase on-site support starts the GBase cluster (20 minutes).
    6. The operations department starts the cluster jobs.

2.1.2 File System Failure

Symptom Description
  • File system or logical volume failure.
Analysis
  • Local disk or storage disk damage leads to file system or logical volume failure, full space, and abnormal application data read/write operations.
Emergency Procedure
  • Local disk failure causes system I/O read/write abnormalities, preventing normal service provision. Recommended steps:
    1. The operations department contacts the open platform to confirm the problem and proceed with follow-up actions.
    2. The open platform notifies the hardware maintenance vendor to check hardware logs and locate the issue.
    3. Attempt to log in to the system, then check the system logs and the disk read/write status.
    4. Because local disks usually use RAID 5, a pure file system failure is rare; suspect a hardware failure first.
    5. The hardware vendor replaces the faulty disk.
    6. If files are lost, restore them from backup. If GBase database files are damaged, repair them with GBase synchronization.
    7. GBase on-site support starts the service and observes if the issue is resolved.
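
The "full space" and "disk read/write status" checks in step 3 can be done with ordinary operating system tools. As one possible sketch, the Python standard library is enough for a quick first look. The function names are illustrative, and the directory to probe is whatever mount point holds the data in your environment.

```python
import os
import shutil
import tempfile


def disk_usage_percent(path):
    """Return the percentage of space used on the file system holding path."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        return 0.0
    return 100.0 * usage.used / usage.total


def write_read_probe(directory, size=1024 * 1024):
    """Write a temporary file, fsync it, and read it back.

    Returns True if the bytes match, False on any OS-level error
    (for example a read-only or full file system).
    Note: the read may be served from the page cache, so a True result
    does not prove the physical disk is healthy.
    """
    payload = os.urandom(size)
    try:
        with tempfile.NamedTemporaryFile(dir=directory) as f:
            f.write(payload)
            f.flush()
            os.fsync(f.fileno())
            f.seek(0)
            return f.read() == payload
    except OSError:
        return False


target = tempfile.gettempdir()  # replace with the mount point to inspect
print(f"{target}: {disk_usage_percent(target):.1f}% used, "
      f"write/read ok: {write_read_probe(target)}")
```

A failed probe or usage close to 100% points to the file system symptoms described above. It does not replace the hardware vendor's log analysis in step 2.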

2.2 Loader Operating System-Level Failure

2.2.1 Operating System Corruption

Symptom Description
  • Operating system corruption on a single loader.
Analysis
  • A RAID card or operating system failure on the loader prevents the system from providing services, and the OS must be reinstalled.
Emergency Procedure
  • The big data platform architecture achieves high availability for loaders. Failure of one or more (not all) of the 8 loaders will not interrupt applications. Recommended steps:
    1. The operations department contacts the open platform to confirm the problem and proceed with follow-up actions.
    2. The open platform reinstalls the operating system (1 hour).
    3. The open platform configures the IP, deploys GBase loading services, client, and application services (1 hour).
    4. GBase on-site support or the operations department administrator starts the loader service.

2.2.2 File System Failure

Symptom Description
  • File system or logical volume failure.
Analysis
  • Local disk or storage disk damage leads to file system or logical volume failure, full space, and abnormal application data read/write operations.
Emergency Procedure
  • The big data platform architecture achieves high availability for loaders. Failure of one or more (not all) of the 8 loaders will not interrupt applications. Recommended steps:
    1. The operations department contacts the open platform to confirm the problem and proceed with follow-up actions.
    2. The open platform notifies the hardware maintenance vendor to check hardware logs and locate the issue.
    3. Attempt to log in to the system, then check the system logs and the disk read/write status.
    4. Because local disks usually use RAID 5, a pure file system failure is rare; suspect a hardware failure first.
    5. The hardware vendor replaces the faulty disk.
    6. GBase on-site support or the operations department administrator starts the service and observes if the issue is resolved.
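
When checking system logs in step 3, it can help to filter for lines that point at disk or file system trouble. The sketch below is one way to do that. The pattern list is an assumption (common Linux error strings such as "I/O error", "Read-only file system", and "No space left on device"), not an exhaustive or GBase-specific list, and should be adapted to the system in question.

```python
import re

# Assumed patterns: generic Linux error strings, not a complete list.
IO_ERROR_PATTERN = re.compile(
    r"I/O error|Read-only file system|No space left on device",
    re.IGNORECASE,
)


def find_io_errors(lines):
    """Return the lines (without trailing newline) that match a known pattern."""
    return [line.rstrip("\n") for line in lines if IO_ERROR_PATTERN.search(line)]


def scan_file(path):
    """Scan a log file and return the matching lines."""
    with open(path, "r", errors="replace") as f:
        return find_io_errors(f)
```

For example, `scan_file("/var/log/messages")` would scan the main system log on distributions that use that path; the location differs between systems. An empty result does not rule out a hardware fault, so the vendor's hardware logs in step 2 remain the primary source.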
