
Cong Li


GBase 8c Database Failure Case Analysis - Startup Failure

Databases are the core of data storage and management, and enterprises depend on them to operate. Like any technology product, however, a database can run into problems. This article walks through troubleshooting a startup failure in the GBase database, using GBase 8c V5 5.0.0 as an example.

Failure Symptoms

After installation, the GBase 8c database fails to start normally and displays an error message like the one below:

Image description

Troubleshooting Process

(1) Investigate the Startup Failure Logs: Start by checking the logs for the cause of the failure. Navigate to the log directory by executing the following command:

   cd $GAUSSLOG

As shown in the following image:

Image description

a) The latest pg_log shows no significant errors in the kernel log. This suggests that the startup signal never reached the database kernel, which points to a failed check in the OM tool as the cause of the startup failure.

b) The om log contains clear errors indicating that port 5432 is occupied. For example, the following information was returned:

Image description
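A quick way to surface such messages without opening each file is to search the recently modified logs. The following is a minimal sketch, not a GBase tool: the function name `recent_log_errors`, the `*.log` file pattern, and the search keywords are all assumptions that should be adjusted to the actual log layout.

```shell
# Print matching lines from log files modified in the last N minutes.
# Hypothetical helper: arguments are a log directory and an optional
# time window in minutes (default 60). Keywords are assumptions.
recent_log_errors() {
    dir="$1"
    minutes="${2:-60}"
    # -H prints the file name, -n the line number, -i ignores case.
    find "$dir" -type f -name '*.log' -mmin "-$minutes" \
        -exec grep -HniE 'error|fatal|fail|in use' {} +
}
```

For example, `recent_log_errors "$GAUSSLOG" 30` would scan the files under the log directory that changed in the last 30 minutes.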

(2) Identify Port Occupancy: To check the port occupancy, execute the following command:

   netstat -anpt | grep 5432

The output may look something like this:

Image description

This indicates that a postgres process on the machine is occupying port 5432, the port specified in the GBase 8c configuration file.
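Note that a plain grep for 5432 also matches ports such as 15432, or a process whose PID happens to contain 5432. A small filter can restrict the match to listening sockets on exactly that port. This is a sketch that assumes the usual Linux `netstat -anpt` column layout; `listeners_on_port` is a hypothetical helper name.

```shell
# List listeners on exactly the given TCP port.
# Reads `netstat -anpt` output on stdin and prints
# "local_address pid/program" for each matching LISTEN socket.
# Usage: netstat -anpt | listeners_on_port 5432
listeners_on_port() {
    awk -v port="$1" '
        # Column 4 is the local address, column 6 the socket state.
        $6 == "LISTEN" && $4 ~ (":" port "$") { print $4, $7 }
    '
}
```

If the function prints nothing, no process is listening on that port. The same check is useful for confirming that a replacement port is free before configuring it.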

(3) Resolve the Port Issue: To avoid affecting other services, you can change the port of the GBase 8c service process.

a) Navigate to the installation directory:

   cd $GAUSSHOME

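b) Modify the postgresql.conf configuration file (if it is not in this directory, check the instance's data directory, where PostgreSQL-derived databases normally keep it):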

   vim postgresql.conf
Change the port number from 5432 to an unused port, such as 15400.
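The same edit can be scripted rather than made in an editor. The sketch below assumes the file uses the standard PostgreSQL `port = N` syntax; `set_conf_port` is a hypothetical helper, and it keeps a `.bak` copy so the change can be reverted.

```shell
# Set the listening port in a PostgreSQL-style config file.
# Hypothetical helper: arguments are the config file path and the new port.
# A backup of the original is kept next to it as <file>.bak.
set_conf_port() {
    conf="$1"
    new_port="$2"
    cp "$conf" "$conf.bak" || return 1
    # Rewrite the "port = N" line, whether or not it is commented out.
    # Any trailing inline comment on that line is dropped.
    sed -E "s/^[[:space:]]*#?[[:space:]]*port[[:space:]]*=.*/port = ${new_port}/" \
        "$conf.bak" > "$conf"
}
```

For example, `set_conf_port postgresql.conf 15400` would apply the change described above. If the file contains no `port` line at all, it is left unchanged, so it is worth checking the result with `grep '^port' postgresql.conf`.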

c) Restart the database service by executing:

   gs_om -t restart
This time, the startup succeeds without the port occupancy error. A memory shortage warning still appears because the demonstration machine has little memory and hosts several other applications; it can be ignored here, since the focus is on the startup failure.

Image description

Note: If the other service occupying the port is no longer needed, you can instead terminate its process with the kill -9 command. Proceed with caution!

Troubleshooting Approach

While this example is straightforward and the issue is obvious, in other practical environments, a comprehensive investigation may be required. The following is a general troubleshooting approach:

  • Check Logs: First, examine the database startup logs to understand the specific reason for the startup failure.
  • Check Configuration Parameters: Ensure that configuration parameters are reasonable and that the system resources are sufficient and meet internal constraints.
  • Check Data Node Status: Verify that all data nodes are functioning correctly and that there are no abnormal nodes causing the overall startup failure.
  • Check Directory Permissions: Ensure that the database data directory and key system directories (e.g., /tmp) have correct permissions set.
  • Check Port Occupancy: Confirm that the configured port is not occupied by other services.
  • Check Firewall Settings: Ensure that the system's firewall allows traffic to the database service's port.
  • Check Trust Relationships: Verify that the trust relationships between nodes are correctly established.
  • Check Machine Resource Usage: For example, check disk usage with df -Th, CPU usage with top, memory usage with free -g, and whether the primary and backup networks are functioning properly.
  • Use dmesg to Check OS Error Logs: Look for hardware failures, system reboots, or other warning messages.
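The resource and OS checks in this list can be gathered in one pass. The following is a minimal sketch using the commands named above (`df -Th`, `free -g`, `top`, `dmesg`); the function names `run_check` and `collect_diagnostics` are hypothetical, and the `top -b -n 1` batch-mode flags assume a Linux procps `top`.

```shell
# Run one diagnostic command under a heading; skip it if the tool is missing.
# Output is truncated to the last 20 lines to keep the report short.
run_check() {
    label="$1"
    shift
    echo "== $label =="
    if command -v "$1" >/dev/null 2>&1; then
        "$@" 2>&1 | tail -n 20
    else
        echo "skipped: $1 not installed"
    fi
}

# Collect the read-only checks from the list above.
collect_diagnostics() {
    run_check "disk usage" df -Th
    run_check "memory (GiB)" free -g
    run_check "CPU snapshot" top -b -n 1
    run_check "recent kernel messages" dmesg
}
```

Calling `collect_diagnostics > diag.txt` saves a snapshot that can be compared before and after a failed start. Reading `dmesg` may require elevated privileges on some systems, in which case that section shows the permission error instead.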

Solutions

How can we solve these issues? Typically, the following approaches can be used, but make sure to analyze the specific problem identified during troubleshooting:

  • Adjust Configuration Parameters: Modify unreasonable configuration parameters based on log prompts to ensure sufficient system resources and compliance with internal constraints.
  • Repair Data Nodes: Repair or replace abnormal data nodes to ensure that all nodes are functioning normally.
  • Modify Directory Permissions: Use the chmod command to adjust directory permissions, ensuring that the database user has the necessary read and write permissions.
  • Release Ports: Stop the service occupying the port or change the database service's port number.
  • Adjust Firewall Settings: Open the database service's port in the firewall or allow specific IP addresses to access the database service.
  • Re-establish Trust Relationships: Reconfigure the trust relationships between nodes according to the GBase deployment documentation.
  • Handle Resource Shortages: Use the top command to investigate which processes are consuming resources on the machine, assess their normality, and then determine if the related processes can be optimized.
  • Resolve Hardware Failures: Repair or replace the faulty hardware, confirm that the operating system is in a normal state, and then restart the database.
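Before changing permissions with chmod, it helps to confirm which directories the database user actually cannot access. The sketch below is a generic check, not a GBase utility; `check_dir_access` is a hypothetical helper and should be run as the operating system user that owns the database.

```shell
# Report whether the current user can read, write, and enter a directory.
# Returns non-zero when the directory is missing or not fully accessible.
check_dir_access() {
    dir="$1"
    if [ ! -d "$dir" ]; then
        echo "$dir: missing"
        return 1
    fi
    if [ -r "$dir" ] && [ -w "$dir" ] && [ -x "$dir" ]; then
        echo "$dir: ok"
    else
        echo "$dir: insufficient permissions"
        return 1
    fi
}
```

For example, `check_dir_access /tmp` checks one of the key system directories mentioned earlier. Running the check as root is not meaningful, because root bypasses ordinary permission bits.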

This case study is a reminder that any technical product, especially a database, needs regular backups and inspections to prevent damage and failure.
