DEV Community

Cong Li
Cong Li

Posted on

GBase Database Cluster Monitoring

1. Cluster Status Monitoring

1.1 Viewing Cluster Status

To monitor the running status of each node in the cluster, including the status of gcware services, gcluster, gnode, and syncserver services, as well as data consistency and node availability, you can use the following three methods:

1) Frontend Method:

Execution Frequency: Every hour

Execution Environment: Unified monitoring platform of the Big Data Platform

Execution Steps:

  • Log in to the unified monitoring platform of the Big Data Platform and check the overall running status of the cluster on the "Home" page, as shown below:

Image description

  • Green indicates normal cluster status, while red indicates an abnormal cluster status.

    • To check the running status of all nodes in a specific cluster, click on "Monitoring Center --> GBase Environment --> Cluster Status Monitoring".

2) Backend Method:

Execution Frequency: Every hour

Execution Environment: ORACLE monitoring database

Execution Steps:

Description: The DWMS_DATABASE_INFO table in the monitoring database records the status of all production environment nodes, collected every minute using gcadmin. The field meanings are consistent with those in gcadmin.

  • To check for node abnormalities:
  select cluster_id, node_id, gnode, gcluster, data_state, node_state from dwms_database_info where valid_flag = 0 and (gnode = 'CLOSE' or gcluster = 'CLOSE' or data_state = 1 or node_state = 'Offline');
Enter fullscreen mode Exit fullscreen mode

If there are records, it indicates an abnormal cluster status.

  • To check if the loader has any abnormalities:
  select cluster_id, node_id from dwms_node_conf where (cluster_id, node_id) not in (select cluster_id, node_id from dwms_database_info where valid_flag = 0);
Enter fullscreen mode Exit fullscreen mode

3) Manual Script Execution

Execution Frequency: Every hour

Execution Environment: Each node in the cluster

Execution Steps:

  • Log in to any node in the cluster you want to check and use the gcadmin command to view the current status of the cluster, as shown below:

Image description

Descriptions:

  • gcware: Online is normal, Offline is abnormal
  • gnode: OPEN is normal, CLOSE is abnormal
  • gcluster: OPEN is normal, CLOSE is abnormal
  • syncserver: OPEN is normal, CLOSE is abnormal
  • datastate: 0 indicates normal node data, 1 indicates abnormal node data. If 1 appears, the cluster sync tool will automatically recover.
  • nodestate: Displays the health status of the node host. 0 is normal, 1 indicates a fault status.

1.2 Handling Cluster Status Abnormalities

Any abnormalities in gcware, gnode, gcluster, syncserver, datastate, or nodestate during monitoring require attention. Issues must be addressed, reported, and resolved in collaboration to handle the abnormal node.

2. SQL Monitoring

2.1 Viewing Current SQL

When the cluster is running slowly, you can view all SQL currently being executed in the cluster to determine which SQL statements are taking too long and whether the overall execution of SQL in the cluster is slow. Monitoring the SQL running status can help identify cluster problems or potential issues.

Three conditions to determine if the cluster's overall SQL execution is slow:

  1. There are more than 10 SQL statements running for over 1000 seconds, and no SQL executing under 10 seconds.
  2. A high proportion of DDL statements (create, drop) with most taking longer than 30 seconds.
  3. Simple SQL execution (e.g., select count(*) from table_a) taking longer than 30 seconds.

You can view the running SQL using the following three methods:

1) Frontend Method:

Execution Frequency: Every hour

Execution Environment: Unified monitoring platform of the Big Data Platform

Execution Steps:

  • Log in to the unified monitoring platform of the Big Data Platform and click on "Monitoring Center --> GBase Environment --> GBase Concurrency Monitoring".

2) Backend Method:

Execution Frequency: Every hour

Execution Environment: ORACLE monitoring database

Execution Steps:

Description: The DWMS_SQL_INFO table in the monitoring database records the SQL information currently being executed in all production environments, collected every minute.

  • To view currently executing SQL:
  select exe_node_ip, session_pid, session_time, sql_info from dwms_sql_info where valid_flag = 0 and sql_type = 'TRANS' and sql_info not like '%sub_step%' and sql_info not like '%GCLUSTER_LOCAL%' and cluster_id='BDPA' order by to_number(session_time);
Enter fullscreen mode Exit fullscreen mode
  • To view current loading processes:
  select exe_node_ip, session_pid, session_time, sql_info from dwms_sql_info where valid_flag = 0 and sql_type = 'LOAD' and sql_info not like '%sub_step%' and sql_info not like '%GCLUSTER_LOCAL%' and cluster_id='BDPA' order by to_number(session_time);
Enter fullscreen mode Exit fullscreen mode

3) Manual Script Execution:

Execution Frequency: Every minute

Execution Environment: First loader of each cluster

Execution Steps:

  • Log in to the first loader of each environment and execute:
  tail -f gbase/ping_node/always_result.txt
Enter fullscreen mode Exit fullscreen mode

to view all SQL currently being executed in the cluster. non-load process number: 46 indicates the number of non-load SQL processes, and load process number: 0 indicates the number of load processes.

Image description

Note:
Using the above three methods, you can view the problematic SQL execution time, originating node, and SQL statement. However, the SQL displayed is truncated. To view the full SQL, log in to the node where the SQL originated and execute:

show full processlist;
Enter fullscreen mode Exit fullscreen mode

Or, directly from the Linux command line:

gbase –u username –p password –h node_IP -e "show full processlist;" | grep -ivE "sleep|Daemon"
Enter fullscreen mode Exit fullscreen mode

Since the GBase cluster uses a distributed flat architecture, each node can act as the cluster entry point. Therefore, to view all SQL currently being executed in the entire cluster, you need to connect to each node and execute the above SQL.

2.2 Handling SQL Abnormalities

SQL monitoring mainly focuses on SQL statements with long execution times. Long execution times may be due to:

  1. SQL statements not optimized for GBase features, resulting in low execution efficiency.
  2. Cartesian product issues. If present, the cluster may lack resources, drastically reducing performance.

To address SQL statement issues, focus on:

  • Improper table types or distribution columns
  • Cartesian product problems

If an SQL statement is problematic and its continued execution may affect cluster usage, use the kill command to terminate the problematic SQL.

Top comments (0)