DEV Community

leo
openGauss routine maintenance: check the health status of openGauss

Check the health status of openGauss
Inspection method
The gs_check tool provided by openGauss can be used to check the health status of openGauss.

Precautions

Checks on new nodes for capacity expansion must be run as the root user; all other scenarios must be run as the omm user.
Either -i or -e must be specified: -i checks the specified item or items, and -e checks all items in the corresponding scenario configuration.
If neither the -i items nor the item list of the -e scenario includes a root-class check item, you are not prompted for the root user and its password.
Use --skip-root-items to skip the root-class check items, so that no root-privileged user and password need to be entered.
To check the consistency between a newly added node and the existing nodes, run gs_check on an existing node with the --hosts parameter; the hosts file must contain the IP address of the new node.
Steps

Method 1:

Log in to the active database node as the operating system user omm.

Execute the following command to check the status of the openGauss database.

gs_check -i CheckClusterState
Here -i specifies the check item; item names are case-sensitive. Format: -i CheckClusterState, -i CheckCPU, or, for several items at once, -i CheckClusterState,CheckCPU.

The value can be the name of any supported check item. For a detailed list, see "Server Tools > gs_checkos > openGauss Status Check List" in the "openGauss Tool Reference". You can also write new check items as needed.

Method 2:

Log in to the active database node as the operating system user omm.

Run the following command to check the health of the openGauss database.

gs_check -e inspect
Here -e specifies the scenario name; scenario names are case-sensitive. Format: -e inspect or -e upgrade.

The value can be any supported inspection scenario name. The default list includes inspect (routine inspection), upgrade (pre-upgrade inspection), install (installation), binary_upgrade (inspection before an in-place upgrade), health (health check inspection), slow_node (node), and longtime (long-running inspection). You can also write your own scenarios as needed.

The inspection verifies that openGauss as a whole is in a normal state while it is running, or before major operations such as an upgrade or capacity expansion, so that openGauss meets the environmental and status conditions required for the operation. For detailed inspection items and scenarios, see "Server Tools > gs_checkos > openGauss Status Checklist" in the openGauss Tool Reference.
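The invocation rules described above (-i or -e is required, several items are passed comma-separated, root-class items can be skipped) can be sketched as a small Python helper that only assembles the command line. The function name build_gs_check_argv is hypothetical, and treating -i and -e as mutually exclusive is an assumption of this sketch; only the flags -i, -e, --skip-root-items and -L are taken from this article.

```python
def build_gs_check_argv(items=None, scenario=None, skip_root_items=False, local=False):
    """Assemble an argv list for gs_check (hypothetical helper, does not run it)."""
    # Assumption: exactly one of -i (items) or -e (scenario) is given per call.
    if bool(items) == bool(scenario):
        raise ValueError("specify either items (-i) or a scenario (-e)")
    argv = ["gs_check"]
    if items:
        # Several check items are passed as one comma-separated value.
        argv += ["-i", ",".join(items)]
    else:
        argv += ["-e", scenario]
    if skip_root_items:
        # Skips root-class items, so no root user or password is requested.
        argv.append("--skip-root-items")
    if local:
        # -L requests the local execution result.
        argv.append("-L")
    return argv


if __name__ == "__main__":
    print(" ".join(build_gs_check_argv(items=["CheckClusterState", "CheckCPU"])))
    print(" ".join(build_gs_check_argv(scenario="inspect", skip_root_items=True)))
```

The returned list could be handed to subprocess.run on a node where gs_check is installed and the current user is omm.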

Example

Result of a single-item check:

perfadm@lfgp000700749:/opt/huawei/perfadm/tool/script> gs_check -i CheckCPU
Parsing the check items config file successfully
Distribute the context file to remote hosts successfully
Start to health check for the cluster. Total Items:1 Nodes:3

Checking... [=========================] 1/1
Start to analysis the check result
CheckCPU....................................OK
The item run on 3 nodes. success: 3

Analysis the check result successfully
Success. All check items run completed. Total:1 Success:1 Failed:0
For more information please refer to /opt/huawei/wisequery/script/gspylib/inspection/output/CheckReport_201902193704661604.tar.gz
Local execution result:

perfadm@lfgp000700749:/opt/huawei/perfadm/tool/script> gs_check -i CheckCPU -L

2017-12-29 17:09:29 [NAM] CheckCPU
2017-12-29 17:09:29 [STD] Check the host CPU usage. The item passes if idle is greater than 30% and iowait is less than 30%; otherwise the item fails.
2017-12-29 17:09:29 [RST] OK

2017-12-29 17:09:29 [RAW]
Linux 4.4.21-69-default (lfgp000700749) 12/29/17 x86_64

17:09:24 CPU %user %nice %system %iowait %steal %idle
17:09:25 all 0.25 0.00 0.25 0.00 0.00 99.50
17:09:26 all 0.25 0.00 0.13 0.00 0.00 99.62
17:09:27 all 0.25 0.00 0.25 0.13 0.00 99.37
17:09:28 all 0.38 0.00 0.25 0.00 0.13 99.25
17:09:29 all 1.00 0.00 0.88 0.00 0.00 98.12
Average: all 0.43 0.00 0.35 0.03 0.03 99.17
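The pass rule stated in the [STD] line (idle above 30% and iowait below 30%) can be reproduced on the raw sar-style table. The following Python sketch is an illustration only, not the code gs_check itself uses: the function name cpu_check_passes is hypothetical, and it assumes a 24-hour timestamp column and an "Average:" row as in the sample above.

```python
def cpu_check_passes(sar_output, idle_min=30.0, iowait_max=30.0):
    """Apply the CheckCPU rule to the 'Average:' row of sar-style output."""
    header = None
    for line in sar_output.splitlines():
        fields = line.split()
        if "%idle" in fields:
            # Column names without the leading timestamp field.
            header = fields[1:]
        elif fields and fields[0] == "Average:" and header:
            # Pair each column name with the value in the same position.
            values = dict(zip(header, fields[1:]))
            idle = float(values["%idle"])
            iowait = float(values["%iowait"])
            return idle > idle_min and iowait < iowait_max
    raise ValueError("no 'Average:' row found")


if __name__ == "__main__":
    sample = (
        "17:09:24 CPU %user %nice %system %iowait %steal %idle\n"
        "17:09:25 all 0.25 0.00 0.25 0.00 0.00 99.50\n"
        "Average: all 0.43 0.00 0.35 0.03 0.03 99.17\n"
    )
    print(cpu_check_passes(sample))
```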
Result of a scenario check:

[perfadm@SIA1000131072 Check]$ gs_check -e inspect
Parsing the check items config file successfully
The below items require root privileges to execute:[CheckBlockdev CheckIOrequestqueue CheckIOConfigure CheckCheckMultiQueue CheckFirewall CheckSshdService CheckSshdConfig CheckCrondService CheckBootItems CheckFilehandle CheckNICModel CheckDropCache]
Please enter root privileges user[root]:root
Please enter password for user[root]:
Please enter password for user[root] on the node[10.244.57.240]:
Check root password connection successfully
Distribute the context file to remote hosts successfully
Start to health check for the cluster. Total Items:57 Nodes:2

Checking... [ ] 21/57
Checking... [=========================] 57/57
Start to analysis the check result
CheckClusterState...........................OK
The item run on 2 nodes. success: 2

CheckDBParams...............................OK
The item run on 1 nodes. success: 1

CheckDebugSwitch............................OK
The item run on 2 nodes. success: 2

CheckDirPermissions.........................OK
The item run on 2 nodes. success: 2

CheckReadonlyMode...........................OK
The item run on 1 nodes. success: 1

CheckEnvProfile.............................OK
The item run on 2 nodes. success: 2 (consistent)
The success on all nodes value:
GAUSSHOME /usr1/gaussdb/app
LD_LIBRARY_PATH /usr1/gaussdb/app/lib
PATH /usr1/gaussdb/app/bin

CheckBlockdev...............................OK
The item run on 2 nodes. success: 2

CheckCurConnCount...........................OK
The item run on 1 nodes. success: 1

CheckCursorNum..............................OK
The item run on 1 nodes. success: 1

CheckPgxcgroup..............................OK
The item run on 1 nodes. success: 1

CheckDiskFormat.............................OK
The item run on 2 nodes. success: 2

CheckSpaceUsage.............................OK
The item run on 2 nodes. success: 2

CheckInodeUsage.............................OK
The item run on 2 nodes. success: 2

CheckSwapMemory.............................OK
The item run on 2 nodes. success: 2

CheckLogicalBlock...........................OK
The item run on 2 nodes. success: 2

CheckIOrequestqueue.....................WARNING
The item run on 2 nodes. warning: 2
The warning[host240,host157] value:
On device (vdb) 'IO Request' RealValue '256' ExpectedValue '32768'
On device (vda) 'IO Request' RealValue '256' ExpectedValue '32768'

CheckMaxAsyIOrequests.......................OK
The item run on 2 nodes. success: 2

CheckIOConfigure............................OK
The item run on 2 nodes. success: 2

CheckMTU....................................OK
The item run on 2 nodes. success: 2 (consistent)
The success on all nodes value:
1500

CheckPing...................................OK
The item run on 2 nodes. success: 2

CheckRXTX...................................NG
The item run on 2 nodes. ng: 2
The ng[host240,host157] value:
NetWork[eth0]
RX: 256
TX: 256

CheckNetWorkDrop............................OK
The item run on 2 nodes. success: 2

CheckMultiQueue.............................OK
The item run on 2 nodes. success: 2

CheckEncoding...............................OK
The item run on 2 nodes. success: 2 (consistent)
The success on all nodes value:
LANG=en_US.UTF-8

CheckFirewall...............................OK
The item run on 2 nodes. success: 2

CheckKernelVer..............................OK
The item run on 2 nodes. success: 2 (consistent)
The success on all nodes value:
3.10.0-957.el7.x86_64

CheckMaxHandle..............................OK
The item run on 2 nodes. success: 2

CheckNTPD...................................OK
host240: NTPD service is running, 2020-06-02 17:00:28
host157: NTPD service is running, 2020-06-02 17:00:06

CheckOSVer..................................OK
host240: The current OS is centos 7.6 64bit.
host157: The current OS is centos 7.6 64bit.

CheckSysParams..........................WARNING
The item run on 2 nodes. warning: 2
The warning[host240,host157] value:
Warning reason: variable 'net.ipv4.tcp_retries1' RealValue '3' ExpectedValue '5'.
Warning reason: variable 'net.ipv4.tcp_syn_retries' RealValue '6' ExpectedValue '5'.

CheckTHP....................................OK
The item run on 2 nodes. success: 2

CheckTimeZone...............................OK
The item run on 2 nodes. success: 2 (consistent)
The success on all nodes value:
+0800

CheckCPU....................................OK
The item run on 2 nodes. success: 2

CheckSshdService............................OK
The item run on 2 nodes. success: 2

Warning reason: UseDNS parameter is not set; expected: no

CheckCrondService...........................OK
The item run on 2 nodes. success: 2

CheckStack..................................OK
The item run on 2 nodes. success: 2 (consistent)
The success on all nodes value:
8192

CheckSysPortRange...........................OK
The item run on 2 nodes. success: 2

CheckMemInfo................................OK
The item run on 2 nodes. success: 2 (consistent)
The success on all nodes value:
totalMem: 31.260929107666016G

CheckHyperThread............................OK
The item run on 2 nodes. success: 2

CheckTableSpace.............................OK
The item run on 1 nodes. success: 1

CheckSysadminUser...........................OK
The item run on 1 nodes. success: 1

CheckGUCConsistent..........................OK
All DN instance guc value is consistent.

CheckMaxProcMemory..........................OK
The item run on 1 nodes. success: 1

CheckBootItems..............................OK
The item run on 2 nodes. success: 2

CheckHashIndex..............................OK
The item run on 1 nodes. success: 1

CheckPgxcRedistb............................OK
The item run on 1 nodes. success: 1

CheckNodeGroupName..........................OK
The item run on 1 nodes. success: 1

CheckTDDate.................................OK
The item run on 1 nodes. success: 1

CheckDilateSysTab...........................OK
The item run on 1 nodes. success: 1

CheckKeyProAdj..............................OK
The item run on 2 nodes. success: 2

CheckProStartTime.......................WARNING
host157:
STARTED COMMAND
Tue Jun 2 16:57:18 2020 /usr1/dmuser/dmserver/metricdb1/server/bin/gaussdb --single_node -D /usr1/dmuser/dmb1/data -p 22204
Mon Jun 1 16:15:15 2020 /usr1/gaussdb/app/bin/gaussdb -D /usr1/gaussdb/data/dn1 -M standby

CheckFilehandle.............................OK
The item run on 2 nodes. success: 2

CheckRouting................................OK
The item run on 2 nodes. success: 2

CheckNICModel...............................OK
The item run on 2 nodes. success: 2 (consistent)
The success on all nodes value:
version: 1.0.1
model: Red Hat, Inc. Virtio network device

CheckDropCache..........................WARNING
The item run on 2 nodes. warning: 2
The warning[host240,host157] value:
No DropCache process is running

CheckMpprcFile..............................NG
The item run on 2 nodes. ng: 2
The ng[host240,host157] value:
There is no mpprc file

Analysis the check result successfully
Failed. All check items run completed. Total:57 Success:50 Warning:5 NG:2
For more information please refer to /usr1/gaussdb/tool/script/gspylib/inspection/output/CheckReport_inspect611.tar.gz
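With 57 items in one run, it can help to pull out only the items that need attention. The Python sketch below parses text shaped like the output above; the names parse_gs_check_output and failing_items are hypothetical, and the line formats are inferred from this sample rather than from a documented output specification.

```python
import re

# An item line looks like "CheckCPU......OK": name, a run of dots, then the status.
ITEM_LINE = re.compile(r"^(Check\w+)\.{3,}(OK|WARNING|NG)\s*$")
# Counters in the summary line look like "Total:57".
COUNTER = re.compile(r"(\w+):(\d+)")


def parse_gs_check_output(text):
    """Return ({item: status}, {counter: value}) from gs_check-style output."""
    items = {}
    summary = {}
    for line in text.splitlines():
        match = ITEM_LINE.match(line.strip())
        if match:
            items[match.group(1)] = match.group(2)
        elif "All check items run completed" in line:
            summary = {key: int(value) for key, value in COUNTER.findall(line)}
    return items, summary


def failing_items(items):
    """List the items whose status is not OK, in the order they were reported."""
    return [name for name, status in items.items() if status != "OK"]


if __name__ == "__main__":
    sample = (
        "CheckCPU....................................OK\n"
        "CheckRXTX...................................NG\n"
        "CheckSysParams..........................WARNING\n"
        "Failed. All check items run completed. Total:57 Success:50 Warning:5 NG:2\n"
    )
    parsed, counts = parse_gs_check_output(sample)
    print(failing_items(parsed), counts)
```

The detailed values under each item are ignored here; the report archive named in the last line of the output remains the place to look for them.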
Exception handling
If an inspection result is abnormal, repair it as described below.

Table 1 Checking the openGauss running status

Check item

Abnormal state

Approach

CheckClusterState (check openGauss state)

openGauss or an openGauss instance is not started

Use the following command to start openGauss and its instances.

gs_om -t start
The state of openGauss or of an openGauss instance is abnormal

Check the status of each host and instance, and troubleshoot based on the status information.

gs_check -i CheckClusterState
CheckDBParams (check database parameters)

Database parameter error

Use the gs_guc tool to modify the database parameters to the specified values.

CheckDebugSwitch (check debug log)

Incorrect log level

Use the gs_guc tool to change log_min_messages to the specified content.

CheckDirPermissions (check directory permissions)

Path permission error

Modify the corresponding directory permissions to the specified value (750/700).

chmod 750 DIR
CheckReadonlyMode (check read-only mode)

Read-only mode is turned on

Confirm that the disk usage of the database node has not exceeded the threshold (85% by default) and that no other O&M operations are being performed.

gs_check -i CheckDataDiskUsage
ps ux
Use the gs_guc tool to turn off openGauss read-only mode.

gs_guc reload -N all -I all -c 'default_transaction_read_only = off'
CheckEnvProfile (check environment variables)

Inconsistent environment variables

Run the preinstallation step again to update the environment variable information.

CheckBlockdev (check disk read-ahead blocks)

The disk read-ahead block size is not 16384

Use gs_checkos to set the read-ahead block size to 16384 KB and write the setting to the startup file.

gs_checkos -i B3
CheckCursorNum (check cursor number)

Checking cursor count failed

Check whether the database can be connected normally and whether the openGauss status is normal.

CheckPgxcgroup (check redistribution status)

The pgxc_group table contains an unfinished redistribution

Complete the data redistribution operation for the capacity expansion or contraction.

gs_expand, gs_shrink
CheckDiskFormat (check disk configuration)

The disk configuration of each node is inconsistent

Change the disk specifications of each node to be the same.

CheckSpaceUsage (check disk space usage)

Insufficient free disk space

Clean up or expand the disk where the corresponding directory is located.

CheckInodeUsage (check disk inode usage)

Insufficient inodes available on disk

Clean up or expand the disk where the corresponding directory is located.

CheckSwapMemory (check swap memory)

Swap memory is larger than physical memory

Reduce or disable swap memory.

CheckLogicalBlock (check disk logical block)

Disk logical block size is not 512

Use gs_checkos to set the disk logical block size to 512 KB and write the setting to the startup file.

gs_checkos -i B4
CheckIOrequestqueue (check IO request)

IO request value is not 32768

Use gs_checkos to set the IO request value to 32768 and write the setting to the startup file.

gs_checkos -i B4
CheckCurConnCount (check the current number of connections)

The current number of connections exceeds 90% of the maximum number of connections

Disconnect unused connections to the database master node.

CheckMaxAsyIOrequests (check the maximum asynchronous requests)

The maximum asynchronous request value is less than 104857600 or less than the number of database instances on the current node multiplied by 1048576

Use gs_checkos to set the maximum asynchronous request value to the larger of 104857600 and the number of database instances on the current node multiplied by 1048576.

gs_checkos -i B4
CheckMTU (check MTU value)

Inconsistent MTU values

Set the MTU of each node to 1500 or 8192.

ifconfig eth* mtu 1500
CheckIOConfigure (check IO configuration)

IO configuration is not deadline

Use gs_checkos to set the IO configuration to deadline and write the setting to the startup file.

gs_checkos -i B4
CheckRXTX (check RXTX value)

Network card RX/TX value is not 4096

Use gs_checkos to set the RX/TX value of the physical network card used by openGauss to 4096.

gs_checkos -i B5
CheckPing (check network connectivity)

There is an openGauss IP that cannot be pinged

Check the network settings and status, and firewall status between abnormal IPs.

CheckNetWorkDrop (check network packet loss rate)

Network communication packet loss rate is higher than 1%

Check the network load and status between corresponding IPs.

CheckMultiQueue (check network card multi-queue)

The network card multi-queue is not enabled and the network card interrupt is not bound to different CPU cores

Enable NIC multi-queue and bind NIC queue interrupts to different CPU cores.

CheckEncoding (check encoding format)

The encoding format of each node is inconsistent

Write consistent encoding information in /etc/profile.

echo "export LANG=XXX" >> /etc/profile
CheckFirewall (check firewall)

The firewall is not turned off

Turn off the firewall service.

systemctl disable firewalld.service
systemctl stop firewalld.service
CheckMaxHandle (check the maximum number of file handles)

The maximum number of file handles is less than 1000000

Set the soft and hard limits of the maximum number of file handles in 91-nofile.conf/90-nofile.conf to 1000000.

gs_checkos -i B2
CheckNTPD (check time synchronization service)

The NTPD service is not enabled or the time error exceeds one minute

Enable the NTPD service and set the clock to be consistent.

CheckSysParams (check operating system parameters)

The operating system parameter settings do not meet the requirements

Set the parameters with gs_checkos or edit them manually.

gs_checkos -i B1
vim /etc/sysctl.conf
CheckTHP (check THP service)

THP service is not enabled

Use gs_checkos to set up the THP service.

gs_checkos -i B6
CheckTimeZone (check time zone)

Inconsistent time zone

Set each node to the same time zone.

cp /usr/share/zoneinfo/$primary_timezone/$secondary_timezone /etc/localtime
CheckCPU (check CPU)

CPU usage is too high or IO waiting time is too high

Perform CPU configuration upgrades or disk performance upgrades.

CheckSshdService (check SSHD service)

The SSHD service is not enabled

Start the SSHD service and write the command to the startup file.

service sshd start
echo "service sshd start" >> initFile
CheckSshdConfig (check SSHD configuration)

SSHD service configuration error

Configure the SSHD service as follows:

PasswordAuthentication=no;
MaxStartups=1000;
UseDNS=yes;
ClientAliveInterval=10800/ClientAliveInterval=0
and restart the service:

service sshd restart
CheckCrondService (check Crond service)

Crond service not started

Install the Crond service and enable it.

CheckStack (check stack size)

Stack size is less than 3072

Use gs_checkos to set the stack size to 3072, then restart any process whose stack size is too small.

gs_checkos -i B2
CheckSysPortRange (check system port settings)

The system IP port range is not within the expected range, or the openGauss port falls within the system IP port range

Set the system IP port range parameter to 26000-65535, and set the openGauss port outside the system IP port range.

vim /etc/sysctl.conf
CheckMemInfo (check memory information)

The memory size of each node is inconsistent

Use physical memory of the same specification.

CheckHyperThread (check hyperthreading)

CPU hyperthreading is not enabled

Enable CPU hyperthreading.

CheckTableSpace (check table space)

The table space path and the openGauss path are nested or the table space paths are nested with each other

Migrate tablespace data to a tables
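
Several rows of Table 1 point to a gs_checkos item (B1 to B6) as the repair. As a convenience, those rows can be collected into a lookup; the Python sketch below copies the pairs from the table above, while the names GS_CHECKOS_FIX and suggest_fix are hypothetical. It only returns the command text and does not run anything.

```python
# Check item -> gs_checkos item, copied from the rows of Table 1 above.
GS_CHECKOS_FIX = {
    "CheckSysParams": "B1",
    "CheckMaxHandle": "B2",
    "CheckStack": "B2",
    "CheckBlockdev": "B3",
    "CheckLogicalBlock": "B4",
    "CheckIOrequestqueue": "B4",
    "CheckMaxAsyIOrequests": "B4",
    "CheckIOConfigure": "B4",
    "CheckRXTX": "B5",
    "CheckTHP": "B6",
}


def suggest_fix(check_item):
    """Return the gs_checkos command listed for a check item, or None."""
    code = GS_CHECKOS_FIX.get(check_item)
    if code is None:
        # The table lists a manual procedure (or nothing) for this item.
        return None
    return f"gs_checkos -i {code}"


if __name__ == "__main__":
    for name in ("CheckRXTX", "CheckIOrequestqueue", "CheckPing"):
        print(name, "->", suggest_fix(name))
```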
