Cong Li

Troubleshooting GBase 8c Database Scaling Failure

GBase 8c is a multi-mode, distributed database that supports horizontal scaling, allowing for expansion and contraction operations. This article outlines how to troubleshoot issues when scaling operations fail.

Normally, when the gha_ctl tool is used for scaling, two "Success" messages are expected. The first Success covers the pre-checks: whether the input parameters are correct and whether the data directory passes its check. If this step fails, you can correct the parameters and try again. The second message is produced by the subprocesses that carry out the actual expansion or contraction, and it can take some time to appear.
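For illustration, here is a minimal sketch of an invocation and its two-stage output; the group name, host, port, directory, and UUID are placeholders, and the flag syntax mirrors the commands shown in the case studies later in this article:

# First response: returned almost immediately once the pre-checks pass
./gha_ctl expand datanode 'dnX (dnX_1 <ip> <port> <data_dir> <agent_port>)' -l $dcslist -u <uuid>
{
    "ret": 0,
    "msg": "Success"
}
# Second response: printed later, once the background expansion/contraction subprocesses finish (or fail)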

1. Overview of Scaling Operations

Expansion

When expanding, the operation progresses through several phases. You can check the phase of failure by running the command gha_ctl get expand history -l $dcslist. The phases are:

  • add_primary: Initialize the DN node and add the DN host.
  • prepare: Check the node group to be expanded, create the target node group, and set the source and target node groups for expansion.
  • execute: Use the gs_redis tool to redistribute data.
  • add_standby: Add the standby DN node.
  • clean_data: Change the expansion status to "end."
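To see which phase a run stopped in, query the scaling history and read the "phase" and "status" fields of the latest entry; a minimal sketch, with placeholder DCS endpoints:

dcslist=http://<dcs1>:2379,http://<dcs2>:2379,http://<dcs3>:2379
# Each entry in "history" records the phase reached, its status, and the instances involved
./gha_ctl get expand history -l $dcslist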

Contraction

For contraction, the process also moves through several phases, and you can check the failing phase with the same command, gha_ctl get expand history -l $dcslist. Because contraction only removes nodes without adding new ones, there are no add_primary or add_standby phases:

  • prepare: Check the node group to be contracted, create the target node group, and set the source and target node groups for contraction.
  • execute: Use the gs_redis tool to redistribute data.
  • drop_group: Drop the shrinking datanode group.
  • clean_data: Change the expansion status to "end."

2. Failure Overview and Case Studies

When failures occur in the add_primary, add_standby, or prepare phases, you should check the logs at /var/log/messages and /tmp/gha_ctl/gha_ctl.log on the gha_server (or the primary gha_server in multi-server setups). On the added DN nodes, check the logs under $GAUSSLOG/gbase/om/gs_expansion***.
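A minimal sketch of how these logs can be scanned from the shell; the grep patterns are only a starting point, and the timestamped suffix of the gs_expansion log is represented by a glob:

# On the (primary) gha_server
grep -iE 'error|fail' /tmp/gha_ctl/gha_ctl.log | tail -n 50
sudo tail -n 100 /var/log/messages

# On the DN host being added: list the expansion logs and scan them
ls -lt $GAUSSLOG/gbase/om/ | grep gs_expansion
grep -iE 'error|fail' $GAUSSLOG/gbase/om/gs_expansion*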

If a failure occurs during the execute phase, error messages will often state that "gs_redis failed on...". In this case, check the gs_redis logs, which can be found in the $GAUSSLOG/bin/gs_redis directory on one of the CN nodes. Additionally, check the CN node's pg_log directory for more detailed error information.
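A minimal sketch for locating these logs on the CN host; the pg_log directory location varies by deployment, so the path below is a placeholder:

# On the CN host: inspect the most recent gs_redis logs
ls -lt $GAUSSLOG/bin/gs_redis/ | head
grep -iE 'error|fatal' $GAUSSLOG/bin/gs_redis/* | tail -n 50

# The CN's pg_log usually carries the matching server-side error detail
grep -iE 'error|fatal' <cn_pg_log_dir>/*.log | tail -n 50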

(1) Case Study 1

The following issue occurred:

[gbase@gbase8c-82 script]$ ./gha_ctl expand datanode 'dn4 (dn4_1 100.0.0.84 30010 /home/gbase/data/dn4/dn4_1 8020)' -l http://100.0.0.82:2379,http://100.0.0.83:2379,http://100.0.0.84:2379 -u 40ac7d83-6be3-486c-83c4-8942a16d3590
{
    "ret": 0,
    "msg": "Success"
}
[gbase@gbase8c-82 script]$ {
    "ret": -1,
    "msg": "Init fail"
}

Troubleshooting Steps:

First, check which phase the failure occurred in:

[gbase@gbase8c-82 script]$ ./gha_ctl get expand history -l http://100.0.0.82:2379,http://100.0.0.83:2379,http://100.0.0.84:2379
{
    "state": "idle",
    "current": "",
    "history": [
        {
            "time": "2022-12-29 10:27:59",
            "uuid": "40ac7d83-6be3-486c-83c4-8942a16d3590",
            "phase": "add_primary",
            "status": "failed",
            "info": {
                "dn4": [
                    {
                        "name": "dn4_1",
                        "host": "100.0.0.84",
                        "port": "30010",
                        "work_dir": "/home/gbase/data/dn4/dn4_1",
                        "agent_port": "8020",
                        "role": "primary",
                        "agent_host": "100.0.0.84"
                    }
                ]
            }
        }
    ]
}

The failure occurred during the "add_primary" phase. The gs_expansion*** log on node 84 showed no errors, but /tmp/gha_ctl/gha_ctl.log on the gha_server revealed the following:

2022-12-29 10:28:04 gaussdb.py expansion 89 DEBUG 345309 Execute expansion command in [100.0.0.84]: source ~/.bashrc;gs_expansion -U gbase -G gbase -X /tmp/gs_gha_2022-12-29_10:28:02_796027/clusterconfig.xml -h 100.0.0.84 --from-gha --inst-name dn4_1 --group-name dn4
2022-12-29 10:28:08 command_util.py execute 249 DEBUG 345309 cmd:ssh -E /dev/null -p 22 gbase@100.0.0.84 "source ~/.bashrc;gs_expansion -U gbase -G gbase -X /tmp/gs_gha_2022-12-29_10:28:02_796027/clusterconfig.xml -h 100.0.0.84 --from-gha --inst-name dn4_1 --group-name dn4", status:1, output:[GAUSS-51100] : Failed to verify SSH trust on these nodes:
gbase8c-82, gbase8c-83, gbase8c-84, 100.0.0.82, 100.0.0.83, 100.0.0.84 by individual user.
2022-12-29 10:28:08 instance.py init 1614 INFO 345309 Node dn4_1 init error:Failed to execute the command: source ~/.bashrc;gs_expansion -U gbase -G gbase -X /tmp/gs_gha_2022-12-29_10:28:02_796027/clusterconfig.xml -h 100.0.0.84 --from-gha --inst-name dn4_1 --group-name dn4. Error:
Run cmd failed:cmd[ssh -E /dev/null -p 22 gbase@100.0.0.84 "source ~/.bashrc;gs_expansion -U gbase -G gbase -X /tmp/gs_gha_2022-12-29_10:28:02_796027/clusterconfig.xml -h 100.0.0.84 --from-gha --inst-name dn4_1 --group-name dn4"], msg[[GAUSS-51100] : Failed to verify SSH trust on these nodes:
gbase8c-82, gbase8c-83, gbase8c-84, 100.0.0.82, 100.0.0.83, 100.0.0.84 by individual user.]
2022-12-29 10:28:08 common.py add_one_node 190 ERROR 345309 init one node dn4_1 failed, code: -1, response: Init fail

The issue was caused by SSH trust not being configured between the nodes. After configuring SSH trust, the expansion operation succeeded.
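For reference, here is a minimal manual sketch of establishing SSH trust for the gbase user across the hosts named in the error above; your installation may provide its own tooling for mutual trust, so treat this as a generic illustration rather than the required procedure:

# Run as the gbase user on each of the three hosts
ssh-keygen -t rsa -N '' -f ~/.ssh/id_rsa          # skip if a key already exists
for host in gbase8c-82 gbase8c-83 gbase8c-84 100.0.0.82 100.0.0.83 100.0.0.84; do
    ssh-copy-id gbase@$host
done
# Verify: this should return the remote hostname without prompting for a password
ssh gbase@100.0.0.84 hostname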

(2) Case Study 2

The following issue occurred:

[gbase@gbase8c-82 script]$ ./gha_ctl expand datanode 'dn4 (dn4_1 100.0.0.84 30010 /home/gbase/data/dn4/dn4_1 8020)' -l http://100.0.0.82:2379,http://100.0.0.83:2379,http://100.0.0.84:2379 -u 40ac7d83-6be3-486c-83c4-8942a16d3590
{
    "ret": 0,
    "msg": "Success"
}
[gbase@gbase8c-82 script]$ {
    "ret": -1,
    "msg": "gs_redis on cn1 failed"
}

Troubleshooting Steps:

Based on the error message, the failure occurred while executing gs_redis. Checking the gs_redis log on cn1 at $GAUSSLOG/bin/gs_redis, we found the following:

tid[392445]: INFO: redistributing database "postgres"
tid[392445]: INFO: lock schema postgres.public
INFO: please do not close this session until you are done adding the new node
CONTEXT: referenced column: pgxc_lock_for_transfer
tid[392445]: INFO: redistributing table "spatial_ref_sys"
tid[392445]: INFO: ---- 1. setup table spatial_ref_sys ----
tid[392445]: ERROR: query failed: ERROR: dn4: relation "public.spatial_ref_sys" does not exist
DETAIL: query was: ALTER TABLE public.spatial_ref_sys SET (append_mode=on,rel_cn_oid =17324)

We logged into dn4 and found that the postgres database indeed did not have the public.spatial_ref_sys table.
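A minimal sketch for confirming the missing relation on dn4 (the port 30010 comes from the expansion command above; whether a direct DN connection is allowed, and the exact gsql options, depend on your deployment):

# On the dn4 host, check whether public.spatial_ref_sys exists in the postgres database
gsql -d postgres -p 30010 -c "SELECT count(*) FROM pg_tables WHERE schemaname = 'public' AND tablename = 'spatial_ref_sys';"

If the relation exists on the CN but not on the new DN, recreating it on dn4 with the same definition (or dropping and recreating the object consistently across the cluster) before rerunning the scaling operation is a reasonable follow-up; confirm the appropriate fix for your environment.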
