Help: MariaDB 10.11.6 Galera, a single failed node hangs on startup

Background: mariadb Ver 15.1 Distrib 10.11.6-MariaDB, running as a Galera cluster with three nodes: Node1: 192.168.18.78, Node2: 192.168.18.79, Node3: 192.168.18.80.

Node1 was restarted after a one-hour power outage. After I ran systemctl start mariadb, the startup hung for a long time (I left it running for 6 hours) and the node still had not recovered.

The Galera configuration on Node1 is as follows:

[mysqld]
event_scheduler=ON
bind-address=0.0.0.0

# Galera provider configuration
wsrep_on=ON
wsrep_provider=/usr/lib64/galera/libgalera_smm.so

# Galera cluster configuration
wsrep_cluster_name="hy_galera_cluster"
wsrep_cluster_address="gcomm://192.168.18.78,192.168.18.79,192.168.18.80"

# Galera node configuration
wsrep_node_address="192.168.18.78"
wsrep_node_name="data-server"

# SST method selection
wsrep_sst_method=rsync

# InnoDB Configuration
default_storage_engine=InnoDB
innodb_autoinc_lock_mode=2
binlog_format=ROW

The log output during startup is as follows:

240403 05:05:09 mysqld_safe Starting mariadbd daemon with databases from /var/lib/mysql
240403 05:05:09 mysqld_safe WSREP: Running position recovery with --disable-log-error  --pid-file='/var/lib/mysql/data-server-recover.pid'
240403 05:05:09 mysqld_safe WSREP: Recovered position 20c1183c-e5c5-11ee-9129-97e9406cb3f8:7183126
2024-04-03  5:05:10 0 [Note] Starting MariaDB 10.11.6-MariaDB source revision fecd78b83785d5ae96f2c6ff340375be803cd299 as process 233407
2024-04-03  5:05:10 0 [Note] WSREP: Loading provider /usr/lib64/galera/libgalera_smm.so initial position: 20c1183c-e5c5-11ee-9129-97e9406cb3f8:7183126
2024-04-03  5:05:10 0 [Note] WSREP: wsrep_load(): loading provider library '/usr/lib64/galera/libgalera_smm.so'
2024-04-03  5:05:10 0 [Note] WSREP: wsrep_load(): Galera 26.4.16(rXXXX) by Codership Oy <info@codership.com> loaded successfully.
2024-04-03  5:05:10 0 [Note] WSREP: Initializing allowlist service v1
2024-04-03  5:05:10 0 [Note] WSREP: CRC-32C: using 64-bit x86 acceleration.
2024-04-03  5:05:10 0 [Note] WSREP: Found saved state: 00000000-0000-0000-0000-000000000000:-1, safe_to_bootstrap: 0
2024-04-03  5:05:10 0 [Note] WSREP: GCache DEBUG: opened preamble:
Version: 2
UUID: 20c1183c-e5c5-11ee-9129-97e9406cb3f8
Seqno: -1 - -1
Offset: -1
Synced: 0
2024-04-03  5:05:10 0 [Note] WSREP: Recovering GCache ring buffer: version: 2, UUID: 20c1183c-e5c5-11ee-9129-97e9406cb3f8, offset: -1
2024-04-03  5:05:10 0 [Note] WSREP: GCache::RingBuffer initial scan...  0.0% (        0/134217752 bytes) complete.
2024-04-03  5:05:10 0 [Note] WSREP: GCache::RingBuffer initial scan...100.0% (134217752/134217752 bytes) complete.
2024-04-03  5:05:10 0 [Note] WSREP: Recovering GCache ring buffer: Recovery failed, need to do full reset.
2024-04-03  5:05:10 0 [Note] WSREP: Passing config to GCS: base_dir = /var/lib/mysql/; base_host = 192.168.18.78; base_port = 4567; cert.log_conflicts = no; cert.optimistic_pa = yes; debug = no; evs.auto_evict = 0; evs.delay_margin = PT1S; evs.delayed_keep_period = PT30S; evs.inactive_check_period = PT0.5S; evs.inactive_timeout = PT15S; evs.join_retrans_period = PT1S; evs.max_install_timeouts = 3; evs.send_window = 4; evs.stats_report_period = PT1M; evs.suspect_timeout = PT5S; evs.user_send_window = 2; evs.view_forget_timeout = PT24H; gcache.dir = /var/lib/mysql/; gcache.keep_pages_size = 0; gcache.keep_plaintext_size = 128M; gcache.mem_size = 0; gcache.name = galera.cache; gcache.page_size = 128M; gcache.recover = yes; gcache.size = 128M; gcomm.thread_prio = ; gcs.fc_debug = 0; gcs.fc_factor = 1.0; gcs.fc_limit = 16; gcs.fc_master_slave = no; gcs.fc_single_primary = no; gcs.max_packet_size = 64500; gcs.max_throttle = 0.25; gcs.recv_q_hard_limit = 9223372036854775807; gcs.recv_q_soft_limit = 0.25; gcs.sync_donor = no; gmcast.segment = 0
2024-04-03  5:05:10 0 [Note] WSREP: Start replication
2024-04-03  5:05:10 0 [Note] WSREP: Connecting with bootstrap option: 0
2024-04-03  5:05:10 0 [Note] WSREP: Setting GCS initial position to 00000000-0000-0000-0000-000000000000:-1
2024-04-03  5:05:10 0 [Note] WSREP: protonet asio version 0
2024-04-03  5:05:10 0 [Note] WSREP: Using CRC-32C for message checksums.
2024-04-03  5:05:10 0 [Note] WSREP: backend: asio
2024-04-03  5:05:10 0 [Note] WSREP: gcomm thread scheduling priority set to other:0 
2024-04-03  5:05:10 0 [Note] WSREP: access file(/var/lib/mysql//gvwstate.dat) failed(No such file or directory)
2024-04-03  5:05:10 0 [Note] WSREP: restore pc from disk failed
2024-04-03  5:05:10 0 [Note] WSREP: GMCast version 0
2024-04-03  5:05:10 0 [Note] WSREP: (b0bc65f1-8af3, 'tcp://0.0.0.0:4567') listening at tcp://0.0.0.0:4567
2024-04-03  5:05:10 0 [Note] WSREP: (b0bc65f1-8af3, 'tcp://0.0.0.0:4567') multicast: , ttl: 1
2024-04-03  5:05:10 0 [Note] WSREP: EVS version 1
2024-04-03  5:05:10 0 [Note] WSREP: gcomm: connecting to group 'hy_galera_cluster', peer '192.168.18.78:,192.168.18.79:,192.168.18.80:'
2024-04-03  5:05:10 0 [Note] WSREP: (b0bc65f1-8af3, 'tcp://0.0.0.0:4567') Found matching local endpoint for a connection, blacklisting address tcp://192.168.18.78:4567
2024-04-03  5:05:10 0 [Note] WSREP: (b0bc65f1-8af3, 'tcp://0.0.0.0:4567') connection established to e1facb37-96cc tcp://192.168.18.80:4567
2024-04-03  5:05:10 0 [Note] WSREP: (b0bc65f1-8af3, 'tcp://0.0.0.0:4567') turning message relay requesting on, nonlive peers: 
2024-04-03  5:05:10 0 [Note] WSREP: (b0bc65f1-8af3, 'tcp://0.0.0.0:4567') connection established to e8ab0109-98a4 tcp://192.168.18.79:4567
2024-04-03  5:05:10 0 [Note] WSREP: EVS version upgrade 0 -> 1
2024-04-03  5:05:10 0 [Note] WSREP: declaring e1facb37-96cc at tcp://192.168.18.80:4567 stable
2024-04-03  5:05:10 0 [Note] WSREP: declaring e8ab0109-98a4 at tcp://192.168.18.79:4567 stable
2024-04-03  5:05:10 0 [Note] WSREP: PC protocol upgrade 0 -> 1
2024-04-03  5:05:10 0 [Note] WSREP: Node e1facb37-96cc state prim
2024-04-03  5:05:10 0 [Note] WSREP: view(view_id(PRIM,b0bc65f1-8af3,46) memb {
    b0bc65f1-8af3,0
    e1facb37-96cc,0
    e8ab0109-98a4,0
} joined {
} left {
} partitioned {
})
2024-04-03  5:05:10 0 [Note] WSREP: save pc into disk
2024-04-03  5:05:10 0 [Note] WSREP: gcomm: connected
2024-04-03  5:05:10 0 [Note] WSREP: Changing maximum packet size to 64500, resulting msg size: 32636
2024-04-03  5:05:10 0 [Note] WSREP: Shifting CLOSED -> OPEN (TO: 0)
2024-04-03  5:05:10 0 [Note] WSREP: Opened channel 'hy_galera_cluster'
2024-04-03  5:05:10 0 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = no, my_idx = 0, memb_num = 3
2024-04-03  5:05:10 0 [Note] WSREP: STATE_EXCHANGE: sent state UUID: b108e94c-f134-11ee-ac13-321fb976ab0c
2024-04-03  5:05:10 1 [Note] WSREP: Starting rollbacker thread 1
2024-04-03  5:05:10 2 [Note] WSREP: Starting applier thread 2
2024-04-03  5:05:10 0 [Note] WSREP: STATE EXCHANGE: sent state msg: b108e94c-f134-11ee-ac13-321fb976ab0c
2024-04-03  5:05:10 0 [Note] WSREP: STATE EXCHANGE: got state msg: b108e94c-f134-11ee-ac13-321fb976ab0c from 0 (data-server)
2024-04-03  5:05:10 0 [Note] WSREP: STATE EXCHANGE: got state msg: b108e94c-f134-11ee-ac13-321fb976ab0c from 1 (web02-server)
2024-04-03  5:05:10 0 [Note] WSREP: STATE EXCHANGE: got state msg: b108e94c-f134-11ee-ac13-321fb976ab0c from 2 (web01-server)
2024-04-03  5:05:10 0 [Note] WSREP: Quorum results:
    version    = 6,
    component  = PRIMARY,
    conf_id    = 44,
    members    = 2/3 (joined/total),
    act_id     = 7339907,
    last_appl. = 7339849,
    protocols  = 2/10/4 (gcs/repl/appl),
    vote policy= 0,
    group UUID = 20c1183c-e5c5-11ee-9129-97e9406cb3f8
2024-04-03  5:05:10 0 [Note] WSREP: Flow-control interval: [28, 28]
2024-04-03  5:05:10 0 [Note] WSREP: Shifting OPEN -> PRIMARY (TO: 7339908)
2024-04-03  5:05:10 2 [Note] WSREP: ####### processing CC 7339908, local, ordered
2024-04-03  5:05:10 2 [Note] WSREP: Process first view: 20c1183c-e5c5-11ee-9129-97e9406cb3f8 my uuid: b0bc65f1-f134-11ee-8af3-66b2cec80bb4
2024-04-03  5:05:10 2 [Note] WSREP: Server data-server connected to cluster at position 20c1183c-e5c5-11ee-9129-97e9406cb3f8:7339908 with ID b0bc65f1-f134-11ee-8af3-66b2cec80bb4
2024-04-03  5:05:10 2 [Note] WSREP: Server status change disconnected -> connected
2024-04-03  5:05:10 2 [Note] WSREP: ####### My UUID: b0bc65f1-f134-11ee-8af3-66b2cec80bb4
2024-04-03  5:05:10 2 [Note] WSREP: Cert index reset to 00000000-0000-0000-0000-000000000000:-1 (proto: 10), state transfer needed: yes
2024-04-03  5:05:10 0 [Note] WSREP: Service thread queue flushed.
2024-04-03  5:05:10 2 [Note] WSREP: ####### Assign initial position for certification: 00000000-0000-0000-0000-000000000000:-1, protocol version: -1
2024-04-03  5:05:10 2 [Note] WSREP: State transfer required: 
    Group state: 20c1183c-e5c5-11ee-9129-97e9406cb3f8:7339908
    Local state: 00000000-0000-0000-0000-000000000000:-1
2024-04-03  5:05:10 2 [Note] WSREP: Server status change connected -> joiner
2024-04-03  5:05:10 0 [Note] WSREP: Joiner monitor thread started to monitor
2024-04-03  5:05:10 0 [Note] WSREP: Running: 'wsrep_sst_rsync --role 'joiner' --address '192.168.18.78' --datadir '/var/lib/mysql/' --parent 233407 --progress 0 --mysqld-args --basedir=/usr --datadir=/var/lib/mysql --plugin-dir=/usr/lib64/mariadb/plugin --user=mysql --wsrep_on=ON --wsrep_provider=/usr/lib64/galera/libgalera_smm.so --log-error=/data/log/mariadb/mariadb.log --pid-file=/run/mariadb/mariadb.pid --socket=/var/lib/mysql/mysql.sock --wsrep_start_position=20c1183c-e5c5-11ee-9129-97e9406cb3f8:7183126'
WSREP_SST: [INFO] rsync SST started on joiner (20240403 05:05:10.645)
2024-04-03  5:05:11 2 [Note] WSREP: ####### IST uuid:00000000-0000-0000-0000-000000000000 f: 0, l: 7339908, STRv: 3
2024-04-03  5:05:11 2 [Note] WSREP: IST receiver addr using tcp://192.168.18.78:4568
2024-04-03  5:05:11 2 [Note] WSREP: Prepared IST receiver for 0-7339908, listening at: tcp://192.168.18.78:4568
2024-04-03  5:05:11 0 [Note] WSREP: Member 0.0 (data-server) requested state transfer from '*any*'. Selected 1.0 (web02-server)(SYNCED) as donor.
2024-04-03  5:05:11 0 [Note] WSREP: Shifting PRIMARY -> JOINER (TO: 7339908)
2024-04-03  5:05:11 2 [Note] WSREP: Requesting state transfer: success, donor: 1
2024-04-03  5:05:11 2 [Note] WSREP: Resetting GCache seqno map due to different histories.
2024-04-03  5:05:11 2 [Note] WSREP: GCache history reset: 20c1183c-e5c5-11ee-9129-97e9406cb3f8:0 -> 20c1183c-e5c5-11ee-9129-97e9406cb3f8:7339908
2024-04-03  5:05:13 0 [Note] WSREP: (b0bc65f1-8af3, 'tcp://0.0.0.0:4567') turning message relay requesting off

I have run the start command several times. Each time it hangs at this point and never completes a normal startup. While it hangs, the entire cluster cannot accept writes. As soon as I forcibly kill the start command, writes to the cluster work normally again.
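To make the log easier to read: mysqld_safe recovered a position for Node1, but the provider then reports a saved state of 00000000-0000-0000-0000-000000000000:-1 and requests a state transfer with rsync. The snippet below is only a rough sketch for comparing the recovered position with the group state reported in the log. The function name seqno_gap is my own and is not part of MariaDB or Galera, and the two input lines are copied from the log above.

```python
import re

# Two lines copied verbatim from the startup log above.
LOG = """\
240403 05:05:09 mysqld_safe WSREP: Recovered position 20c1183c-e5c5-11ee-9129-97e9406cb3f8:7183126
    Group state: 20c1183c-e5c5-11ee-9129-97e9406cb3f8:7339908
"""

# A Galera position is written as <cluster UUID>:<seqno>.
GTID = r"([0-9a-f-]{36}):(-?\d+)"


def seqno_gap(log_text):
    """Return (same_uuid, gap) for the recovered position vs. the group state.

    Hypothetical helper for illustration only; it just parses log text.
    """
    recovered = re.search(r"Recovered position " + GTID, log_text)
    group = re.search(r"Group state: " + GTID, log_text)
    if recovered is None or group is None:
        raise ValueError("log does not contain both positions")
    same_uuid = recovered.group(1) == group.group(1)
    gap = int(group.group(2)) - int(recovered.group(2))
    return same_uuid, gap


same_uuid, gap = seqno_gap(LOG)
print(same_uuid, gap)
```

For this log the cluster UUIDs match and the group state is 156782 sequence numbers ahead of the position recovered on Node1.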
