Troubleshooting an unexpected node crash in Galera Cluster

Background

Before Group Replication was released, official MySQL replication offered only asynchronous and semi-synchronous modes. Most companies that needed fully synchronous replication therefore chose Galera Cluster, mainly in the form of Percona XtraDB Cluster (PXC) or MariaDB Galera Cluster (MGC), each of which embeds Galera in the vendor's own distribution. This article presents a case in which a node went down while a customer was running Galera Cluster (MGC) in production.

Environment information

  • MariaDB 10.0.15
  • Red Hat 6.5

Log information

Node 2 (normal) log:
190308 17:08:43 [Note] WSREP: Member 0.0 (node23) requested state transfer from '*any*'. Selected 1.0 (node144)(SYNCED) as donor.
190308 17:08:43 [Note] WSREP: Shifting SYNCED -> DONOR/DESYNCED (TO: 397258687)
190308 17:08:43 [Note] WSREP: IST request: a6befc67-f455-11e6-a8e6-fa93a785f2f6:397258655-397258656|tcp://21.244.57.46:4568
190308 17:08:43 [Note] WSREP: IST first seqno 397258656 not found from cache, falling back to SST
190308 17:08:43 [Warning] WSREP: SST request is null, SST canceled.
Node 3 (crashed) log:
190308 17:08:43 [Note] WSREP: Shifting PRIMARY -> JOINER (TO: 397258687)
190308 17:08:43 [Note] WSREP: Requesting state transfer: success after 2 tries, donor: 1
190308 17:08:43 [Note] WSREP: GCache DEBUG: RingBuffer::seqno_reset(): discarded 0 bytes
190308 17:08:43 [Note] WSREP: GCache DEBUG: RingBuffer::seqno_reset(): found 1/31 locked buffers
190308 17:08:43 [Warning] WSREP: 1.0 (node144): State transfer to 0.0 (node23) failed: -125 (Operation canceled)
190308 17:08:43 [ERROR] WSREP: gcs/src/gcs_group.c:gcs_group_handle_join_msg():723: Will never receive state. Need to abort.

Conclusion:

  • The MySQL instance on node 2 acted as the donor for node 3. The node 2 log shows that the range requested by node 3 (a6befc67-f455-11e6-a8e6-fa93a785f2f6:397258655-397258656) was no longer in the donor's gcache: the first seqno needed, 397258656, could not be found. IST therefore failed, the donor fell back to SST, and the SST was canceled because the SST request was null, so node 3 aborted. The only way to recover is to restart the MySQL instance on node 3 so that it rejoins the cluster with a full synchronization through SST.
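
The donor's decision described above can be illustrated with a minimal Python sketch. It is not Galera's actual implementation: the function names are hypothetical, and the lowest cached seqno is a made-up value because the logs do not show it. On Galera versions that expose it, the wsrep_local_cached_downto status variable reports that value; whether it is available on this MariaDB 10.0.15 build is an assumption.

```python
import re

# Matches the donor-side message shown in the node 2 log.
IST_MISS = re.compile(r"IST first seqno (\d+) not found from cache")


def first_missing_seqno(log_line):
    """Return the seqno the donor could not find in gcache, or None."""
    match = IST_MISS.search(log_line)
    return int(match.group(1)) if match else None


def can_serve_ist(first_needed, lowest_cached):
    """IST is possible only if the donor still caches the first needed seqno."""
    return lowest_cached <= first_needed


donor_line = ("190308 17:08:43 [Note] WSREP: IST first seqno 397258656 "
              "not found from cache, falling back to SST")
needed = first_missing_seqno(donor_line)

# Hypothetical value: the real lowest cached seqno is not shown in the logs.
lowest_cached = 397258670

print(needed, can_serve_ist(needed, lowest_cached))
```

With these inputs the check fails, which corresponds to the "falling back to SST" message in the donor log.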

Recommendation:

  • Increase the gcache.size parameter so that gcache retains more write sets. A node that has been out of the cluster for longer can then still catch up through IST instead of requiring SST.
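
One common way to choose a value is to measure how many bytes of write sets the cluster produces per second and multiply by the longest outage that IST should cover. The Python sketch below shows that arithmetic under stated assumptions: the counters are two samples of the wsrep_replicated_bytes and wsrep_received_bytes status variables, and every number in it, including the one-hour outage window, is hypothetical rather than taken from the customer's system.

```python
def estimate_gcache_bytes(replicated_start, received_start,
                          replicated_end, received_end,
                          sample_seconds, outage_seconds):
    """Estimate the gcache size needed to cover an outage window.

    The *_start and *_end arguments are two samples of the
    wsrep_replicated_bytes and wsrep_received_bytes counters taken
    sample_seconds apart.
    """
    written = ((replicated_end - replicated_start)
               + (received_end - received_start))
    bytes_per_second = written / sample_seconds
    return int(bytes_per_second * outage_seconds)


# Hypothetical samples taken 60 seconds apart.
needed_bytes = estimate_gcache_bytes(
    replicated_start=1_000_000_000, received_start=2_000_000_000,
    replicated_end=1_030_000_000, received_end=2_030_000_000,
    sample_seconds=60,
    outage_seconds=3600,  # assumed target: tolerate a one-hour outage
)
print(needed_bytes)
```

The result would then be rounded up and set through the provider options in the server configuration, for example wsrep_provider_options="gcache.size=4G". The 4G figure is only an illustration; sample the counters during peak write load so the estimate is not too low.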

3 December 2019, 20:42 | Views: 9337
