Technology sharing | Why the AFTER consistency mode is not recommended in MGR

Source: original contribution

*Original content of the GreatSQL community may not be used without authorization. For reprints, please contact the editors and cite the source.

  • 1. Introduction
  • 2. Write consistency of AFTER
  • 3. Read consistency of AFTER
  • 4. AFTER execution process
  • 5. BEFORE execution process
  • 6. Some thoughts
  • 7. Reference documents

1. Introduction

During testing, we found that at the AFTER level a node failure prevents the cluster from committing transactions; moreover, once a transaction has entered the commit phase, the other nodes cannot even start read-only transactions. The whole cluster cannot serve requests normally until the failed node is expelled from the cluster.

The following steps reproduce the fault scenario described above:

1. Initialize a 3-node cluster. The cluster information is as follows:

    mysql> select * from performance_schema.replication_group_members;
    +---------------------------+--------------------------------------+-------------+-------------+--------------+-------------+----------------+
    | CHANNEL_NAME              | MEMBER_ID                            | MEMBER_HOST | MEMBER_PORT | MEMBER_STATE | MEMBER_ROLE | MEMBER_VERSION |
    +---------------------------+--------------------------------------+-------------+-------------+--------------+-------------+----------------+
    | group_replication_applier | c223cde5-0719-11ec-8295-ec5c6826bca3 | 127.0.0.1   |       13000 | ONLINE       | PRIMARY     | 8.0.25         |
    | group_replication_applier | c22847e0-0719-11ec-b30b-ec5c6826bca3 | 127.0.0.1   |       13004 | ONLINE       | PRIMARY     | 8.0.25         |
    | group_replication_applier | c22c3752-0719-11ec-98ef-ec5c6826bca3 | 127.0.0.1   |       13002 | ONLINE       | PRIMARY     | 8.0.25         |
    +---------------------------+--------------------------------------+-------------+-------------+--------------+-------------+----------------+
    3 rows in set (0.00 sec)

    mysql> select @@group_replication_member_expel_timeout;
    +------------------------------------------+
    | @@group_replication_member_expel_timeout |
    +------------------------------------------+
    |                                     1000 |
    +------------------------------------------+
    1 row in set (0.00 sec)

2. Create a table and insert a row at the AFTER level

    13000-conn1
    mysql> set session group_replication_consistency='AFTER';
    Query OK, 0 rows affected (0.00 sec)

    mysql> create table t1 (c1 int primary key, c2 int); 
    Query OK, 0 rows affected (0.12 sec)

    mysql> insert into t1 values (1,1); 
    Query OK, 1 row affected (0.03 sec)

3. Forcibly kill a node. Because group_replication_member_expel_timeout is set to 1000 seconds, the failed node remains in the UNREACHABLE state

    mysql> select * from performance_schema.replication_group_members;
    +---------------------------+--------------------------------------+-------------+-------------+--------------+-------------+----------------+
    | CHANNEL_NAME              | MEMBER_ID                            | MEMBER_HOST | MEMBER_PORT | MEMBER_STATE | MEMBER_ROLE | MEMBER_VERSION |
    +---------------------------+--------------------------------------+-------------+-------------+--------------+-------------+----------------+
    | group_replication_applier | c223cde5-0719-11ec-8295-ec5c6826bca3 | 127.0.0.1   |       13000 | ONLINE       | PRIMARY     | 8.0.25         |
    | group_replication_applier | c22847e0-0719-11ec-b30b-ec5c6826bca3 | 127.0.0.1   |       13004 | UNREACHABLE  | PRIMARY     | 8.0.25         |
    | group_replication_applier | c22c3752-0719-11ec-98ef-ec5c6826bca3 | 127.0.0.1   |       13002 | ONLINE       | PRIMARY     | 8.0.25         |
    +---------------------------+--------------------------------------+-------------+-------------+--------------+-------------+----------------+
    3 rows in set (0.00 sec)

4. Insert another row. The statement does not return and hangs in the commit-wait phase

    13000-conn1
    mysql> insert into t1 values (2,2);

    13000-conn2
    mysql> select time, state, info from information_schema.processlist;
    +------+--------------------------------------------------------+--------------------------------------------------------------+
    | time | state                                                  | info                                                         |
    +------+--------------------------------------------------------+--------------------------------------------------------------+
    |    0 | executing                                              | select time, state, info from information_schema.processlist |
    |  193 | waiting for handler commit                             | Group replication applier module                             |
    |  228 | Waiting on empty queue                                 | NULL                                                         |
    |  193 | Slave has read all relay log; waiting for more updates | NULL                                                         |
    |   50 | waiting for handler commit                             | insert into t1 values (2,2)                                  |
    +------+--------------------------------------------------------+--------------------------------------------------------------+
    5 rows in set (0.01 sec)

5. Log in to another surviving node; even query operations cannot be performed

13002-conn1
mysql> set @@group_replication_consistency='eventual';
Query OK, 0 rows affected (0.00 sec)

mysql> select * from t1;

13002-conn2
mysql> select time, state, info from information_schema.processlist;
+------+--------------------------------------+--------------------------------------------------------------+
| time | state                                | info                                                         |
+------+--------------------------------------+--------------------------------------------------------------+
|    0 | executing                            | select time, state, info from information_schema.processlist |
|  354 | waiting for handler commit           | Group replication applier module                             |
|  403 | Waiting on empty queue               | NULL                                                         |
|  225 | waiting for handler commit           | NULL                                                         |
|   13 | Executing hook on transaction begin. | select * from t1                                             |
+------+--------------------------------------+--------------------------------------------------------------+
5 rows in set (0.01 sec)

2. Write consistency of AFTER

The first problem revealed by the failure above is that in an MGR cluster with the transaction consistency level set to AFTER, any single-point failure makes the cluster unavailable. With default settings it usually takes 5 to 10 seconds from a node failing to its being expelled from the cluster, and during that window the entire database cannot commit write transactions. And, as in the test above, if group_replication_member_expel_timeout is set to a larger value, the cluster stays unable to work normally for even longer. For online business this would be a disaster.
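As a quick check when diagnosing such an outage, the two variables that interact here can be inspected together, and the expel timeout can be lowered at runtime. This is only a sketch: the 1000-second value earlier is what produced the long outage in the repro, and privilege requirements depend on your deployment.

    -- Inspect the current settings:
    mysql> select @@group_replication_member_expel_timeout,
                  @@group_replication_consistency;
    -- A smaller expel timeout shortens the write-unavailability window:
    mysql> set global group_replication_member_expel_timeout = 5;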

Checking the original worklog for MGR read/write consistency shows that this behavior matches the design expectation and is not a bug. The following requirement from the worklog explains the problem:

FR-08: When group_replication_consistency=AFTER or BEFORE_AND_AFTER,
       if there are unreachable members but the group still has a
       reachable majority, the transaction will wait until that
       members are reachable or leave the group.

One advantage of this behavior is that it can satisfy businesses with strict data-consistency requirements, but it is extremely unfriendly to ordinary workloads, so the AFTER mechanism is not recommended for general services. On the other hand, under non-AFTER levels a transaction's messages only have to reach a majority in memory via the Paxos protocol, with no requirement to flush to disk; so if a majority of nodes fail at the same time, there is a risk of data loss.
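Given this trade-off, one pragmatic pattern (a sketch, not prescribed by the worklog) is to leave the group at EVENTUAL and opt into AFTER only for the few transactions that genuinely need the stronger guarantee, using the same session variable the repro relies on:

    -- Ordinary traffic keeps the default (EVENTUAL) level.
    mysql> set session group_replication_consistency = 'AFTER';
    mysql> insert into t1 values (3,3);  -- critical write: returns only after
                                         -- all ONLINE members acknowledge prepare
    mysql> set session group_replication_consistency = 'EVENTUAL';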

3. Read consistency of AFTER

The second problem with the failure above is that while node 1 is in the transaction commit phase, node 2 cannot start even a read-only transaction at the EVENTUAL level. Node 2 cannot provide any read or write service until the failed node is expelled from the cluster.

This behavior, too, is explained by the worklog:

FR-06: When group_replication_consistency=AFTER or BEFORE_AND_AFTER,
       since the remote ONLINE members do acknowledge the
       transaction on prepare, the new transactions on those members
       shall be held until the preceding prepared are committed.

That is, once a transaction has entered the prepared phase on a remote node, any new transaction on that node must wait for the prepared transaction to finish committing before it can start, regardless of the new transaction's consistency level. The downside of this design is that the AFTER level reduces not only the throughput of the executing node, but also the performance of the other nodes acting as read-only nodes.
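In practice, sessions held back this way are visible in the processlist with the states already seen in the repro; a filter such as the following (reusing those state strings) can flag them:

    mysql> select id, time, state, info
             from information_schema.processlist
            where state in ('Executing hook on transaction begin.',
                            'waiting for handler commit');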

In a multi-primary deployment the performance impact can be even greater. Another problem is that if a user maliciously enables the AFTER level and runs a large transaction, the other nodes will be unable to start new transactions for a long time. This issue is also mentioned in the official worklog.

SECURITY CONTEXT
================
From a point of view of malicious attack to the group, since when
group_replication_consistency=AFTER or BEFORE_AND_AFTER a
transaction will wait for a acknowledge from all ONLINE members, a
UNREACHABLE member will block a transaction execution until that
member is reachable or leaves the group.

A malicious user can set group_replication_consistency=AFTER or
BEFORE_AND_AFTER on long lived transactions, which may block new
transactions while those long lived transactions are being applied.
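
The attack described in the worklog is easy to sketch. Assuming the t1 table from the repro, a long-lived bulk write at the AFTER level holds back new transactions on every other member while it is being applied:

    -- Sketch of the scenario the worklog warns about; do not run in production.
    mysql> set session group_replication_consistency = 'AFTER';
    mysql> insert into t1 select c1 + 1000000, c2 from t1;  -- stands in for a large,
                                                            -- long-lived transaction
    -- While this is prepared but not yet committed on remote members, their new
    -- transactions stall in "Executing hook on transaction begin.".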

4. AFTER execution process

First, the process on the transaction execution node is as follows:

  1. When a transaction enters the commit phase, a before_commit hook is executed. In MGR the corresponding implementation is group_replication_trans_before_commit, and the AFTER consistency guarantee is realized through this interface.
  2. Assuming transaction T1 executes on node M1 at the AFTER level, a Transaction_with_guarantee_message carrying the full transaction data is sent through Paxos; its message type is CT_TRANSACTION_WITH_GUARANTEE_MESSAGE.
  3. When a node receives and processes this message, it first obtains the current ONLINE members of the cluster. Note that even a node whose state has changed to UNREACHABLE is still counted among the ONLINE members as long as it has not been expelled from the cluster.
  4. Node M1 then waits for message feedback from the other nodes.
  5. Only after M1 has received prepared messages from every node in the ONLINE members obtained above can the commit proceed.

Next, let's look at how the other nodes (taking node M2 as an example) process an AFTER transaction:

  1. First, Paxos delivers the transaction and the node enters the transaction-apply phase.
  2. When transaction T1 enters the commit phase on M2, the before_commit hook is invoked. Unlike the user thread on M1, on M2 this runs in the replication applier thread of the group_replication_applier channel.
  3. The transaction is added to the prepared-transaction list.
  4. A Transaction_prepared message is sent to all nodes, and M2 waits for it to be processed.
  5. After Transaction_prepared acknowledgements have been received from the other nodes, the transaction is removed from the prepared-transaction list and the commit continues.

In AFTER mode, every node must send a Transaction_prepared message while processing the transaction and wait for acknowledgements from all nodes before the transaction executed by the user thread can commit successfully. Even setting aside the time the user thread spends waiting for all nodes to commit, the network overhead of this message exchange has a measurable impact on performance; a way to observe the in-flight load is sketched below.
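One rough way to watch these in-flight transactions from SQL (a monitoring sketch; see the MySQL manual for the exact column semantics) is the per-member counters in performance_schema.replication_group_member_stats:

    mysql> select member_id,
                  count_transactions_in_queue,
                  count_transactions_local_proposed,
                  count_transactions_remote_in_applier_queue
             from performance_schema.replication_group_member_stats;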

Another thing to note is that if transaction T1 has not yet entered the prepared phase on node M2, starting a new transaction there does not block. With a DEBUG build, this can be verified through the following steps.

Connect to 13000:
mysql> select * from performance_schema.replication_group_members;
+---------------------------+--------------------------------------+-------------+-------------+--------------+-------------+----------------+
| CHANNEL_NAME              | MEMBER_ID                            | MEMBER_HOST | MEMBER_PORT | MEMBER_STATE | MEMBER_ROLE | MEMBER_VERSION |
+---------------------------+--------------------------------------+-------------+-------------+--------------+-------------+----------------+
| group_replication_applier | 1ca5023b-0a1d-11ec-82f9-c8f7507e5048 | 127.0.0.1   |       13000 | ONLINE       | PRIMARY     | 8.0.25         |
| group_replication_applier | 1cab1e2b-0a1d-11ec-9eb9-c8f7507e5048 | 127.0.0.1   |       13004 | ONLINE       | PRIMARY     | 8.0.25         |
| group_replication_applier | 1caf096c-0a1d-11ec-a241-c8f7507e5048 | 127.0.0.1   |       13002 | ONLINE       | PRIMARY     | 8.0.25         |
| group_replication_applier | 1cbc3cf7-0a1d-11ec-955d-c8f7507e5048 | 127.0.0.1   |       13006 | ONLINE       | PRIMARY     | 8.0.25         |
| group_replication_applier | 1cc6f5eb-0a1d-11ec-8e81-c8f7507e5048 | 127.0.0.1   |       13008 | ONLINE       | PRIMARY     | 8.0.25         |
+---------------------------+--------------------------------------+-------------+-------------+--------------+-------------+----------------+
5 rows in set (0.03 sec)

mysql> set session group_replication_consistency='AFTER';
Query OK, 0 rows affected (0.00 sec)

mysql> create table t1 (c1 int primary key, c2 int); 
Query OK, 0 rows affected (0.17 sec)

mysql> insert into t1 values (1,1); 
Query OK, 1 row affected (0.07 sec)

kill -9 13008
mysql> select * from performance_schema.replication_group_members;
+---------------------------+--------------------------------------+-------------+-------------+--------------+-------------+----------------+
| CHANNEL_NAME              | MEMBER_ID                            | MEMBER_HOST | MEMBER_PORT | MEMBER_STATE | MEMBER_ROLE | MEMBER_VERSION |
+---------------------------+--------------------------------------+-------------+-------------+--------------+-------------+----------------+
| group_replication_applier | 1ca5023b-0a1d-11ec-82f9-c8f7507e5048 | 127.0.0.1   |       13000 | ONLINE       | PRIMARY     | 8.0.25         |
| group_replication_applier | 1cab1e2b-0a1d-11ec-9eb9-c8f7507e5048 | 127.0.0.1   |       13004 | ONLINE       | PRIMARY     | 8.0.25         |
| group_replication_applier | 1caf096c-0a1d-11ec-a241-c8f7507e5048 | 127.0.0.1   |       13002 | ONLINE       | PRIMARY     | 8.0.25         |
| group_replication_applier | 1cbc3cf7-0a1d-11ec-955d-c8f7507e5048 | 127.0.0.1   |       13006 | ONLINE       | PRIMARY     | 8.0.25         |
| group_replication_applier | 1cc6f5eb-0a1d-11ec-8e81-c8f7507e5048 | 127.0.0.1   |       13008 | UNREACHABLE  | PRIMARY     | 8.0.25         |
+---------------------------+--------------------------------------+-------------+-------------+--------------+-------------+----------------+
5 rows in set (0.00 sec)

Connect to 13002 and use DEBUG_SYNC to keep the node from entering the prepared phase:
mysql> set global debug='+d,group_replication_before_commit_hook_wait';
Query OK, 0 rows affected (0.00 sec)

Connect to 13000 and insert a new row; the statement does not return:
mysql> insert into t1 values (2,2);

Connect to 13002; the applied transaction is blocked at the DEBUG_SYNC point:
mysql> select command, time, state, info from information_schema.processlist;
+---------+------+----------------------------+-----------------------------------------------------------------------+
| command | time | state                      | info                                                                  |
+---------+------+----------------------------+-----------------------------------------------------------------------+
| Connect |  189 | waiting for handler commit | Group replication applier module                                      |
| Query   |    0 | executing                  | select command, time, state, info from information_schema.processlist |
| Sleep   |    7 |                            | NULL                                                                  |
| Daemon  |  240 | Waiting on empty queue     | NULL                                                                  |
| Query   |   64 | debug sync point: now      | NULL                                                                  |
+---------+------+----------------------------+-----------------------------------------------------------------------+
5 rows in set (0.01 sec)

At this point, you can query the data:
mysql> set session group_replication_consistency='eventual';
Query OK, 0 rows affected (0.00 sec)

mysql> select * from t1;
+----+------+
| c1 | c2   |
+----+------+
|  1 |    1 |
+----+------+
1 row in set (0.00 sec)

5. BEFORE execution process

BEFORE is the counterpart of AFTER. If a session enables the BEFORE level, then when a transaction starts it must wait for all transactions the node has already received to be committed locally. This is implemented through WAIT_FOR_GTID_SET: at transaction start, the node obtains the set of GTIDs of all transactions it has received so far, then waits until every GTID in that set has been committed, which guarantees that the transaction reads the latest data.
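Conceptually (a sketch only; the real before hook does this internally, not through user-visible SQL), the same effect can be approximated by reading the received GTID set and waiting on it with the built-in WAIT_FOR_EXECUTED_GTID_SET() function:

    -- Which GTIDs has this member received via the group channel?
    mysql> select received_transaction_set
             from performance_schema.replication_connection_status
            where channel_name = 'group_replication_applier';
    -- Wait (up to 60 s) until that set is applied locally, then read.
    -- Replace the placeholder with the set returned above:
    mysql> select wait_for_executed_gtid_set('<received set>', 60);
    mysql> select * from t1;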

Before obtaining the GTID set, however, the node must first send a message of type SYNC_BEFORE_EXECUTION through Paxos. Because Paxos queues messages in order, once the SYNC_BEFORE_EXECUTION message has been processed, all transaction messages sent before it are guaranteed to have been processed by Paxos as well. Since this message is generated when the transaction starts, the GTID set the node holds at that point is exactly what the BEFORE level requires.

If the node did not send the SYNC_BEFORE_EXECUTION message, the BEFORE level might fail to read the latest data. Suppose a network partition splits three nodes A, B, and C so that A and B form the majority and C is in the minority. After the partition, new write transactions on A and B are no longer synchronized to C. Before C is expelled from the cluster, if C enables the BEFORE level but does not send the SYNC_BEFORE_EXECUTION message, new data cannot be read on C, which contradicts the design goal of BEFORE. With the message sent, however, consensus on it cannot be reached, so the new transaction either fails or keeps waiting for the message instead of returning stale data to the user.

The following example shows the execution behavior at the BEFORE level when a majority of nodes fail.

Initial state:
mysql> select * from performance_schema.replication_group_members; 
+---------------------------+--------------------------------------+-------------+-------------+--------------+-------------+----------------+
| CHANNEL_NAME              | MEMBER_ID                            | MEMBER_HOST | MEMBER_PORT | MEMBER_STATE | MEMBER_ROLE | MEMBER_VERSION |
+---------------------------+--------------------------------------+-------------+-------------+--------------+-------------+----------------+
| group_replication_applier | dd6398ec-09fe-11ec-95de-c8f7507e5048 | 127.0.0.1   |       13002 | ONLINE       | PRIMARY     | 8.0.25         |
| group_replication_applier | dd64424b-09fe-11ec-aeeb-c8f7507e5048 | 127.0.0.1   |       13000 | ONLINE       | PRIMARY     | 8.0.25         |
| group_replication_applier | dd65b9de-09fe-11ec-9d06-c8f7507e5048 | 127.0.0.1   |       13004 | ONLINE       | PRIMARY     | 8.0.25         |
+---------------------------+--------------------------------------+-------------+-------------+--------------+-------------+----------------+
3 rows in set (0.06 sec)

mysql> select @@group_replication_member_expel_timeout;
+------------------------------------------+
| @@group_replication_member_expel_timeout |
+------------------------------------------+
|                                       60 |
+------------------------------------------+
1 row in set (0.00 sec)

Query data:
mysql> set session group_replication_consistency='BEFORE';
Query OK, 0 rows affected (0.00 sec)

mysql> select * from t1;
+----+------+
| c1 | c2   |
+----+------+
|  2 |    2 |
+----+------+
1 row in set (0.01 sec)

Use kill -9 to simulate the failure of a majority of nodes:
mysql> select * from performance_schema.replication_group_members;
+---------------------------+--------------------------------------+-------------+-------------+--------------+-------------+----------------+
| CHANNEL_NAME              | MEMBER_ID                            | MEMBER_HOST | MEMBER_PORT | MEMBER_STATE | MEMBER_ROLE | MEMBER_VERSION |
+---------------------------+--------------------------------------+-------------+-------------+--------------+-------------+----------------+
| group_replication_applier | dd6398ec-09fe-11ec-95de-c8f7507e5048 | 127.0.0.1   |       13002 | UNREACHABLE  | PRIMARY     | 8.0.25         |
| group_replication_applier | dd64424b-09fe-11ec-aeeb-c8f7507e5048 | 127.0.0.1   |       13000 | UNREACHABLE  | PRIMARY     | 8.0.25         |
| group_replication_applier | dd65b9de-09fe-11ec-9d06-c8f7507e5048 | 127.0.0.1   |       13004 | ONLINE       | PRIMARY     | 8.0.25         |
+---------------------------+--------------------------------------+-------------+-------------+--------------+-------------+----------------+
3 rows in set (0.00 sec)

Querying the data again now never returns:
mysql> select * from t1;

Viewing the status from another client shows the query stuck in the before-hook phase:
mysql> select command, time, state, info from information_schema.processlist;
+---------+------+--------------------------------------------------------+-----------------------------------------------------------------------+
| command | time | state                                                  | info                                                                  |
+---------+------+--------------------------------------------------------+-----------------------------------------------------------------------+
| Connect |  253 | waiting for handler commit                             | Group replication applier module                                      |
| Daemon  |  318 | Waiting on empty queue                                 | NULL                                                                  |
| Query   |  238 | Slave has read all relay log; waiting for more updates | NULL                                                                  |
| Query   |  170 | Executing hook on transaction begin.                   | select * from t1                                                      |
| Query   |    0 | executing                                              | select command, time, state, info from information_schema.processlist |
+---------+------+--------------------------------------------------------+-----------------------------------------------------------------------+
5 rows in set (0.00 sec)

Of course, because the BEFORE level sends an additional SYNC_BEFORE_EXECUTION message, it adds one network round trip and has some impact on performance.

6. Some thoughts

  1. In AFTER mode, once a transaction on another node has entered the prepared phase but the user thread has not yet finished the commit, any newly started transaction must wait for that commit. Is this really necessary? Under the current design, this requirement makes nodes unavailable to a certain degree.
  2. AFTER mode has to wait for all nodes to commit successfully. It is a strongly consistent data-synchronization scheme, but it can render the cluster unavailable as a whole. Moreover, because group_replication_consistency is a session-level variable, a single user connection enabling AFTER can make the whole cluster unavailable. An alternative is the majority-AFTER mode of the open-source GreatSQL branch (group_replication_majority_after_mode = ON), which avoids the problems above; see the sketch after this list.
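For reference, enabling the majority-AFTER mode mentioned above is a one-line change (GreatSQL branch only; this variable does not exist in upstream MySQL, and its exact behavior should be checked against the GreatSQL documentation):

    mysql> set global group_replication_majority_after_mode = ON;
    mysql> select @@group_replication_majority_after_mode;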

7. Reference documents

  1. MySQL design worklog: https://dev.mysql.com/worklog/task/?id=10379
  2. MySQL user documentation: https://dev.mysql.com/doc/refman/8.0/en/group-replication-configuring-consistency-guarantees.html

Enjoy GreatSQL :)
