Source code analysis of Redis Cluster gossip

gossip

This article will be updated on GitHub; for the latest version, see redis Cluster -> gossip

cluster bus

Adapted from the Redis cluster-tutorial

Every Redis Cluster node needs two open TCP ports. The normal port, for example 6379, serves clients. The second port is the normal port plus 10000, for example 16379

The second, higher port is the cluster bus: a node-to-node channel that uses a binary protocol. Nodes use it for failure detection, configuration updates, failover authorization, and so on. Clients should always connect to the normal port and never to the cluster bus port. Make sure both ports are open in your firewall, otherwise the cluster nodes will not be able to communicate with each other

The offset between the command port and the cluster bus port is fixed at 10000
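The port arithmetic above can be sketched in a few lines of C. This is only an illustration: the helper name cluster_bus_port and the constant BUS_PORT_OFFSET are hypothetical and do not come from the Redis source.

```c
#include <assert.h>

/* Fixed offset between the client port and the cluster bus port,
 * as described in the text. Hypothetical name, for illustration only. */
#define BUS_PORT_OFFSET 10000

/* Returns the cluster bus port for a given client port, or -1 if the
 * result would fall outside the valid TCP port range (1..65535). */
int cluster_bus_port(int client_port) {
    /* Precondition: the caller passes a valid TCP port. */
    assert(client_port > 0 && client_port <= 65535);
    if (client_port + BUS_PORT_OFFSET > 65535) return -1;
    return client_port + BUS_PORT_OFFSET;
}
```

For a node listening on 6379, cluster_bus_port(6379) gives 16379, which is the second port that has to be opened between the nodes.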

For a cluster to work properly, each node needs the following:

The normal client port (usually 6379) must be reachable by every client that needs to talk to the cluster, and by all other cluster nodes, which use it for key migration.

The cluster bus port (client port + 10000) must be reachable from all other cluster nodes.

If you do not open both TCP ports, the cluster will not work as expected

The cluster bus uses a different, binary protocol for node-to-node data exchange. Compared with the client protocol, it uses less bandwidth and processing time, which makes it better suited to exchanging information between nodes

When are messages sent

Redis currently supports the following message types

#define CLUSTERMSG_TYPE_PING 0                  /* Ping */
#define CLUSTERMSG_TYPE_PONG 1                  /* Pong (sent in reply to a Ping) */
#define CLUSTERMSG_TYPE_MEET 2                  /* Meet: asks a node to join the cluster */
#define CLUSTERMSG_TYPE_FAIL 3                  /* Mark a node as failed */
#define CLUSTERMSG_TYPE_PUBLISH 4               /* Pub/Sub: propagate a published message */
#define CLUSTERMSG_TYPE_FAILOVER_AUTH_REQUEST 5 /* May I fail over? */
#define CLUSTERMSG_TYPE_FAILOVER_AUTH_ACK 6     /* Yes, you have my vote */
#define CLUSTERMSG_TYPE_UPDATE 7                /* Configuration information of a cluster node */
#define CLUSTERMSG_TYPE_MFSTART 8               /* Pause this node for a manual failover */
#define CLUSTERMSG_TYPE_MODULE 9                /* Cluster module API message */
#define CLUSTERMSG_TYPE_COUNT 10                /* Total number of message types */
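As a small illustration of how these numeric types can be handled, the sketch below maps a type value to a printable name and back. The helper functions and the short names in the table are hypothetical and chosen only for this example; only the numeric values are taken from the list above.

```c
#include <assert.h>
#include <string.h>

/* Total number of message types, copied from the list above. */
#define CLUSTERMSG_TYPE_COUNT 10

/* Printable names indexed by message type (names chosen for this sketch). */
static const char *sketch_type_names[] = {
    "ping", "pong", "meet", "fail", "publish",
    "auth-req", "auth-ack", "update", "mfstart", "module"
};

/* Hypothetical helper: returns a name for a message type, or "unknown". */
const char *cluster_msg_type_name(int type) {
    /* Sanity check: the table must have one entry per message type. */
    assert(sizeof(sketch_type_names) / sizeof(sketch_type_names[0])
           == CLUSTERMSG_TYPE_COUNT);
    if (type < 0 || type >= CLUSTERMSG_TYPE_COUNT) return "unknown";
    return sketch_type_names[type];
}

/* Hypothetical helper: returns the message type for a name, or -1. */
int cluster_msg_type_from_name(const char *name) {
    for (int t = 0; t < CLUSTERMSG_TYPE_COUNT; t++) {
        if (strcmp(sketch_type_names[t], name) == 0) return t;
    }
    return -1;
}
```

Keeping a COUNT value as the last entry makes it easy to size per-type tables such as this one.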

clusterCron iterates over every known node. If the current node has no link to the node being visited (node->link is NULL), it creates the connection and calls clusterSendPing on it to send a PING (or a MEET, if the node is still being introduced to the cluster)

clusterCron also sends a PING to one randomly chosen node every second

/* clusterCron runs 10 times per second, so every 10 iterations (about once per second) we PING one random node. */
if (!(iteration % 10)) {
    int j;

    /* Sample a few random nodes and PING the one with the oldest pong_received time. */
    for (j = 0; j < 5; j++) {
        de = dictGetRandomKey(server.cluster->nodes);
        clusterNode *this = dictGetVal(de);

        /* Skip nodes that have no link (disconnected) or already have a PING in flight. */
        if (this->link == NULL || this->ping_sent != 0) continue;
        if (this->flags & (CLUSTER_NODE_MYSELF|CLUSTER_NODE_HANDSHAKE))
            continue;
        if (min_pong_node == NULL || min_pong > this->pong_received) {
            min_pong_node = this;
            min_pong = this->pong_received;
        }
    }
    if (min_pong_node) {
        serverLog(LL_DEBUG,"Pinging node %.40s", min_pong_node->name);
        clusterSendPing(min_pong_node->link, CLUSTERMSG_TYPE_PING);
    }
}
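The sampling step above can be tried outside of Redis with a minimal stand-alone sketch. The sketch_node type and pick_ping_target function are hypothetical simplifications: an array with rand() replaces the node dictionary and dictGetRandomKey, and the flag check for MYSELF/HANDSHAKE is left out.

```c
#include <assert.h>
#include <stdlib.h>

/* Simplified stand-in for clusterNode (hypothetical, for illustration). */
typedef struct {
    int has_link;            /* stands in for node->link != NULL */
    long long ping_sent;     /* 0 means no PING is currently in flight */
    long long pong_received; /* time the last PONG was received, in ms */
} sketch_node;

/* Samples `samples` random entries and returns the index of the eligible
 * node with the oldest pong_received, or -1 if none was eligible. */
int pick_ping_target(const sketch_node *nodes, int count, int samples) {
    int best = -1;
    assert(count >= 0 && samples >= 0);
    if (count == 0) return -1;
    for (int j = 0; j < samples; j++) {
        int i = rand() % count;
        const sketch_node *n = &nodes[i];
        /* Same filter as above: skip disconnected nodes and pending PINGs. */
        if (!n->has_link || n->ping_sent != 0) continue;
        if (best == -1 || nodes[best].pong_received > n->pong_received)
            best = i;
    }
    return best;
}
```

Because only a few nodes are sampled per call, the node with the globally oldest PONG is not always the one chosen. The timeout-based pass that follows covers the nodes this random pass misses.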

clusterCron then iterates over all nodes again and sends a PING in either of the following two cases

/* If there is no PING in flight for this node and the last PONG was received
 * more than half the cluster timeout ago, send a new PING now. This keeps the
 * PING latency between the current node and every other node from growing too high. */
if (node->link &&
    node->ping_sent == 0 &&
    (now - node->pong_received) > server.cluster_node_timeout/2)
{
    clusterSendPing(node->link, CLUSTERMSG_TYPE_PING);
    continue;
}


/* If we are a master and one of our slaves has requested a manual failover, PING that slave continuously */
if (server.cluster->mf_end &&
    nodeIsMaster(myself) &&
    server.cluster->mf_slave == node &&
    node->link)
{
    clusterSendPing(node->link, CLUSTERMSG_TYPE_PING);
    continue;
}
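The first of the two conditions can be isolated as a small predicate, which makes the timing rule easier to reason about. This is a sketch with a hypothetical function name and plain parameters in place of the clusterNode fields.

```c
#include <assert.h>

/* Returns 1 if a node should be pinged because its last PONG is older than
 * half the node timeout, 0 otherwise. All times are in milliseconds.
 * Hypothetical helper that mirrors the condition in the excerpt above. */
int needs_ping(int has_link, long long ping_sent, long long pong_received,
               long long now, long long node_timeout) {
    assert(node_timeout > 0);
    return has_link &&
           ping_sent == 0 &&
           (now - pong_received) > node_timeout / 2;
}
```

For example, with a node timeout of 15000 ms, a node whose last PONG arrived 10000 ms ago is pinged again (10000 > 7500), while one that answered 5000 ms ago is left alone.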

So, if we are nodeC, we keep connections to nodeA, nodeB, and all slaves of nodeA, and keep sending them PING messages according to the rules above

ping

[Figure: structure of the message exchanged between two nodes]

[Figure: structure and content of the PING message]

clusterMsg carries information such as the epoch numbers, the slot bitmap, the normal port number, and the cluster bus port number
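To make the fields mentioned above more concrete, here is a heavily simplified sketch of such a header. It is not the real clusterMsg layout: the struct name, field order, and helper functions are assumptions for illustration, and many fields are omitted. The sizes used (40-character node names, 16384 hash slots) are those of Redis Cluster.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_NAMELEN 40     /* length of a node name */
#define SKETCH_SLOTS   16384  /* number of hash slots */

/* Simplified, hypothetical message header (not the real clusterMsg). */
typedef struct {
    char sig[4];                             /* signature bytes */
    uint16_t type;                           /* message type */
    uint16_t port;                           /* normal (client) port */
    uint16_t cport;                          /* cluster bus port */
    uint64_t current_epoch;                  /* epoch as seen by the sender */
    uint64_t config_epoch;                   /* config epoch of the sender */
    char sender[SKETCH_NAMELEN];             /* name of the sending node */
    unsigned char myslots[SKETCH_SLOTS / 8]; /* one bit per hash slot */
} sketch_msg;

/* Zeroes the header and fills in both port numbers. */
void sketch_msg_init(sketch_msg *m, uint16_t port) {
    memset(m, 0, sizeof(*m));
    m->port = port;
    m->cport = (uint16_t)(port + 10000);
}

/* Marks a slot as served by the sender in the slot bitmap. */
void sketch_set_slot(sketch_msg *m, int slot) {
    assert(slot >= 0 && slot < SKETCH_SLOTS);
    m->myslots[slot >> 3] |= (unsigned char)(1 << (slot & 7));
}

/* Returns 1 if the bitmap says the sender serves the slot, 0 otherwise. */
int sketch_has_slot(const sketch_msg *m, int slot) {
    assert(slot >= 0 && slot < SKETCH_SLOTS);
    return (m->myslots[slot >> 3] >> (slot & 7)) & 1;
}
```

With one bit per slot, the bitmap for all 16384 slots takes 2048 bytes, which is why a node can describe everything it serves in every message.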

In the clusterMsgData field, each message also carries information about some of the cluster nodes known to the sender. After a few rounds of PING/PONG exchange, every node eventually learns the up-to-date state of all other nodes

The number of node entries actually placed in clusterMsgData is min(freshnodes, wanted)

/* freshnodes is the maximum number of node entries we can hope to append:
 * the number of known nodes minus 2 (ourselves and the node we are sending to).
 * In practice there may be fewer usable nodes than this, because nodes that are
 * in the handshake stage or disconnected are not considered.
 */
int freshnodes = dictSize(server.cluster->nodes)-2;

/* How many node entries do we want to add to the message? 1/10 of all nodes,
 * and at least 3. The reasoning behind 1/10 is given in the comments of
 * redis/src/cluster.c: at this rate, failure reports from other nodes
 * can be expected to arrive within a bounded period of time.
 */
wanted = floor(dictSize(server.cluster->nodes)/10);
if (wanted < 3) wanted = 3;
if (wanted > freshnodes) wanted = freshnodes;

The nodes whose information is placed in clusterMsgData are chosen at random
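The min(freshnodes, wanted) rule is easy to check with concrete cluster sizes. The function below is a stand-alone sketch with a hypothetical name; it takes the total number of known nodes as a plain integer in place of dictSize(server.cluster->nodes).

```c
#include <assert.h>

/* Returns how many gossip entries a PING would carry for a cluster with
 * `total_nodes` known nodes (including ourselves and the target node).
 * Hypothetical helper that follows the rule described above. */
int gossip_entry_count(int total_nodes) {
    /* A PING always involves at least the sender and the receiver. */
    assert(total_nodes >= 2);
    int freshnodes = total_nodes - 2; /* everyone except us and the target */
    int wanted = total_nodes / 10;    /* 1/10 of all nodes ... */
    if (wanted < 3) wanted = 3;       /* ... but at least 3 */
    if (wanted > freshnodes) wanted = freshnodes;
    return wanted;
}
```

In a 6-node cluster each PING carries 3 entries, in a 100-node cluster 10 entries, and in a 4-node cluster only 2, because just two other nodes exist besides the sender and the receiver.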

pong

Every PING message is answered with a PONG message. A PONG is built in almost exactly the same way as a PING; the only difference is the value of hdr->type

int clusterProcessPacket(clusterLink *link) {
    /* ... */
    /* PING and MEET messages must be answered with a PONG */
    if (type == CLUSTERMSG_TYPE_PING || type == CLUSTERMSG_TYPE_MEET) {
        /* ... */
        clusterSendPing(link,CLUSTERMSG_TYPE_PONG);
    }
    /* ... */
}

/* Build the message header. hdr must point to a buffer of at least sizeof(clusterMsg) bytes */
void clusterBuildMessageHdr(clusterMsg *hdr, int type) {
    /* ... */
    hdr->ver = htons(CLUSTER_PROTO_VER);
    hdr->sig[0] = 'R';
    hdr->sig[1] = 'C';
    hdr->sig[2] = 'm';
    hdr->sig[3] = 'b';
    hdr->type = htons(type);
    memcpy(hdr->sender,myself->name,CLUSTER_NAMELEN);
    /* ... */
}
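The header excerpt above shows two things: a 4-byte "RCmb" signature, and 16-bit fields converted to network byte order with htons. A receiver has to reverse both steps. The sketch below does this on a reduced, hypothetical header type; the struct, the function names, and the protocol version value 1 are assumptions for illustration. It relies on the POSIX header arpa/inet.h for htons and ntohs.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_PROTO_VER 1 /* assumed protocol version for this sketch */

/* Reduced, hypothetical header: signature, version and type only. */
typedef struct {
    char sig[4];
    uint16_t ver;  /* stored in network byte order */
    uint16_t type; /* stored in network byte order */
} sketch_hdr;

/* Sender side: fill in the signature and convert fields with htons. */
void sketch_build_hdr(sketch_hdr *hdr, uint16_t type) {
    assert(hdr != NULL);
    memset(hdr, 0, sizeof(*hdr));
    memcpy(hdr->sig, "RCmb", 4);
    hdr->ver = htons(SKETCH_PROTO_VER);
    hdr->type = htons(type);
}

/* Receiver side: accept only headers with the expected signature and version. */
int sketch_hdr_is_valid(const sketch_hdr *hdr) {
    assert(hdr != NULL);
    return memcmp(hdr->sig, "RCmb", 4) == 0 &&
           ntohs(hdr->ver) == SKETCH_PROTO_VER;
}

/* Receiver side: convert the type back to host byte order. */
uint16_t sketch_hdr_type(const sketch_hdr *hdr) {
    assert(hdr != NULL);
    return ntohs(hdr->type);
}
```

Building a header with type 1 (PONG) and reading it back with sketch_hdr_type returns 1 on any platform, because the byte order conversion is applied symmetrically on both sides.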

If you are interested in other types of messages, please refer to the source code in redis/src/cluster.c


Tags: Redis github less

Posted on Sun, 19 Jan 2020 09:06:12 -0500 by timj