ZooKeeper's Election Algorithm and an In-Depth Explanation of the Split-Brain Problem

Introduction to ZK

ZK = ZooKeeper

ZK is a core component for service registration and discovery in microservice solutions, and a cornerstone of microservice architectures. It is not the only product that fills this role: Eureka and Consul are also widely recognized in the industry.

Here we only talk about ZK. The tool itself ships as a compressed package only a few megabytes in size. Installation is foolproof, and it supports cluster deployment.

Official address: https://zookeeper.apache.org/

Background

In a cluster environment, ZK works with the concepts of leader and follower, so it has to deal with node failures and define how to recover from them. ZK is written in Java and is open source on GitHub, but the official documentation says little about its internals. There are many scattered blog posts on the subject, and some of them are very good.

The question:

Each node in a ZK cluster is in exactly one state at any time, and by design the cluster must have one leader node in the LEADING state.

  • LOOKING: searching for a leader. The node believes there is currently no leader in the cluster, so it enters the leader election process.
  • FOLLOWING: follower state. The node receives synchronization and commands from the leading node.
  • LEADING: leader state.
  • OBSERVING: observer state, indicating that the current server is an observer.
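
As a rough illustration of the four states, the sketch below models them as an enum and checks a snapshot of node states. The class name ClusterStates and both helper methods are hypothetical and are not part of ZooKeeper; the only thing taken from the text is the set of state names and the rule that a cluster needs a node in the LEADING state.

```java
import java.util.Arrays;
import java.util.List;

final class ClusterStates {
    /** The four states named above; each node holds exactly one at a time. */
    enum NodeState { LOOKING, FOLLOWING, LEADING, OBSERVING }

    /** True when exactly one node in the snapshot is LEADING. */
    static boolean hasSingleLeader(List<NodeState> states) {
        long leaders = states.stream().filter(s -> s == NodeState.LEADING).count();
        return leaders == 1;
    }

    /** True when no node is LEADING, so the LOOKING nodes must run an election. */
    static boolean needsElection(List<NodeState> states) {
        return !states.contains(NodeState.LEADING);
    }

    public static void main(String[] args) {
        // Hypothetical snapshots: a cluster that has just started, and a healthy one.
        List<NodeState> starting = Arrays.asList(
                NodeState.LOOKING, NodeState.LOOKING, NodeState.LOOKING);
        List<NodeState> healthy = Arrays.asList(
                NodeState.FOLLOWING, NodeState.LEADING, NodeState.FOLLOWING, NodeState.OBSERVING);

        System.out.println("starting: needsElection=" + needsElection(starting));
        System.out.println("healthy: hasSingleLeader=" + hasSingleLeader(healthy));
    }
}
```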

Majority (Over-Half) Election Algorithm

ZK has had three election algorithms: LeaderElection, FastLeaderElection, and AuthFastLeaderElection. FastLeaderElection and AuthFastLeaderElection are similar; the only difference is that the latter adds authentication information. FastLeaderElection is more efficient than LeaderElection, and later versions retain only FastLeaderElection.

How it works:

In a cluster environment, when several nodes start, ZK first has to select one node as leader and put it in the LEADING state, so it faces an election problem. What is the election rule? It is the "over-half" (majority) rule: the node that receives votes from more than half of the ensemble wins, and its state changes from LOOKING to LEADING. Because a candidate only needs a majority rather than every vote, the election is efficient.

The official documentation describes this in the section "Clustered (Multi-Server) Setup" as follows:

As long as a majority of the ensemble are up, the service will be available. Because Zookeeper requires a majority, it is best to use an odd number of machines. For example, with four machines ZooKeeper can only handle the failure of a single machine; if two machines fail, the remaining two machines do not constitute a majority. However, with five machines ZooKeeper can handle the failure of two machines.
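
The arithmetic in the quoted passage can be sketched as follows. This is a minimal illustration, not ZooKeeper code: the class QuorumMath and its method names are made up for this example, and it assumes a plain majority quorum in which every server has one vote.

```java
final class QuorumMath {
    /** Smallest number of servers that is strictly more than half of the ensemble. */
    static int majority(int ensembleSize) {
        return ensembleSize / 2 + 1;
    }

    /** How many servers may fail while a majority is still up. */
    static int tolerableFailures(int ensembleSize) {
        return ensembleSize - majority(ensembleSize);
    }

    public static void main(String[] args) {
        for (int n = 3; n <= 7; n++) {
            System.out.printf("servers=%d majority=%d tolerated failures=%d%n",
                    n, majority(n), tolerableFailures(n));
        }
    }
}
```

Four servers and three servers both tolerate one failure, while five servers tolerate two, which matches the documentation's example.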

An example with 5 servers:

  1. Server 1 starts. At this point it is the only server running, and the vote it sends out gets no response, so it stays in the LOOKING state.
  2. Server 2 starts. It communicates with Server 1, which started first, and the two exchange their votes. Because neither has historical data, Server 2, with the larger id, wins the comparison. However, fewer than a majority of the servers have agreed to elect it (a majority in this example is 3), so Servers 1 and 2 both stay in the LOOKING state.
  3. Server 3 starts. By the same reasoning, Server 3 has the largest id among Servers 1, 2 and 3, and this time three servers vote for it, which is a majority, so it becomes the leader of this election.
  4. Server 4 starts. By the same reasoning, Server 4 should in theory win because it has the largest id among Servers 1, 2, 3 and 4, but since a majority of the servers have already elected Server 3, it can only become a follower.
  5. Server 5 starts and, like Server 4, becomes a follower.
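
The five steps above can be sketched as a small simulation. It is a deliberately simplified model and not how FastLeaderElection exchanges messages: it assumes servers start one at a time in id order, that no server has historical data (so only the id decides), and that every running server immediately votes for the largest id it knows. The class name StartupElectionSketch and the method electDuringStartup are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

final class StartupElectionSketch {
    /**
     * Starts servers 1..ensembleSize one at a time and returns the id of the
     * server that becomes leader, or -1 if a majority is never reached.
     */
    static int electDuringStartup(int ensembleSize) {
        int majority = ensembleSize / 2 + 1;
        List<Integer> running = new ArrayList<>();
        int leader = -1;

        for (int id = 1; id <= ensembleSize; id++) {
            running.add(id);
            if (leader != -1) {
                // A majority has already elected a leader, so a newcomer just follows it.
                System.out.println("server " + id + " starts -> FOLLOWING server " + leader);
                continue;
            }
            // With empty history, every running server votes for the largest id.
            int candidate = Collections.max(running);
            int votes = running.size();
            if (votes >= majority) {
                leader = candidate;
                System.out.println("server " + id + " starts -> server " + leader
                        + " is LEADING with " + votes + " votes");
            } else {
                System.out.println("server " + id + " starts -> " + votes
                        + " vote(s) for server " + candidate + ", still LOOKING");
            }
        }
        return leader;
    }

    public static void main(String[] args) {
        System.out.println("leader = " + electDuringStartup(5));
    }
}
```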

Source code walkthrough:

Source file: FastLeaderElection

/**
 * Starts a new round of leader election. Whenever our QuorumPeer
 * changes its state to LOOKING, this method is invoked, and it
 * sends notifications to all other peers.
 */
public Vote lookForLeader() throws InterruptedException {
    try {
        self.jmxLeaderElectionBean = new LeaderElectionBean();
        MBeanRegistry.getInstance().register(self.jmxLeaderElectionBean, self.jmxLocalPeerBean);
    } catch (Exception e) {
        LOG.warn("Failed to register with JMX", e);
        self.jmxLeaderElectionBean = null;
    }

    self.start_fle = Time.currentElapsedTime();
    try {
        Map<Long, Vote> recvset = new HashMap<Long, Vote>();

        Map<Long, Vote> outofelection = new HashMap<Long, Vote>();

        int notTimeout = minNotificationInterval;

        synchronized (this) {
            logicalclock.incrementAndGet();
            updateProposal(getInitId(), getInitLastLoggedZxid(), getPeerEpoch());
        }

        LOG.info("New election. My id =  " + self.getId() + ", proposed zxid=0x" + Long.toHexString(proposedZxid));
        sendNotifications();

        SyncedLearnerTracker voteSet;

        /*
         * Loop in which we exchange notifications until we find a leader
         */

        while ((self.getPeerState() == ServerState.LOOKING) && (!stop)) {
            /*
             * Remove next notification from queue, times out after 2 times
             * the termination time
             */
            Notification n = recvqueue.poll(notTimeout, TimeUnit.MILLISECONDS);

            /*
             * Sends more notifications if haven't received enough.
             * Otherwise processes new notification.
             */
            if (n == null) {
                if (manager.haveDelivered()) {
                    sendNotifications();
                } else {
                    manager.connectAll();
                }

                /*
                 * Exponential backoff
                 */
                int tmpTimeOut = notTimeout * 2;
                notTimeout = (tmpTimeOut < maxNotificationInterval ? tmpTimeOut : maxNotificationInterval);
                LOG.info("Notification time out: " + notTimeout);
            } else if (validVoter(n.sid) && validVoter(n.leader)) {
                /*
                 * Only proceed if the vote comes from a replica in the current or next
                 * voting view for a replica in the current or next voting view.
                 */
                switch (n.state) {
                case LOOKING:
                    if (getInitLastLoggedZxid() == -1) {
                        LOG.debug("Ignoring notification as our zxid is -1");
                        break;
                    }
                    if (n.zxid == -1) {
                        LOG.debug("Ignoring notification from member with -1 zxid {}", n.sid);
                        break;
                    }
                    // If notification > current, replace and send messages out
                    if (n.electionEpoch > logicalclock.get()) {
                        logicalclock.set(n.electionEpoch);
                        recvset.clear();
                        if (totalOrderPredicate(n.leader, n.zxid, n.peerEpoch, getInitId(), getInitLastLoggedZxid(), getPeerEpoch())) {
                            updateProposal(n.leader, n.zxid, n.peerEpoch);
                        } else {
                            updateProposal(getInitId(), getInitLastLoggedZxid(), getPeerEpoch());
                        }
                        sendNotifications();
                    } else if (n.electionEpoch < logicalclock.get()) {
                        if (LOG.isDebugEnabled()) {
                            LOG.debug(
                                "Notification election epoch is smaller than logicalclock. n.electionEpoch = 0x" + Long.toHexString(n.electionEpoch)
                                + ", logicalclock=0x" + Long.toHexString(logicalclock.get()));
                        }
                        break;
                    } else if (totalOrderPredicate(n.leader, n.zxid, n.peerEpoch, proposedLeader, proposedZxid, proposedEpoch)) {
                        updateProposal(n.leader, n.zxid, n.peerEpoch);
                        sendNotifications();
                    }

                    if (LOG.isDebugEnabled()) {
                        LOG.debug("Adding vote: from=" + n.sid
                                  + ", proposed leader=" + n.leader
                                  + ", proposed zxid=0x" + Long.toHexString(n.zxid)
                                  + ", proposed election epoch=0x" + Long.toHexString(n.electionEpoch));
                    }

                    // don't care about the version if it's in LOOKING state
                    recvset.put(n.sid, new Vote(n.leader, n.zxid, n.electionEpoch, n.peerEpoch));

                    voteSet = getVoteTracker(recvset, new Vote(proposedLeader, proposedZxid, logicalclock.get(), proposedEpoch));

                    if (voteSet.hasAllQuorums()) {

                        // Verify if there is any change in the proposed leader
                        while ((n = recvqueue.poll(finalizeWait, TimeUnit.MILLISECONDS)) != null) {
                            if (totalOrderPredicate(n.leader, n.zxid, n.peerEpoch, proposedLeader, proposedZxid, proposedEpoch)) {
                                recvqueue.put(n);
                                break;
                            }
                        }

                        /*
                         * This predicate is true once we don't read any new
                         * relevant message from the reception queue
                         */
                        if (n == null) {
                            setPeerState(proposedLeader, voteSet);
                            Vote endVote = new Vote(proposedLeader, proposedZxid, logicalclock.get(), proposedEpoch);
                            leaveInstance(endVote);
                            return endVote;
                        }
                    }
                    break;
                case OBSERVING:
                    LOG.debug("Notification from observer: {}", n.sid);
                    break;
                case FOLLOWING:
                case LEADING:
                    /*
                     * Consider all notifications from the same epoch
                     * together.
                     */
                    if (n.electionEpoch == logicalclock.get()) {
                        recvset.put(n.sid, new Vote(n.leader, n.zxid, n.electionEpoch, n.peerEpoch));
                        voteSet = getVoteTracker(recvset, new Vote(n.version, n.leader, n.zxid, n.electionEpoch, n.peerEpoch, n.state));
                        if (voteSet.hasAllQuorums() && checkLeader(outofelection, n.leader, n.electionEpoch)) {
                            setPeerState(n.leader, voteSet);
                            Vote endVote = new Vote(n.leader, n.zxid, n.electionEpoch, n.peerEpoch);
                            leaveInstance(endVote);
                            return endVote;
                        }
                    }

                    /*
                     * Before joining an established ensemble, verify that
                     * a majority are following the same leader.
                     */
                    outofelection.put(n.sid, new Vote(n.version, n.leader, n.zxid, n.electionEpoch, n.peerEpoch, n.state));
                    voteSet = getVoteTracker(outofelection, new Vote(n.version, n.leader, n.zxid, n.electionEpoch, n.peerEpoch, n.state));

                    if (voteSet.hasAllQuorums() && checkLeader(outofelection, n.leader, n.electionEpoch)) {
                        synchronized (this) {
                            logicalclock.set(n.electionEpoch);
                            setPeerState(n.leader, voteSet);
                        }
                        Vote endVote = new Vote(n.leader, n.zxid, n.electionEpoch, n.peerEpoch);
                        leaveInstance(endVote);
                        return endVote;
                    }
                    break;
                default:
                    LOG.warn("Notification state unrecoginized: " + n.state + " (n.state), " + n.sid + " (n.sid)");
                    break;
                }
            } else {
                if (!validVoter(n.leader)) {
                    LOG.warn("Ignoring notification for non-cluster member sid {} from sid {}", n.leader, n.sid);
                }
                if (!validVoter(n.sid)) {
                    LOG.warn("Ignoring notification for sid {} from non-quorum member sid {}", n.leader, n.sid);
                }
            }
        }
        return null;
    } finally {
        try {
            if (self.jmxLeaderElectionBean != null) {
                MBeanRegistry.getInstance().unregister(self.jmxLeaderElectionBean);
            }
        } catch (Exception e) {
            LOG.warn("Failed to unregister with JMX", e);
        }
        self.jmxLeaderElectionBean = null;
        LOG.debug("Number of connection processing threads: {}", manager.getConnectionThreadCount());
    }
}

Votes are compared by totalOrderPredicate, whose ordering rule is:

/*
 * We return true if one of the following three cases hold:
 * 1- New epoch is higher
 * 2- New epoch is the same as current epoch, but new zxid is higher
 * 3- New epoch is the same as current epoch, new zxid is the same
 *  as current zxid, but server id is higher.
 */
return ((newEpoch > curEpoch)
        || ((newEpoch == curEpoch)
            && ((newZxid > curZxid)
                || ((newZxid == curZxid)
                    && (newId > curId)))));
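
To make the three cases concrete, here is a standalone sketch that wraps the same boolean expression in a static method and tries a few votes. The wrapper class VoteOrderSketch and the sample values are assumptions for illustration; the parameter order follows the calls shown in lookForLeader (id, zxid, epoch for the new vote, then the same for the current one), and anything else the real method does around this expression is left out.

```java
final class VoteOrderSketch {
    /** Returns true when the new vote should replace the current one. */
    static boolean totalOrderPredicate(long newId, long newZxid, long newEpoch,
                                       long curId, long curZxid, long curEpoch) {
        return (newEpoch > curEpoch)
                || ((newEpoch == curEpoch)
                    && ((newZxid > curZxid)
                        || ((newZxid == curZxid) && (newId > curId))));
    }

    public static void main(String[] args) {
        // Case 1: a higher epoch wins even with a lower zxid and a lower id.
        System.out.println(totalOrderPredicate(1, 0x10, 2, 3, 0x50, 1));
        // Case 2: same epoch, higher zxid wins even with a lower id.
        System.out.println(totalOrderPredicate(1, 0x20, 1, 3, 0x10, 1));
        // Case 3: same epoch and zxid, the higher server id wins.
        System.out.println(totalOrderPredicate(3, 0x10, 1, 2, 0x10, 1));
        // An identical vote does not replace the current one.
        System.out.println(totalOrderPredicate(2, 0x10, 1, 2, 0x10, 1));
    }
}
```

The ordering explains the startup example: with no historical data the epoch and zxid are equal on every server, so the largest server id decides.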

Split-brain problem

Split-brain can occur when the leader of a cluster appears to die. The followers elect a new leader, and then the former leader comes back. Because ZK's majority mechanism allows a certain number of machines to be lost while the service keeps running, more than one leader can exist if the nodes disagree about whether the old leader is dead.

The solution:

ZK's majority mechanism by itself reduces the likelihood of split-brain to some extent; at the very least there cannot be three leaders at the same time. In addition, ZK has an epoch mechanism (a logical clock) that is incremented by 1 for every election. When nodes communicate, they compare epochs. If the incoming epoch is lower than the node's own, the message is discarded. If it is higher, the node resets its own state to the newer epoch. If the two are equal, the votes are compared as usual.

// If notification > current, replace and send messages out
if (n.electionEpoch > logicalclock.get()) {
    logicalclock.set(n.electionEpoch);
    recvset.clear();
    if (totalOrderPredicate(n.leader, n.zxid, n.peerEpoch, getInitId(), getInitLastLoggedZxid(), getPeerEpoch())) {
        updateProposal(n.leader, n.zxid, n.peerEpoch);
    } else {
        updateProposal(getInitId(), getInitLastLoggedZxid(), getPeerEpoch());
    }
    sendNotifications();
} else if (n.electionEpoch < logicalclock.get()) {
    if (LOG.isDebugEnabled()) {
        LOG.debug(
            "Notification election epoch is smaller than logicalclock. n.electionEpoch = 0x" + Long.toHexString(n.electionEpoch)
            + ", logicalclock=0x" + Long.toHexString(logicalclock.get()));
    }
    break;
} else if (totalOrderPredicate(n.leader, n.zxid, n.peerEpoch, proposedLeader, proposedZxid, proposedEpoch)) {
    updateProposal(n.leader, n.zxid, n.peerEpoch);
    sendNotifications();
}
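
The three branches of the excerpt above can be reduced to a small decision function. This is only a sketch of the epoch comparison: the class EpochCheckSketch, the Action enum, and the example round numbers are hypothetical, and the real code additionally clears recvset, updates the proposal, and sends notifications.

```java
final class EpochCheckSketch {
    /** What a LOOKING node does with a notification, based only on the election epoch. */
    enum Action { ADOPT_NEWER_ROUND, DISCARD_STALE, COMPARE_VOTES }

    static Action classify(long notificationEpoch, long localLogicalClock) {
        if (notificationEpoch > localLogicalClock) {
            // The sender is in a newer round: move to it and start over with fresh votes.
            return Action.ADOPT_NEWER_ROUND;
        } else if (notificationEpoch < localLogicalClock) {
            // The sender is in an older round: its vote is ignored.
            return Action.DISCARD_STALE;
        }
        // Same round: decide between the votes with totalOrderPredicate.
        return Action.COMPARE_VOTES;
    }

    public static void main(String[] args) {
        long localClock = 5; // hypothetical: this node is already in election round 5
        System.out.println(classify(4, localClock)); // a node still in round 4
        System.out.println(classify(6, localClock)); // a node that is already in round 6
        System.out.println(classify(5, localClock)); // a node in the same round
    }
}
```

A former leader that comes back while still in an older round falls into the DISCARD_STALE branch, so its votes do not influence the newer election.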

Deployment principles under the majority election strategy:

  1. Deploy the cluster with an odd number of servers, such as 3, 5, 7, and so on. An odd count is the configuration with which a leader is most easily elected.
  2. The maximum number of nodes ZK can lose follows from the rule that more than half of the ensemble must remain up. Adding servers that do not raise that number is a waste.
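
One way to see why an odd number of servers is recommended is to look at a network partition that splits the ensemble into two sides. The sketch below is an illustration under the assumption of a plain majority quorum; the class PartitionSketch and its methods are hypothetical names. It counts the two-way splits in which neither side can elect a leader and checks that no split ever lets both sides do so.

```java
final class PartitionSketch {
    /** A side of a partition has a quorum only with strictly more than half of all servers. */
    static boolean hasQuorum(int partitionSize, int ensembleSize) {
        return partitionSize > ensembleSize / 2;
    }

    /** Counts the left/right splits of n servers in which neither side has a quorum. */
    static int splitsWithoutQuorum(int n) {
        int count = 0;
        for (int left = 0; left <= n; left++) {
            int right = n - left;
            if (!hasQuorum(left, n) && !hasQuorum(right, n)) {
                count++;
            }
        }
        return count;
    }

    /** True if some split lets both sides hold a quorum, which would allow two leaders. */
    static boolean bothSidesCanHaveQuorum(int n) {
        for (int left = 0; left <= n; left++) {
            if (hasQuorum(left, n) && hasQuorum(n - left, n)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        for (int n = 3; n <= 6; n++) {
            System.out.println("servers=" + n
                    + " splits with no quorum=" + splitsWithoutQuorum(n)
                    + " both sides with quorum=" + bothSidesCanHaveQuorum(n));
        }
    }
}
```

With an even ensemble there is one split (exactly half on each side) that leaves the whole cluster without a quorum; with an odd ensemble one side always keeps it.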

The detailed algorithm is quite complex and has to handle many situations. It includes the concept of an epoch (a self-incrementing counter), divided into the logical epoch and the election epoch, and every vote has to be checked to see whether it belongs to the same voting round, and so on.

Summary

In the day-to-day operation and maintenance of ZK, we need to watch for the scenarios above in extreme cases, especially the occurrence of split-brain.

When thinking about ZK strategies, we often run into the two problems above. I have organized my thoughts here to make them easier to understand and to review later. Special thanks to the authors of the blog posts below for sharing.

Author: Owen Jia

Follow his blog: Owen Blog

Referenced blog posts:

zookeeper3.3.5

Understanding the election mechanism of ZooKeeper

Principles of the ZooKeeper Election Algorithm

After reading this article, you will have a clear idea of ZooKeeper.

What is split-brain? How does ZooKeeper solve this problem?

ZooKeeper's split-brain problem

Posted on Tue, 24 Sep 2019 00:01:03 -0400 by juddster