Preface:
The previous article analyzed the main startup process of Zookeeper in cluster mode, one step of which is the election of the Leader node.
Because every transactional request in Zookeeper is executed by the Leader node, Leader election is critical. This article looks at when Leader election is triggered and how it is carried out.
1. Roles of Zookeeper cluster nodes
In a Zookeeper cluster, a node takes one of the following three roles:
Node role | Main functions |
Leader | 1. Processes transactional requests |
Follower | 1. Handles non-transactional requests and forwards transactional requests to the Leader 2. Votes on transaction Proposals 3. Votes in Leader election |
Observer | 1. Handles non-transactional requests and forwards transactional requests to the Leader |
The difference between an Observer and a Follower is that an Observer votes neither in Leader election nor in the more-than-half acknowledgement that commits a write. Adding Observers therefore improves the read performance of the cluster without affecting write performance.
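To make the "more than half" rule concrete, here is a minimal sketch of how the number of required votes could be computed when Observers are excluded. The class and method names are hypothetical and are not taken from the Zookeeper source.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical model: only the role names come from the table above.
class QuorumSketch {
    enum Role { LEADER, FOLLOWER, OBSERVER }

    // Number of voting members; Observers are not counted.
    static int votingMembers(List<Role> cluster) {
        int n = 0;
        for (Role r : cluster) {
            if (r != Role.OBSERVER) {
                n++;
            }
        }
        return n;
    }

    // Smallest count that is strictly more than half of the voting members.
    static int quorumSize(List<Role> cluster) {
        return votingMembers(cluster) / 2 + 1;
    }

    public static void main(String[] args) {
        List<Role> three = Arrays.asList(Role.LEADER, Role.FOLLOWER, Role.FOLLOWER);
        List<Role> withObservers = Arrays.asList(Role.LEADER, Role.FOLLOWER, Role.FOLLOWER,
                                                 Role.OBSERVER, Role.OBSERVER);
        System.out.println(quorumSize(three));         // 2
        System.out.println(quorumSize(withObservers)); // 2, the Observers do not raise it
    }
}
```

Under this model, adding two Observers to a three-node ensemble leaves the quorum at 2, which is why read capacity can grow without making writes wait for more acknowledgements.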
2. Timing of Leader election
When is a Leader elected? There are two main occasions:
* Server initialization and startup (when the cluster has just started there is no Leader yet, so one must be elected)
* The Leader fails while the cluster is running (the Leader's service is abnormal and the other nodes can no longer communicate with it, so a new Leader is elected)
Recall the node states analyzed earlier: LOOKING, FOLLOWING, LEADING, OBSERVING.
While an election is in progress, a node is in the LOOKING state. When the election completes, the node moves to LEADING, FOLLOWING or OBSERVING according to its role.
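The state transitions described above can be sketched as a small function. The state names match the article, while the class, the method names and the `observer` flag are assumptions made for illustration.

```java
// Hypothetical sketch of the state transitions around an election.
class PeerStateSketch {
    enum ServerState { LOOKING, FOLLOWING, LEADING, OBSERVING }

    // State a node moves to once the election has produced leaderId.
    static ServerState afterElection(long myId, long leaderId, boolean observer) {
        if (myId == leaderId) {
            return ServerState.LEADING;
        }
        // A non-leader becomes a learner: OBSERVING for Observers, FOLLOWING otherwise.
        return observer ? ServerState.OBSERVING : ServerState.FOLLOWING;
    }

    // When the Leader can no longer be reached, every node goes back to LOOKING.
    static ServerState onLeaderLost() {
        return ServerState.LOOKING;
    }

    public static void main(String[] args) {
        System.out.println(afterElection(3, 3, false)); // LEADING
        System.out.println(afterElection(1, 3, false)); // FOLLOWING
        System.out.println(afterElection(4, 3, true));  // OBSERVING
    }
}
```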
3. Leader election process
During cluster startup, Leader election can be divided into the following steps:
1) Each node casts a vote for itself as Leader and sends that vote to the other nodes [a vote is (myid, ZXID)];
2) Each node receives the votes of the other nodes, checks that they are valid, and processes them.
Processing is simple: the node compares (PK) each received vote with its own vote:
* Compare the ZXID of the two votes first; the vote with the larger ZXID wins;
* If the ZXIDs are equal, the vote with the larger myid wins.
After the comparison, the node re-sends the winning vote to the other nodes.
3) Count the votes
After each round, a server counts the votes it holds to check whether more than half of the machines have cast the same vote. If so, the machine named by that vote is recognized as the Leader.
4) Change server state
Once step 3) has selected a Leader, each server sets its own role, Leader or Follower, according to the result.
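Steps 2) and 3) can be sketched in a few lines. This is a simplified model that uses only (myid, ZXID) as described above; the class and method names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified model of the vote comparison (PK) and the vote count.
class VotePkSketch {
    // A vote names the proposed Leader by its myid and ZXID.
    static final class Vote {
        final long myid;
        final long zxid;
        Vote(long myid, long zxid) { this.myid = myid; this.zxid = zxid; }
    }

    // Step 2: true if the received vote beats the vote currently held.
    static boolean beats(Vote received, Vote current) {
        if (received.zxid != current.zxid) {
            return received.zxid > current.zxid; // larger ZXID wins
        }
        return received.myid > current.myid;     // tie on ZXID: larger myid wins
    }

    // Step 3: returns the myid backed by more than half of clusterSize servers, or -1.
    // votesBySender maps the sender's myid to the vote it last sent.
    static long tally(Map<Long, Vote> votesBySender, int clusterSize) {
        Map<String, Integer> counts = new HashMap<>();
        for (Vote v : votesBySender.values()) {
            int c = counts.merge(v.myid + ":" + v.zxid, 1, Integer::sum);
            if (c > clusterSize / 2) {
                return v.myid;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(beats(new Vote(3, 0), new Vote(1, 0))); // true
        System.out.println(beats(new Vote(1, 5), new Vote(3, 4))); // true, ZXID is compared first
    }
}
```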
Example:
In the previous article we built a Zookeeper cluster in pseudo-cluster mode, with myid set as follows:
Server | myid |
zookeeper_3 | 1 |
zookeeper_2 | 2 |
zookeeper_1 | 3 |
When the nodes of the cluster start, they follow the Leader election process above.
1) In the first round each node votes for itself as Leader. The votes are as follows:
Server | Vote (myid, ZXID) |
zookeeper_3 | (1,0) |
zookeeper_2 | (2,0) |
zookeeper_1 | (3,0) |
2) Receive and process votes
Each machine receives the votes of the other machines, compares (PK) each of them with its own vote, and re-sends the winner:
Server | Vote comparison (myid, ZXID) | Vote re-sent after processing |
zookeeper_3 | (1,0) PK (2,0), (1,0) PK (3,0) | (3,0) |
zookeeper_2 | (2,0) PK (3,0), (2,0) PK (1,0) | (3,0) |
zookeeper_1 | (3,0) PK (2,0), (3,0) PK (1,0) | (3,0) |
With the results of the first round, the nodes enter a second round and find that (3,0) is accepted by more than half of the machines (here all three), so zookeeper_1 is elected Leader.
3) Change server state
Server | Node role |
zookeeper_3 | Follower |
zookeeper_2 | Follower |
zookeeper_1 | Leader |
The above describes Leader election at initial startup. How does re-election work while the cluster is running?
A: The process is essentially the same. After the Leader goes down, the whole cluster stops serving external requests and enters a new round of Leader election. Each node first changes its state to LOOKING and then votes again following the steps above, and votes are compared (PK) in the same way. Once the new Leader is determined, every node takes up its role and the cluster resumes serving external requests.
So far we have described Leader election in words. Let's now analyze the whole process from the source code.
4. Leader election entry analysis
The entry point of Leader election, as analyzed in the previous article, is the QuorumPeer.startLeaderElection() method.
public class QuorumPeer extends ZooKeeperThread implements QuorumStats.Provider {
    // ip:port of the current node
    private InetSocketAddress myQuorumAddr;

    public InetSocketAddress getQuorumAddress(){
        return myQuorumAddr;
    }

    // Leader election algorithm type
    private int electionType;
    Election electionAlg;

    synchronized public void startLeaderElection() {
        try {
            // Create the current vote
            currentVote = new Vote(myid, getLastLoggedZxid(), getCurrentEpoch());
        } catch(IOException e) {
            RuntimeException re = new RuntimeException(e.getMessage());
            re.setStackTrace(e.getStackTrace());
            throw re;
        }
        for (QuorumServer p : getView().values()) {
            if (p.id == myid) {
                // Get the ip:port of the current node
                myQuorumAddr = p.addr;
                break;
            }
        }
        if (myQuorumAddr == null) {
            throw new RuntimeException("My id " + myid + " not in the peer list");
        }
        // In QuorumPeerConfig the default is electionAlg = 3
        if (electionType == 0) {
            try {
                udpSocket = new DatagramSocket(myQuorumAddr.getPort());
                responder = new ResponderThread();
                responder.start();
            } catch (SocketException e) {
                throw new RuntimeException(e);
            }
        }
        // Get the Leader election algorithm here
        this.electionAlg = createElectionAlgorithm(electionType);
    }

    protected Election createElectionAlgorithm(int electionAlgorithm){
        Election le=null;

        //TODO: use a factory rather than a switch
        switch (electionAlgorithm) {
        case 0:
            le = new LeaderElection(this);
            break;
        case 1:
            le = new AuthFastLeaderElection(this);
            break;
        case 2:
            le = new AuthFastLeaderElection(this, true);
            break;
        case 3:
            qcm = createCnxnManager();
            // Listens for connections from the other Zookeeper nodes. See section 5 for details
            QuorumCnxManager.Listener listener = qcm.listener;
            if(listener != null){
                listener.start();
                // FastLeaderElection is the default election algorithm; it is analyzed in section 6
                le = new FastLeaderElection(this, qcm);
            } else {
                LOG.error("Null listener when initializing cnx manager");
            }
            break;
        default:
            assert false;
        }
        return le;
    }
}
According to the source code, Zookeeper 3.4.13 uses FastLeaderElection as its default Leader election algorithm.
Communication with the other nodes is implemented by QuorumCnxManager. Let's take a brief look at its implementation first.
5. Communication mechanism of QuorumCnxManager
5.1 Basic parameters of QuorumCnxManager
We start with the call QuorumPeer makes to its constructor:
public QuorumCnxManager createCnxnManager() {
    return new QuorumCnxManager(this.getId(),
                                this.getView(),
                                this.authServer,
                                this.authLearner,
                                this.tickTime * this.syncLimit,
                                this.getQuorumListenOnAllIPs(),
                                this.quorumCnxnThreadsSize,
                                this.isQuorumSaslAuthEnabled());
}
The constructor that is finally called sets the following fields (it takes one more argument, senderWorkerMap, than the call above, which reaches it through an overload):
public class QuorumCnxManager {
    // Each SendWorker is a sending thread that sends messages to one node
    final ConcurrentHashMap<Long, SendWorker> senderWorkerMap;
    // Received messages are stored in this queue
    public final ArrayBlockingQueue<Message> recvQueue;
    // Send queue for each sid
    final ConcurrentHashMap<Long, ArrayBlockingQueue<ByteBuffer>> queueSendMap;
    // The last message sent to each sid
    final ConcurrentHashMap<Long, ByteBuffer> lastMessageSent;
    // Connection timeout
    private int cnxTO = 5000;
    // The configured myid
    final long mySid;
    // Read/write timeout
    final int socketTimeout;
    // Node information (ip, port, etc.) for each sid
    final Map<Long, QuorumPeer.QuorumServer> view;
    // Listener thread, which creates the ServerSocket and accepts connections from other nodes
    public final Listener listener;

    public QuorumCnxManager(final long mySid,
                            Map<Long,QuorumPeer.QuorumServer> view,
                            QuorumAuthServer authServer,
                            QuorumAuthLearner authLearner,
                            int socketTimeout,
                            boolean listenOnAllIPs,
                            int quorumCnxnThreadsSize,
                            boolean quorumSaslAuthEnabled,
                            ConcurrentHashMap<Long, SendWorker> senderWorkerMap) {
        this.senderWorkerMap = senderWorkerMap;

        this.recvQueue = new ArrayBlockingQueue<Message>(RECV_CAPACITY);
        this.queueSendMap = new ConcurrentHashMap<Long, ArrayBlockingQueue<ByteBuffer>>();
        this.lastMessageSent = new ConcurrentHashMap<Long, ByteBuffer>();
        String cnxToValue = System.getProperty("zookeeper.cnxTimeout");
        if(cnxToValue != null){
            this.cnxTO = Integer.parseInt(cnxToValue);
        }

        this.mySid = mySid;
        this.socketTimeout = socketTimeout;
        this.view = view;
        this.listenOnAllIPs = listenOnAllIPs;

        initializeAuth(mySid, authServer, authLearner, quorumCnxnThreadsSize,
                quorumSaslAuthEnabled);

        // Starts listener thread that waits for connection requests
        listener = new Listener();
    }
}
5.2 QuorumCnxManager.Listener
public class Listener extends ZooKeeperThread {

    volatile ServerSocket ss = null;

    @Override
    public void run() {
        int numRetries = 0;
        InetSocketAddress addr;
        while((!shutdown) && (numRetries < 3)){
            try {
                // Create the listening socket for the election port
                ss = new ServerSocket();
                ss.setReuseAddress(true);
                if (listenOnAllIPs) {
                    int port = view.get(QuorumCnxManager.this.mySid)
                        .electionAddr.getPort();
                    addr = new InetSocketAddress(port);
                } else {
                    addr = view.get(QuorumCnxManager.this.mySid)
                        .electionAddr;
                }
                LOG.info("My election bind port: " + addr.toString());
                setName(view.get(QuorumCnxManager.this.mySid)
                        .electionAddr.toString());
                ss.bind(addr);
                while (!shutdown) {
                    Socket client = ss.accept();
                    setSockOpts(client);
                    LOG.info("Received connection request "
                            + client.getRemoteSocketAddress());

                    if (quorumSaslAuthEnabled) {
                        receiveConnectionAsync(client);
                    } else {
                        // Accept the connection
                        receiveConnection(client);
                    }

                    numRetries = 0;
                }
            } catch (IOException e) {
                ...
            }
        }
        ...
    }

    // Handle connections from other nodes
    public void receiveConnection(final Socket sock) {
        DataInputStream din = null;
        try {
            din = new DataInputStream(
                    new BufferedInputStream(sock.getInputStream()));

            // Handled by handleConnection
            handleConnection(sock, din);
        } catch (IOException e) {
            LOG.error("Exception handling connection, addr: {}, closing server connection",
                     sock.getRemoteSocketAddress());
            closeSocket(sock);
        }
    }

    private void handleConnection(Socket sock, DataInputStream din)
            throws IOException {
        Long sid = null;
        try {
            // Read server id
            sid = din.readLong();
            if (sid < 0) { // this is not a server id but a protocol version (see ZOOKEEPER-1633)
                sid = din.readLong();

                // next comes the #bytes in the remainder of the message
                // note that 0 bytes is fine (old servers)
                int num_remaining_bytes = din.readInt();
                if (num_remaining_bytes < 0 || num_remaining_bytes > maxBuffer) {
                    LOG.error("Unreasonable buffer length: {}", num_remaining_bytes);
                    closeSocket(sock);
                    return;
                }
                byte[] b = new byte[num_remaining_bytes];

                // remove the remainder of the message from din
                int num_read = din.read(b);
                if (num_read != num_remaining_bytes) {
                    LOG.error("Read only " + num_read + " bytes out of " + num_remaining_bytes + " sent by server " + sid);
                }
            }
            ...
        } catch (IOException e) {
            closeSocket(sock);
            LOG.warn("Exception reading or writing challenge: " + e.toString());
            return;
        }

        // Connections are initiated from the higher sid to the lower sid.
        // If a node with a lower sid connected to this node, close that connection
        // and connect back to it from this side
        if (sid < this.mySid) {
            SendWorker sw = senderWorkerMap.get(sid);
            if (sw != null) {
                sw.finish();
            }

            LOG.debug("Create new connection to server: " + sid);
            closeSocket(sock);
            connectOne(sid);

        } else {
            // Create the SendWorker and RecvWorker for this connection; they send and receive later messages
            SendWorker sw = new SendWorker(sock, sid);
            RecvWorker rw = new RecvWorker(sock, din, sid, sw);
            sw.setRecv(rw);

            SendWorker vsw = senderWorkerMap.get(sid);

            if(vsw != null)
                vsw.finish();

            senderWorkerMap.put(sid, sw);
            queueSendMap.putIfAbsent(sid, new ArrayBlockingQueue<ByteBuffer>(SEND_CAPACITY));

            sw.start();
            rw.start();

            return;
        }
    }
}
QuorumCnxManager.Listener creates a listening socket on the election port and accepts connections from other nodes.
Note that a connection may only be initiated from a node with a higher sid to a node with a lower sid. A connection initiated in the other direction is closed, and the higher-sid node then connects back itself through connectOne(sid).
After a connection is established, a SendWorker thread and a RecvWorker thread are created for it to send and receive messages. Readers can study the bodies of these two threads on their own.
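The direction rule is easy to get backwards, so here is a minimal sketch of the decision made on the accepting side, mirroring the `sid < this.mySid` test in handleConnection above. The class and method names are hypothetical.

```java
// Hypothetical sketch of the "higher sid dials lower sid" rule.
class ConnectionRuleSketch {
    // Accepting side: an incoming connection is kept unless the remote sid is lower than ours.
    static boolean keepIncoming(long remoteSid, long mySid) {
        return !(remoteSid < mySid);
    }

    // For a pair of nodes, the node whose outgoing connection survives is the one with the larger sid.
    static long dialer(long sidA, long sidB) {
        return Math.max(sidA, sidB);
    }

    public static void main(String[] args) {
        long[] sids = {1, 2, 3};
        for (int i = 0; i < sids.length; i++) {
            for (int j = i + 1; j < sids.length; j++) {
                System.out.println("sid " + dialer(sids[i], sids[j])
                        + " keeps its connection to sid " + Math.min(sids[i], sids[j]));
            }
        }
    }
}
```

With three nodes this yields exactly one surviving connection per pair: 2 to 1, 3 to 1 and 3 to 2.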
6. FastLeaderElection election algorithm
When the nodes of the cluster have just started and are in the LOOKING state, Leader election takes place. The process is as described above; let's now follow it through the code.
6.1 Sending a vote that elects the node itself as Leader
The QuorumPeer.run() method performs different operations depending on the state of the current node. Let's look at what happens in the LOOKING state.
public class QuorumPeer extends ZooKeeperThread implements QuorumStats.Provider {
    public void run() {
        while (running) {
            switch (getPeerState()) {
            case LOOKING:
                LOG.info("LOOKING");

                // Read-only mode is not the focus of this analysis and is skipped
                if (Boolean.getBoolean("readonlymode.enabled")) {
                    ...
                } else {
                    try {
                        setBCVote(null);
                        // makeLEStrategy().lookForLeader() is the key call. See 6.1.1
                        setCurrentVote(makeLEStrategy().lookForLeader());
                    } catch (Exception e) {
                        LOG.warn("Unexpected exception", e);
                        setPeerState(ServerState.LOOKING);
                    }
                }
                break;
            ...
            }
        }
    }
}
6.1.1 FastLeaderElection.lookForLeader() leader election
public class FastLeaderElection implements Election {
    public Vote lookForLeader() throws InterruptedException {
        ...
        try {
            HashMap<Long, Vote> recvset = new HashMap<Long, Vote>();
            HashMap<Long, Vote> outofelection = new HashMap<Long, Vote>();
            int notTimeout = finalizeWait;

            synchronized(this){
                // Increment the logical clock once
                logicalclock.incrementAndGet();
                updateProposal(getInitId(), getInitLastLoggedZxid(), getPeerEpoch());
            }

            // Send the vote to the other nodes. See 6.1.2 for details
            sendNotifications();

            while ((self.getPeerState() == ServerState.LOOKING) &&
                    (!stop)){
                // Take a vote from the recvqueue receive queue
                Notification n = recvqueue.poll(notTimeout,
                        TimeUnit.MILLISECONDS);

                if(n == null){
                    if(manager.haveDelivered()){
                        sendNotifications();
                    } else {
                        // No connection to the other nodes of the cluster has been established yet,
                        // so connect first. See 6.1.3 for details
                        manager.connectAll();
                    }

                    int tmpTimeOut = notTimeout*2;
                    notTimeout = (tmpTimeOut < maxNotificationInterval?
                            tmpTimeOut : maxNotificationInterval);
                    LOG.info("Notification time out: " + notTimeout);
                }
                ...
            }
        }
    }
}
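The `n == null` branch above doubles the poll timeout each time no notification arrives, up to a ceiling. A minimal sketch of that schedule follows. The two constants use values the author of this sketch believes Zookeeper 3.4.x uses for finalizeWait and maxNotificationInterval; the excerpt only shows their names, so treat the numbers as assumptions.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the capped doubling of the notification timeout.
class NotificationBackoffSketch {
    static final int FINALIZE_WAIT_MS = 200;                // assumed initial timeout
    static final int MAX_NOTIFICATION_INTERVAL_MS = 60_000; // assumed ceiling

    // Same expression as in lookForLeader(): double the timeout, but never exceed the ceiling.
    static int next(int current) {
        int doubled = current * 2;
        return doubled < MAX_NOTIFICATION_INTERVAL_MS ? doubled : MAX_NOTIFICATION_INTERVAL_MS;
    }

    // Timeouts used by the first `polls` consecutive empty polls.
    static List<Integer> schedule(int polls) {
        List<Integer> out = new ArrayList<>();
        int timeout = FINALIZE_WAIT_MS;
        for (int i = 0; i < polls; i++) {
            out.add(timeout);
            timeout = next(timeout);
        }
        return out;
    }

    public static void main(String[] args) {
        // [200, 400, 800, 1600, 3200, 6400, 12800, 25600, 51200, 60000]
        System.out.println(schedule(10));
    }
}
```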
6.1.2 FastLeaderElection.sendNotifications() sends voting information to other nodes
public class FastLeaderElection implements Election {
    private void sendNotifications() {
        // Send a ToSend vote message to every voting node in the cluster
        for (QuorumServer server : self.getVotingView().values()) {
            long sid = server.id;

            ToSend notmsg = new ToSend(ToSend.mType.notification,
                    proposedLeader,
                    proposedZxid,
                    logicalclock.get(),
                    QuorumPeer.ServerState.LOOKING,
                    sid,
                    proposedEpoch);
            if(LOG.isDebugEnabled()){
                ...
            }
            sendqueue.offer(notmsg);
        }
    }
}
The voting information sent here is not sent synchronously. Instead, the voting information is added to the sendqueue and then sent through the WorkerSender.
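The hand-off through a queue can be sketched as a tiny producer/consumer pair. This is only a model of the pattern: the messages are plain strings, the "wire" is a list, and all names other than sendqueue are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: the caller only enqueues, a separate thread does the sending.
class AsyncSendSketch {
    static List<String> run(List<String> votes) {
        LinkedBlockingQueue<String> sendqueue = new LinkedBlockingQueue<>();
        List<String> wire = Collections.synchronizedList(new ArrayList<String>());
        int expected = votes.size();

        // Stand-in for the sender thread: drains the queue and "sends" each message.
        Thread sender = new Thread(() -> {
            try {
                for (int i = 0; i < expected; i++) {
                    String msg = sendqueue.poll(3000, TimeUnit.MILLISECONDS);
                    if (msg == null) {
                        break; // nothing arrived in time
                    }
                    wire.add(msg);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "sender-sketch");
        sender.start();

        // Stand-in for sendNotifications(): offer() returns immediately.
        for (String v : votes) {
            sendqueue.offer(v);
        }

        try {
            sender.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return new ArrayList<>(wire);
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList("(1,0) to sid 2", "(1,0) to sid 3")));
    }
}
```

The benefit of this design is that the election loop never blocks on a slow or unreachable peer; it only pays the cost of an enqueue.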
6.1.3 QuorumCnxManager.connectAll() creates a connection with other nodes
public class QuorumCnxManager {
    public void connectAll(){
        long sid;
        for(Enumeration<Long> en = queueSendMap.keys();
            en.hasMoreElements();){
            sid = en.nextElement();
            connectOne(sid);
        }
    }

    synchronized public void connectOne(long sid){
        // Only connect if no connection exists yet
        // (an established connection has an entry in senderWorkerMap)
        if (!connectedToPeer(sid)){
            InetSocketAddress electionAddr;
            if (view.containsKey(sid)) {
                electionAddr = view.get(sid).electionAddr;
            } else {
                LOG.warn("Invalid server id: " + sid);
                return;
            }
            try {

                LOG.debug("Opening channel to server " + sid);
                Socket sock = new Socket();
                setSockOpts(sock);
                // Open a plain socket connection
                sock.connect(view.get(sid).electionAddr, cnxTO);
                LOG.debug("Connected to server " + sid);

                if (quorumSaslAuthEnabled) {
                    initiateConnectionAsync(sock, sid);
                } else {
                    // Initialize the connection
                    initiateConnection(sock, sid);
                }
            } ...
        } else {
            LOG.debug("There is a connection already for server " + sid);
        }
    }

    // Initialize the connection
    private boolean startConnection(Socket sock, Long sid)
            throws IOException {
        DataOutputStream dout = null;
        DataInputStream din = null;
        try {
            // Send this node's sid to the remote node
            dout = new DataOutputStream(sock.getOutputStream());
            dout.writeLong(this.mySid);
            dout.flush();

            din = new DataInputStream(
                    new BufferedInputStream(sock.getInputStream()));
        } catch (IOException e) {
            LOG.warn("Ignoring exception reading or writing challenge: ", e);
            closeSocket(sock);
            return false;
        }

        // authenticate learner
        authLearner.authenticate(sock, view.get(sid).hostname);

        // Same rule as before: of two sids, only the node with the larger sid
        // may initiate a connection to the node with the smaller sid
        if (sid > this.mySid) {
            LOG.info("Have smaller server identifier, so dropping the " +
                     "connection: (" + sid + ", " + this.mySid + ")");
            closeSocket(sock);
            // Otherwise proceed with the connection
        } else {
            // Once the connection is established, create its SendWorker and RecvWorker threads
            SendWorker sw = new SendWorker(sock, sid);
            RecvWorker rw = new RecvWorker(sock, din, sid, sw);
            sw.setRecv(rw);

            SendWorker vsw = senderWorkerMap.get(sid);

            if(vsw != null)
                vsw.finish();

            senderWorkerMap.put(sid, sw);
            queueSendMap.putIfAbsent(sid, new ArrayBlockingQueue<ByteBuffer>(SEND_CAPACITY));

            sw.start();
            rw.start();

            return true;

        }
        return false;
    }
}
Q: One question remains about sending the vote for the first time: SendWorker.run() takes the messages it sends from queueSendMap. When is data added to that map?
A: This is left for readers to work out (hint: start with the FastLeaderElection.Messenger class).
Summary: when a node votes for the first time, it proposes itself as Leader and sends that vote to the other nodes in the cluster.
Sending requires a connection first. Every node actively tries to connect to all other nodes, but only a connection initiated from a higher sid to a lower sid is kept.
Sending and receiving are handled by two separate threads (the SendWorker and RecvWorker threads in QuorumCnxManager).
6.2 Receiving and processing votes
6.2.1 Receiving votes from other nodes
RecvWorker is responsible for receiving votes. Let's look directly at its run() method.
class RecvWorker extends ZooKeeperThread {
    public void run() {
        threadCnt.incrementAndGet();
        try {
            while (running && !shutdown && sock != null) {
                int length = din.readInt();
                if (length <= 0 || length > PACKETMAXSIZE) {
                    throw new IOException(
                            "Received packet with invalid packet: "
                                    + length);
                }
                // Read the message body
                byte[] msgArray = new byte[length];
                din.readFully(msgArray, 0, length);
                ByteBuffer message = ByteBuffer.wrap(msgArray);
                // Wrap it as a Message and add it to the QuorumCnxManager.recvQueue queue
                addToRecvQueue(new Message(message.duplicate(), sid));
            }
        } ...
    }
}
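RecvWorker reads length-prefixed messages: a 4-byte length followed by that many bytes. The following sketch reproduces that framing over in-memory streams so it can be run without sockets. The class name, the helper methods and the size limit are assumptions; only the readInt/readFully pattern comes from the code above.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.util.Arrays;

// Hypothetical sketch of length-prefixed framing as read by RecvWorker.
class FramingSketch {
    static final int PACKET_MAX_SIZE = 1024 * 512; // assumed limit, stands in for PACKETMAXSIZE

    // Sender side: 4-byte big-endian length, then the payload.
    static byte[] frame(byte[] payload) {
        ByteBuffer buf = ByteBuffer.allocate(4 + payload.length);
        buf.putInt(payload.length).put(payload);
        return buf.array();
    }

    // Receiver side: same checks and reads as in RecvWorker.run().
    static byte[] readFrame(DataInputStream din) throws IOException {
        int length = din.readInt();
        if (length <= 0 || length > PACKET_MAX_SIZE) {
            throw new IOException("Received packet with invalid length: " + length);
        }
        byte[] msg = new byte[length];
        din.readFully(msg, 0, length); // blocks until all bytes have arrived
        return msg;
    }

    // Frames a payload and reads it back; I/O failures surface as UncheckedIOException.
    static byte[] roundTrip(byte[] payload) {
        try (DataInputStream din = new DataInputStream(new ByteArrayInputStream(frame(payload)))) {
            return readFrame(din);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(roundTrip(new byte[] {1, 2, 3}))); // [1, 2, 3]
    }
}
```

Note that an empty payload is rejected, because the check is `length <= 0`.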
Who processes the messages in the recvQueue queue? It is actually handled by FastLeaderElection.Messenger
6.2.2 Messenger processes the received voting information
protected class Messenger {
    class WorkerReceiver extends ZooKeeperThread {
        public void run() {

            Message response;
            while (!stop) {
                // Sleeps on receive
                try{
                    // Take a vote message received from another node
                    response = manager.pollRecvQueue(3000, TimeUnit.MILLISECONDS);
                    if(response == null) continue;

                    // If the sender's sid is not in the voting view, just reply with
                    // this node's current vote and do nothing else
                    if(!validVoter(response.sid)){
                        Vote current = self.getCurrentVote();
                        ToSend notmsg = new ToSend(ToSend.mType.notification,
                                current.getId(),
                                current.getZxid(),
                                logicalclock.get(),
                                self.getPeerState(),
                                response.sid,
                                current.getPeerEpoch());

                        sendqueue.offer(notmsg);
                    } else {
                        ...
                        Notification n = new Notification();

                        // Parse the state of the sending node
                        QuorumPeer.ServerState ackstate = QuorumPeer.ServerState.LOOKING;
                        switch (response.buffer.getInt()) {
                        case 0:
                            ackstate = QuorumPeer.ServerState.LOOKING;
                            break;
                        case 1:
                            ackstate = QuorumPeer.ServerState.FOLLOWING;
                            break;
                        case 2:
                            ackstate = QuorumPeer.ServerState.LEADING;
                            break;
                        case 3:
                            ackstate = QuorumPeer.ServerState.OBSERVING;
                            break;
                        default:
                            continue;
                        }

                        // Read the proposed leader and the related fields into the Notification
                        n.leader = response.buffer.getLong();
                        n.zxid = response.buffer.getLong();
                        n.electionEpoch = response.buffer.getLong();
                        n.state = ackstate;
                        n.sid = response.sid;
                        if(!backCompatibility){
                            n.peerEpoch = response.buffer.getLong();
                        } else {
                            if(LOG.isInfoEnabled()){
                                LOG.info("Backward compatibility mode, server id=" + n.sid);
                            }
                            n.peerEpoch = ZxidUtils.getEpochFromZxid(n.zxid);
                        }

                        n.version = (response.buffer.remaining() >= 4) ?
                                response.buffer.getInt() : 0x0;

                        ...
                        // If the current node is LOOKING (which it is during initial startup)
                        if(self.getPeerState() == QuorumPeer.ServerState.LOOKING){
                            // Add the Notification built from the other node's vote to recvqueue for later use
                            recvqueue.offer(n);

                            // If the sender is also LOOKING but its election epoch is behind this node's
                            // logical clock, send it this node's current vote
                            if((ackstate == QuorumPeer.ServerState.LOOKING)
                                    && (n.electionEpoch < logicalclock.get())){
                                Vote v = getVote();
                                ToSend notmsg = new ToSend(ToSend.mType.notification,
                                        v.getId(),
                                        v.getZxid(),
                                        logicalclock.get(),
                                        self.getPeerState(),
                                        response.sid,
                                        v.getPeerEpoch());
                                sendqueue.offer(notmsg);
                            }
                        } else {
                            ...
                        }
                    }
                } catch (InterruptedException e) {
                    System.out.println("Interrupted Exception while waiting for new message" +
                            e.toString());
                }
            }
            LOG.info("WorkerReceiver is down");
        }
    }
}
The key operation above is recvqueue.offer(n): the vote received from another node is wrapped in a Notification and added to recvqueue. How is recvqueue used afterwards?
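The field order that WorkerReceiver reads from the buffer (state, leader, zxid, electionEpoch, peerEpoch) can be illustrated with a small encode/decode pair. This is a sketch of the layout as it appears in the excerpt, ignoring the optional version field and the backward-compatibility branch; the class and method names are hypothetical.

```java
import java.nio.ByteBuffer;

// Hypothetical sketch of the notification layout parsed by WorkerReceiver.
class NotificationCodecSketch {
    private static final String[] STATES = {"LOOKING", "FOLLOWING", "LEADING", "OBSERVING"};

    // Writes the fields in the order in which WorkerReceiver reads them.
    static ByteBuffer encode(int state, long leader, long zxid, long electionEpoch, long peerEpoch) {
        ByteBuffer b = ByteBuffer.allocate(4 + 8 * 4);
        b.putInt(state).putLong(leader).putLong(zxid).putLong(electionEpoch).putLong(peerEpoch);
        b.flip(); // switch the buffer from writing to reading
        return b;
    }

    // Returns a readable summary, or null for an unknown state (such messages are skipped above).
    static String decode(ByteBuffer b) {
        int state = b.getInt();
        if (state < 0 || state >= STATES.length) {
            return null;
        }
        long leader = b.getLong();
        long zxid = b.getLong();
        long electionEpoch = b.getLong();
        long peerEpoch = b.getLong();
        return STATES[state] + " leader=" + leader + " zxid=0x" + Long.toHexString(zxid)
                + " electionEpoch=" + electionEpoch + " peerEpoch=" + peerEpoch;
    }

    public static void main(String[] args) {
        // LOOKING leader=3 zxid=0x0 electionEpoch=1 peerEpoch=0
        System.out.println(decode(encode(0, 3, 0, 1, 0)));
    }
}
```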
6.2.3 FastLeaderElection.lookForLeader()
Let's go back to this method, which contains the processing method of Notification.
public class FastLeaderElection implements Election {
    public Vote lookForLeader() throws InterruptedException {
        ...
        while ((self.getPeerState() == ServerState.LOOKING) &&
                (!stop)){
            // Take a Notification from the queue
            Notification n = recvqueue.poll(notTimeout,
                    TimeUnit.MILLISECONDS);

            if(n == null){
                ... // resend or reconnect, as shown in 6.1.1
            }
            else if(validVoter(n.sid) && validVoter(n.leader)) {
                switch (n.state) {
                case LOOKING:
                    // If the election epoch of the received vote is greater than this node's logical clock,
                    // adopt that epoch and clear the votes received so far
                    if (n.electionEpoch > logicalclock.get()) {
                        logicalclock.set(n.electionEpoch);
                        recvset.clear();
                        // Compare (PK) the received vote with this node's initial vote;
                        // the winner becomes the new proposal
                        if(totalOrderPredicate(n.leader, n.zxid, n.peerEpoch,
                                getInitId(), getInitLastLoggedZxid(), getPeerEpoch())) {
                            updateProposal(n.leader, n.zxid, n.peerEpoch);
                        } else {
                            updateProposal(getInitId(),
                                    getInitLastLoggedZxid(),
                                    getPeerEpoch());
                        }
                        sendNotifications();
                    // If the election epoch of the received vote is less than this node's logical clock,
                    // the vote is stale and is ignored
                    } else if (n.electionEpoch < logicalclock.get()) {
                        if(LOG.isDebugEnabled()){
                            LOG.debug("Notification election epoch is smaller than logicalclock. n.electionEpoch = 0x"
                                    + Long.toHexString(n.electionEpoch)
                                    + ", logicalclock=0x" + Long.toHexString(logicalclock.get()));
                        }
                        break;
                    // Same election epoch: compare epoch, zxid and id; if the received vote wins,
                    // adopt it and send it out again
                    } else if (totalOrderPredicate(n.leader, n.zxid, n.peerEpoch,
                            proposedLeader, proposedZxid, proposedEpoch)) {
                        updateProposal(n.leader, n.zxid, n.peerEpoch);
                        sendNotifications();
                    }

                    if(LOG.isDebugEnabled()){
                        LOG.debug("Adding vote: from=" + n.sid +
                                ", proposed leader=" + n.leader +
                                ", proposed zxid=0x" + Long.toHexString(n.zxid) +
                                ", proposed election epoch=0x" + Long.toHexString(n.electionEpoch));
                    }

                    // Record the received vote in recvset
                    recvset.put(n.sid, new Vote(n.leader, n.zxid, n.electionEpoch, n.peerEpoch));

                    // Check whether more than half of the nodes voted for the current proposal.
                    // If so, a Leader has been elected
                    if (termPredicate(recvset,
                            new Vote(proposedLeader, proposedZxid,
                                    logicalclock.get(), proposedEpoch))) {

                        // Keep reading votes for up to finalizeWait ms; if a better vote
                        // shows up, put it back in the queue and continue the election
                        while((n = recvqueue.poll(finalizeWait,
                                TimeUnit.MILLISECONDS)) != null){
                            if(totalOrderPredicate(n.leader, n.zxid, n.peerEpoch,
                                    proposedLeader, proposedZxid, proposedEpoch)){
                                recvqueue.put(n);
                                break;
                            }
                        }

                        // No better vote arrived: wrap the elected leader in a Vote and return it
                        if (n == null) {
                            // If this node's id is the elected leader id, set its state to LEADING,
                            // otherwise to FOLLOWING/OBSERVING
                            self.setPeerState((proposedLeader == self.getId()) ?
                                    ServerState.LEADING: learningState());

                            Vote endVote = new Vote(proposedLeader,
                                    proposedZxid,
                                    logicalclock.get(),
                                    proposedEpoch);
                            leaveInstance(endVote);
                            return endVote;
                        }
                    }
                    break;
                }
            }
        }
    }
}
The code is long and critical. The main process of Leader election is reflected in the above code.
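The three-way branch on electionEpoch in the LOOKING case is the part that is easiest to lose in the long listing. A stripped-down sketch, which tracks only the logical clock and the set of received votes, might look as follows. All names are hypothetical, and vote comparison is deliberately left out.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: how a LOOKING node treats a notification depending on its election epoch.
class ElectionRoundSketch {
    enum Outcome { JOIN_NEWER_ROUND, IGNORE_STALE, SAME_ROUND }

    long logicalClock;
    final Map<Long, Long> recvset = new HashMap<>(); // sender sid -> proposed leader

    ElectionRoundSketch(long logicalClock) {
        this.logicalClock = logicalClock;
    }

    Outcome onNotification(long senderSid, long proposedLeader, long electionEpoch) {
        Outcome outcome;
        if (electionEpoch > logicalClock) {
            // The sender is in a newer round: adopt its epoch and forget the old votes.
            logicalClock = electionEpoch;
            recvset.clear();
            outcome = Outcome.JOIN_NEWER_ROUND;
        } else if (electionEpoch < logicalClock) {
            // Stale vote: it is not recorded at all.
            return Outcome.IGNORE_STALE;
        } else {
            outcome = Outcome.SAME_ROUND;
        }
        recvset.put(senderSid, proposedLeader);
        return outcome;
    }

    public static void main(String[] args) {
        ElectionRoundSketch node = new ElectionRoundSketch(2);
        System.out.println(node.onNotification(1, 1, 2)); // SAME_ROUND
        System.out.println(node.onNotification(2, 3, 1)); // IGNORE_STALE
        System.out.println(node.onNotification(3, 3, 5)); // JOIN_NEWER_ROUND
    }
}
```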
Votes are compared as described earlier, except that the code first compares the epoch, then the ZXID, and finally the sid. The specific code is as follows:
protected boolean totalOrderPredicate(long newId, long newZxid, long newEpoch, long curId, long curZxid, long curEpoch) {
    LOG.debug("id: " + newId + ", proposed id: " + curId + ", zxid: 0x" +
            Long.toHexString(newZxid) + ", proposed zxid: 0x" + Long.toHexString(curZxid));
    if(self.getQuorumVerifier().getWeight(newId) == 0){
        return false;
    }

    return ((newEpoch > curEpoch) ||
            ((newEpoch == curEpoch) &&
            ((newZxid > curZxid) || ((newZxid == curZxid) && (newId > curId)))));
}
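The same ordering can be expressed as a comparator, which makes the priority of the three fields explicit. This sketch leaves out the weight check at the top of totalOrderPredicate, and the class and field names are hypothetical.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch: the (epoch, zxid, id) ordering as a Comparator.
class TotalOrderSketch {
    static final class Candidate {
        final long id;
        final long zxid;
        final long epoch;
        Candidate(long id, long zxid, long epoch) {
            this.id = id;
            this.zxid = zxid;
            this.epoch = epoch;
        }
    }

    // Epoch is compared first, then zxid, then id; the largest candidate wins.
    static final Comparator<Candidate> ORDER =
            Comparator.comparingLong((Candidate c) -> c.epoch)
                      .thenComparingLong(c -> c.zxid)
                      .thenComparingLong(c -> c.id);

    // Id of the candidate that would win every pairwise comparison.
    static long winner(List<Candidate> candidates) {
        return Collections.max(candidates, ORDER).id;
    }

    public static void main(String[] args) {
        // Server 1 has the smaller zxid but the newer epoch, so it wins.
        System.out.println(winner(Arrays.asList(
                new Candidate(3, 5, 1),
                new Candidate(1, 2, 2)))); // 1
    }
}
```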
Summary:
Leader election is one of the difficult parts of Zookeeper. Working through the code involved plenty of twists and turns for the author; call it three stars on the pain index.
Writing this article deepened my understanding, but looking back I still get confused at times, so the code needs to be reread regularly. Stepping through it in a debugger also helps.
If that still feels like too much, just remember the following two points:
1. The machine in the cluster with the newest data (the largest ZXID) is the most likely to become Leader;
2. If all ZXIDs are equal, the machine with the largest sid becomes Leader.