Introduction to the service registry -- ZooKeeper analysis

The previous articles covered RPC calls. This time, let's look at the principles and practice of the service registry, covering the following topics:

  • Functions and design of a registry
  • Choosing an open source registry
  • In-depth analysis of the Nacos registry
  • In-depth analysis of the ZooKeeper implementation

1, Functions and design of the registry

What is a registry?

A registry provides automatic registration and discovery of microservice instances. It is a core infrastructure service in a distributed system.

Without a registry

Consider a scenario in which there is no registry.

As shown in the figure above, several services call ServiceA, which in turn calls ServiceB and ServiceC, and ServiceC runs multiple instances. Without a registry, we can record the node information of each service in a global configuration file, and every service updates its local copy whenever the service information changes. The problem is that when services are added frequently, the copies on different nodes gradually drift apart. The file is then no longer global; it becomes a separate configuration file maintained by each node.

1. Main functions of the registry

The main functions of the registry include:

  • Service registration: records the IP, port, service name, and so on
  • Service discovery: the service caller finds the provider nodes it needs from the registry
  • Health check: the registry monitors and checks the health of service providers
  • Change notification: when a service provider changes, the registry notifies the service callers

(1) Service registration

The service provider publishes its routing information to the registry so that consumers can obtain it, establish a connection with the provider, and initiate calls.

  • Routing information: the IP of the service node, the listening port, and other routing data
  • Service information: serialization protocol, routing rules, node weights, and so on

(2) Service discovery

The service consumer obtains the routing information of provider nodes from the registry. There are three strategies for service discovery:

  • Startup pull: when the consumer starts, it pulls the list of provider nodes from the registry, establishes connections, and makes RPC calls
  • Notification callback: on receiving a change notification from the registry, the consumer fetches the data again and updates its node list
  • Polling pull: a fallback strategy. While running, the consumer periodically pulls the provider node list to refresh its local data
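
The three strategies can live in one consumer-side component. The following is a minimal sketch, not the API of any real registry SDK: `ProviderCache` and its `registryLookup` hook are hypothetical names introduced for illustration, and the lookup is assumed to return the current provider addresses as strings.

```java
import java.util.List;
import java.util.function.Supplier;

// Consumer-side cache of provider addresses (hypothetical class, for illustration).
class ProviderCache {
    // Hypothetical hook that fetches the current provider list from the registry.
    private final Supplier<List<String>> registryLookup;
    private volatile List<String> providers = List.of();

    ProviderCache(Supplier<List<String>> registryLookup) {
        this.registryLookup = registryLookup;
    }

    // Startup pull: load the provider list once before making RPC calls.
    void start() { refresh(); }

    // Notification callback: the registry signalled a change, so fetch the data again.
    void onChangeNotification() { refresh(); }

    // Polling pull (fallback): meant to be called periodically by a scheduled task.
    void pollOnce() { refresh(); }

    // Replace the local list with an immutable snapshot of the registry's view.
    private void refresh() { providers = List.copyOf(registryLookup.get()); }

    List<String> providers() { return providers; }
}
```

In a real consumer, `pollOnce()` would typically be driven by something like `ScheduledExecutorService.scheduleAtFixedRate`, so that a lost notification is eventually corrected by the next poll.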

(3) Health check

Health checks keep the registered node list healthy: failed nodes are removed promptly so that service discovery stays correct.

A service can fail for many reasons, including:

  • Deployment restart
  • Service hang (the process is alive but no longer responds)
  • Abnormal termination

These failures can be handled in the following ways:

  • Heartbeat reporting: covers restarts and abnormal termination, but cannot reliably detect a hung service
  • Service probing: a more advanced registry feature. Services have to be adapted on our side to support the relevant probes.

(4) Change notification

When a service provider node changes, the registry should push the change event, or the changed data, to the service's subscribers immediately. In its internal data structures the registry therefore keeps a subscription list for each service provider and notifies every consumer node subscribed to the service when a provider node changes.

2. Core design of the registry

If we were to implement a registry ourselves, what would it need to include? The analysis above shows that the core is service registration and discovery plus failure handling. This leads to two core design areas:

  • Data storage
  • Timeout handling

(1) Registry storage design

The registry stores two mappings:

  • A <K, V> mapping from service caller to service providers, so that the caller can quickly look up the service it needs to call
  • A <K, V> mapping from service provider to service callers. It exists so that, when a provider changes, event notifications can be sent to the callers subscribed to that service. Without it we would have to scan the first mapping to find the subscribed callers before sending notifications, which hurts performance badly.
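
The two mappings can be sketched with two in-memory maps. This is an illustrative sketch only: `RegistryStore` and `Listener` are hypothetical names, both maps are keyed by service name (an assumption made to keep the example short), and replication between registry nodes is left out.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;

// Hypothetical in-memory registry storage, for illustration only.
class RegistryStore {
    // Callback invoked with the new provider set when a service changes.
    interface Listener { void onChange(String service, Set<String> providers); }

    // service name -> provider addresses (answers "who provides this service?")
    private final Map<String, Set<String>> providers = new ConcurrentHashMap<>();
    // service name -> subscribers (answers "who must be told about a change?")
    private final Map<String, List<Listener>> subscribers = new ConcurrentHashMap<>();

    void register(String service, String address) {
        providers.computeIfAbsent(service, k -> ConcurrentHashMap.newKeySet()).add(address);
        notifySubscribers(service);
    }

    void deregister(String service, String address) {
        Set<String> nodes = providers.get(service);
        if (nodes != null && nodes.remove(address)) {
            notifySubscribers(service);
        }
    }

    // Returns an immutable snapshot of the current providers.
    Set<String> lookup(String service) {
        return Set.copyOf(providers.getOrDefault(service, Set.of()));
    }

    void subscribe(String service, Listener listener) {
        subscribers.computeIfAbsent(service, k -> new CopyOnWriteArrayList<>()).add(listener);
    }

    // The second map makes this a direct lookup instead of a scan over all callers.
    private void notifySubscribers(String service) {
        Set<String> snapshot = lookup(service);
        for (Listener listener : subscribers.getOrDefault(service, List.of())) {
            listener.onChange(service, snapshot);
        }
    }
}
```

A caller would `subscribe("ServiceC", ...)` once and then receive the full provider set on every `register` or `deregister` for that service.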

Key storage system concerns:

  • Data reliability: data is stored redundantly so that a single node failure does not lose data
  • Data consistency: data is synchronized between nodes to keep the replicas consistent
  • Service availability: multiple peer nodes serve external requests

(2) Other features of the registry

Besides service registration and discovery, the registry can also support service governance features:

  • Scaling services out and in
  • Machine migration
  • Weight adjustment
  • Gray (canary) traffic

(3) Thoughts on the registry

CAP theorem: in a distributed system, at most two of C (data consistency), A (service availability) and P (partition tolerance) can be satisfied at the same time.

As a core component of a distributed system, which kind of storage, CP or AP, should the registry choose? The answer still has to start from the business scenario.

From a practical point of view, for service consumers, obtaining a node list that may differ between consumers is clearly better than obtaining no provider list at all. For service providers, having only some nodes serve traffic is clearly better than being completely unavailable. In short, the AP model suits the registry better.

3. Choosing a registry

Considering only CAP when choosing a registry is too one-sided. The choice needs a comprehensive, multi-dimensional evaluation against the actual scenario:

  • Data model
  • Data consistency
  • Health check
  • Performance and capacity
  • Stability
  • Ease of use
  • Cluster scalability
  • Maturity
  • Community activity

(1) Registry comparison

Feature                ZooKeeper         etcd        Consul           Eureka
Service health check   Long connection   Heartbeat   Service status   Configurable
Multi data center      -                 -           Supported        -
KV storage service     Supported         Supported   Supported        -
Consistency            ZAB               Raft        Raft             Weak consistency
Watch                  Supported         Supported   Supported        Long polling
Client access          SDK               HTTP        HTTP & DNS       HTTP
Community activity     Active            Active      Active           Suspended

4. In-depth analysis of the Nacos registry

Nacos is a registry implementation from the Dubbo ecosystem. Its functions include:

  • Service registration and health checks
  • Data model
  • Data consistency guarantees

(1) Health checks in Nacos

Temporary (ephemeral) nodes: heartbeat reporting

Persistent nodes: TCP/HTTP probing

For temporary nodes, Nacos uses heartbeat reporting to check whether the service is alive:

  • The client reports a heartbeat every 5 seconds
  • If no heartbeat is received for 15 seconds, the node is marked unhealthy
  • If no heartbeat is received for more than 30 seconds, the temporary node is removed
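
The thresholds above amount to a small state classification on the registry side. The sketch below is not Nacos code: `HeartbeatMonitor` is a hypothetical class, time is passed in explicitly as milliseconds to keep the example deterministic, and only the 15-second and 30-second limits are taken from the text.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical heartbeat tracker using the 15 s / 30 s thresholds described above.
class HeartbeatMonitor {
    enum Status { HEALTHY, UNHEALTHY, REMOVED }

    static final long UNHEALTHY_AFTER_MS = 15_000; // silent for 15 s: mark unhealthy
    static final long REMOVE_AFTER_MS = 30_000;    // silent for more than 30 s: remove

    // instance id -> timestamp of the last heartbeat received
    private final Map<String, Long> lastBeat = new ConcurrentHashMap<>();

    // Called whenever an instance reports a heartbeat (expected every 5 s).
    void beat(String instance, long nowMs) {
        lastBeat.put(instance, nowMs);
    }

    // Classifies an instance by how long it has been silent.
    Status check(String instance, long nowMs) {
        Long last = lastBeat.get(instance);
        if (last == null) {
            return Status.REMOVED; // never registered or already removed
        }
        long silence = nowMs - last;
        if (silence > REMOVE_AFTER_MS) {
            lastBeat.remove(instance);
            return Status.REMOVED;
        }
        return silence >= UNHEALTHY_AFTER_MS ? Status.UNHEALTHY : Status.HEALTHY;
    }
}
```

With a 5-second reporting interval, an instance has to miss three consecutive heartbeats before it is marked unhealthy, which tolerates an occasional lost packet.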

(2) Data model

Data storage

Nacos stores data in a structure similar to the figure above. A service's providers are divided into multiple clusters, and each cluster contains multiple application instances that provide the service. The advantage is that service availability is better protected.

Data isolation

Four layers of data isolation:

  • Account
  • Namespace
  • Group
  • Service name
(3) Data consistency

CP consistency: ZAB, Raft

AP consistency: Distro

5. In-depth analysis of the ZooKeeper implementation

(1) Node roles

A ZooKeeper cluster is composed of server nodes. Exactly one of them is the leader, which is responsible for handling write requests; the other nodes receive client requests and forward writes to the leader.

  • Leader: handles write requests by initiating a proposal. The write succeeds once more than half of the voting nodes agree to it
  • Follower: answers queries, forwards write requests to the Leader, and takes part in leader elections and write votes
  • Observer: answers queries, forwards write requests to the Leader, does not take part in voting, and only receives the result of each write
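
The "more than half" rule is a strict majority over the voting members, so Observers are not counted. A minimal sketch of the check, where `Quorum` is a hypothetical helper and not a ZooKeeper class:

```java
// Hypothetical helper illustrating the strict-majority rule.
class Quorum {
    // True when acks form a strict majority of the voting members (Leader + Followers).
    static boolean hasMajority(int acks, int votingMembers) {
        return acks > votingMembers / 2;
    }
}
```

For example, 2 acknowledgements are enough in a 3-node ensemble but not in a 4-node one, and in a 2-node ensemble both nodes must agree, so a single surviving node cannot reach a majority on its own.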

(2) Leader election logic

ZooKeeper elects its leader by voting. To become leader, a node must obtain a quorum of votes, that is, more than half of the votes.

Basis of comparison:

  • Epoch: the leader's term
  • ZXID: ZooKeeper transaction ID. The larger the transaction ID, the newer the data
  • SID: the unique number of each node in the cluster
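
Epoch and ZXID are related: in ZooKeeper the 64-bit ZXID carries the epoch in its high 32 bits and a per-epoch counter in its low 32 bits, which is why a larger ZXID means newer data. The helper below is a small sketch of that layout; the class name `Zxid` is made up for this example.

```java
// Hypothetical helper showing the epoch/counter layout of a 64-bit ZXID.
class Zxid {
    // Packs an epoch (high 32 bits) and a counter (low 32 bits) into one long.
    static long make(long epoch, long counter) {
        return (epoch << 32) | (counter & 0xffffffffL);
    }

    // Extracts the epoch from the high 32 bits.
    static long epochOf(long zxid) {
        return zxid >> 32;
    }

    // Extracts the per-epoch transaction counter from the low 32 bits.
    static long counterOf(long zxid) {
        return zxid & 0xffffffffL;
    }
}
```

Because the epoch occupies the high bits, any transaction from a newer epoch compares greater than every transaction from an older epoch, regardless of the counter values.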

Comparison strategy: the vote with the larger epoch wins; if the epochs are equal, the larger ZXID wins; if the ZXIDs are also equal, the larger SID wins.

Referring to the figure above: in general every node has the same epoch, so differing epochs are not considered here.
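
The comparison strategy can be written as a three-level comparison. This is a simplified sketch of the rule as stated above, with a hypothetical class name `VoteOrder`; the corresponding method in ZooKeeper performs additional checks that are omitted here.

```java
// Hypothetical helper implementing the epoch, then ZXID, then SID ordering.
class VoteOrder {
    // Returns true when the new vote should replace the current one.
    static boolean wins(long newEpoch, long newZxid, long newSid,
                        long curEpoch, long curZxid, long curSid) {
        if (newEpoch != curEpoch) {
            return newEpoch > curEpoch; // larger epoch wins
        }
        if (newZxid != curZxid) {
            return newZxid > curZxid;   // same epoch: larger ZXID wins
        }
        return newSid > curSid;         // same ZXID: larger SID wins
    }
}
```

When all nodes start with identical data, epoch and ZXID tie everywhere, so the node with the largest SID among the voters ends up being proposed as leader.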

Let's look at the leader election source code:

/**
 * Starts a new round of leader election. This method is invoked whenever the
 * peer's state changes to LOOKING, and it sends notifications to all other peers.
 */
public Vote lookForLeader() throws InterruptedException {
        try {
            self.jmxLeaderElectionBean = new LeaderElectionBean();
            MBeanRegistry.getInstance().register(self.jmxLeaderElectionBean, self.jmxLocalPeerBean);
        } catch (Exception e) {
            LOG.warn("Failed to register with JMX", e);
            self.jmxLeaderElectionBean = null;
        }

        self.start_fle = Time.currentElapsedTime();
        try {
            /*
             * The votes from the current leader election are stored in recvset. In other words, a vote v is in recvset
             * if v.electionEpoch == logicalclock. The current participant uses recvset to deduce on whether a majority
             * of participants has voted for it.
             */
            Map<Long, Vote> recvset = new HashMap<Long, Vote>();

            /*
             * The votes from previous leader elections, as well as the votes from the current leader election are
             * stored in outofelection. Note that notifications in a LOOKING state are not stored in outofelection.
             * Only FOLLOWING or LEADING notifications are stored in outofelection. The current participant could use
             * outofelection to learn which participant is the leader if it arrives late (i.e., higher logicalclock than
             * the electionEpoch of the received notifications) in a leader election.
             */
            Map<Long, Vote> outofelection = new HashMap<Long, Vote>();

            int notTimeout = minNotificationInterval;

            synchronized (this) {
                // Increment the election round by 1
                logicalclock.incrementAndGet();
                // Initialize our own vote: the first argument is this node's ID (if it is allowed to vote),
                // the second is the largest transaction ID this node has processed, the third is its epoch
                updateProposal(getInitId(), getInitLastLoggedZxid(), getPeerEpoch());
            }

            LOG.info(
                "New election. My id = {}, proposed zxid=0x{}",
                self.getId(),
                Long.toHexString(proposedZxid));
            // Broadcast our vote by placing it on the send queue
            sendNotifications();
            SyncedLearnerTracker voteSet = null;

            /*
             * Loop in which we exchange notifications until we find a leader
             */

            // Keep going while this node is still in the LOOKING state and has not been stopped
            while ((self.getPeerState() == ServerState.LOOKING) && (!stop)) {
                /*
                 * Take the next notification from the receive queue, waiting at most notTimeout ms
                 */
                Notification n = recvqueue.poll(notTimeout, TimeUnit.MILLISECONDS);

                /*
                 * No notification arrived within the timeout
                 */
                if (n == null) {
                    // All queued outgoing messages have been delivered
                    if (manager.haveDelivered()) {
                        // Send our current vote again
                        sendNotifications();
                    } else {
                        // Otherwise (re)connect to all server nodes
                        manager.connectAll();
                    }

                    /*
                     * Exponential backoff
                     */
                    notTimeout = Math.min(notTimeout << 1, maxNotificationInterval);

                    /*
                     * When a leader fails, a follower is expected to take over as leader, but this handover
                     * may be delayed, so the vote set is re-checked here once a timeout occurs.
                     * ZooKeeper's leader election algorithm cannot by itself elect a leader from a single
                     * instance in a two-node configuration.
                     * */
                    if (self.getQuorumVerifier() instanceof QuorumOracleMaj
                            && self.getQuorumVerifier().revalidateVoteset(voteSet, notTimeout != minNotificationInterval)) {
                        setPeerState(proposedLeader, voteSet);
                        Vote endVote = new Vote(proposedLeader, proposedZxid, logicalclock.get(), proposedEpoch);
                        leaveInstance(endVote);
                        return endVote;
                    }

                    LOG.info("Notification time out: {} ms", notTimeout);

                } else if (validVoter(n.sid) && validVoter(n.leader)) {
                    /*
                     * Only proceed if the vote comes from a replica in the current or next
                     * voting view for a replica in the current or next voting view.
                     */
                    switch (n.state) {
                    case LOOKING: // The sender is also electing a leader
                        // Sanity checks
                        if (getInitLastLoggedZxid() == -1) {
                            LOG.debug("Ignoring notification as our zxid is -1");
                            break;
                        }
                        if (n.zxid == -1) {
                            LOG.debug("Ignoring notification from member with -1 zxid {}", n.sid);
                            break;
                        }
                        // The received vote belongs to a newer election round than ours
                        if (n.electionEpoch > logicalclock.get()) {
                            logicalclock.set(n.electionEpoch); // Adopt the newer election round
                            recvset.clear();  // Discard the votes collected in the old round
                            // Compare epoch, zxid and sid in turn (the election rule described above), then update our vote
                            if (totalOrderPredicate(n.leader, n.zxid, n.peerEpoch, getInitId(), getInitLastLoggedZxid(), getPeerEpoch())) {
                                updateProposal(n.leader, n.zxid, n.peerEpoch);
                            } else {
                                updateProposal(getInitId(), getInitLastLoggedZxid(), getPeerEpoch());
                            }
                            // Broadcast the updated vote
                            sendNotifications();
                        } else if (n.electionEpoch < logicalclock.get()) { // Stale vote from an older round: ignore it
                            LOG.debug(
                                "Notification election epoch is smaller than logicalclock. n.electionEpoch = 0x{}, logicalclock=0x{}",
                                Long.toHexString(n.electionEpoch),
                                Long.toHexString(logicalclock.get()));
                            break;
                        } else if (totalOrderPredicate(n.leader, n.zxid, n.peerEpoch, proposedLeader, proposedZxid, proposedEpoch)) {
                            updateProposal(n.leader, n.zxid, n.peerEpoch);
                            sendNotifications();
                        }

                        LOG.debug(
                            "Adding vote: from={}, proposed leader={}, proposed zxid=0x{}, proposed election epoch=0x{}",
                            n.sid,
                            n.leader,
                            Long.toHexString(n.zxid),
                            Long.toHexString(n.electionEpoch));

                        // Record the received vote
                        recvset.put(n.sid, new Vote(n.leader, n.zxid, n.electionEpoch, n.peerEpoch));

                        // Count the votes that agree with our current proposal
                        voteSet = getVoteTracker(recvset, new Vote(proposedLeader, proposedZxid, logicalclock.get(), proposedEpoch));

                        if (voteSet.hasAllQuorums()) {

                            // Check whether any later vote would change the proposed leader
                            while ((n = recvqueue.poll(finalizeWait, TimeUnit.MILLISECONDS)) != null) {
                                if (totalOrderPredicate(n.leader, n.zxid, n.peerEpoch, proposedLeader, proposedZxid, proposedEpoch)) {
                                    recvqueue.put(n);
                                    break;
                                }
                            }

                            /*
                             * This predicate is true once we don't read any new
                             * relevant message from the reception queue
                             */
                            if (n == null) {
                                setPeerState(proposedLeader, voteSet);
                                Vote endVote = new Vote(proposedLeader, proposedZxid, logicalclock.get(), proposedEpoch);
                                leaveInstance(endVote);
                                return endVote;
                            }
                        }
                        break;
                    case OBSERVING:
                        LOG.debug("Notification from observer: {}", n.sid);
                        break;

                        /*
                        * In ZOOKEEPER-3922, we separate the behaviors of FOLLOWING and LEADING.
                        * To avoid the duplication of codes, we create a method called followingBehavior which was used to
                        * shared by FOLLOWING and LEADING. This method returns a Vote. When the returned Vote is null, it follows
                        * the original idea to break swtich statement; otherwise, a valid returned Vote indicates, a leader
                        * is generated.
                        * The reason why we need to separate these behaviors is to make the algorithm runnable for 2-node
                        * setting. An extra condition for generating leader is needed. Due to the majority rule, only when
                        * there is a majority in the voteset, a leader will be generated. However, in a configuration of 2 nodes,
                        * the number to achieve the majority remains 2, which means a recovered node cannot generate a leader which is
                        * the existed leader. Therefore, we need the Oracle to kick in this situation. In a two-node configuration, the Oracle
                        * only grants the permission to maintain the progress to one node. The oracle either grants the permission to the
                        * remained node and makes it a new leader when there is a faulty machine, which is the case to maintain the progress.
                        * Otherwise, the oracle does not grant the permission to the remained node, which further causes a service down.
                        * In the former case, when a failed server recovers and participate in the leader election, it would not locate a
                        * new leader because there does not exist a majority in the voteset. It fails on the containAllQuorum() infinitely due to
                        * two facts. First one is the fact that it does do not have a majority in the voteset. The other fact is the fact that
                        * the oracle would not give the permission since the oracle already gave the permission to the existed leader, the healthy machine.
                        * Logically, when the oracle replies with negative, it implies the existed leader which is LEADING notification comes from is a valid leader.
                        * To threat this negative replies as a permission to generate the leader is the purpose to separate these two behaviors.
                        * */
                    case FOLLOWING:
                        /*
                        * To avoid duplicate codes
                        * */
                        Vote resultFN = receivedFollowingNotification(recvset, outofelection, voteSet, n);
                        if (resultFN == null) {
                            break;
                        } else {
                            return resultFN;
                        }
                    case LEADING:
                        /*
                        * In leadingBehavior(), it performs followingBehvior() first. When followingBehavior() returns
                        * a null pointer, ask Oracle whether to follow this leader.
                        * */
                        Vote resultLN = receivedLeadingNotification(recvset, outofelection, voteSet, n);
                        if (resultLN == null) {
                            break;
                        } else {
                            return resultLN;
                        }
                    default:
                        LOG.warn("Notification state unrecognized: {} (n.state), {}(n.sid)", n.state, n.sid);
                        break;
                    }
                } else {
                    if (!validVoter(n.leader)) {
                        LOG.warn("Ignoring notification for non-cluster member sid {} from sid {}", n.leader, n.sid);
                    }
                    if (!validVoter(n.sid)) {
                        LOG.warn("Ignoring notification for sid {} from non-quorum member sid {}", n.leader, n.sid);
                    }
                }
            }
            return null;
        } finally {
            try {
                if (self.jmxLeaderElectionBean != null) {
                    MBeanRegistry.getInstance().unregister(self.jmxLeaderElectionBean);
                }
            } catch (Exception e) {
                LOG.warn("Failed to unregister with JMX", e);
            }
            self.jmxLeaderElectionBean = null;
            LOG.debug("Number of connection processing threads: {}", manager.getConnectionThreadCount());
        }
}

The election class also has an inner class, Messenger, which contains two worker classes, WorkerReceiver and WorkerSender. WorkerReceiver handles incoming messages and WorkerSender handles outgoing messages.

(3) Data consistency guarantees

ZooKeeper uses the well-known Zab protocol (ZooKeeper Atomic Broadcast).

ZooKeeper guarantees sequential consistency rather than strong consistency.

(4) Data model

ZooKeeper stores data as a tree whose nodes are either persistent nodes or ephemeral (temporary) nodes.

  • DataNode
    • DataNode parent: reference to the parent node
    • byte[] data: the data stored in this node
    • Long acl: ACL permission control
    • StatPersisted stat: persisted node status
    • Set<String> children: list of child nodes
  • DataTree
    • ConcurrentHashMap<String, DataNode> nodes: key is the path and value is the DataNode
    • WatchManager dataWatches: data change notifications
    • WatchManager childWatches: child node change notifications
    • String rootZookeeper: root node
    • Map<Long, HashSet<String>> ephemerals: ephemeral node information. The key is the session ID and the value is the set of paths
  • ZKDatabase
    • DataTree dataTree
    • ConcurrentHashMap<Long, Integer> sessionsWithTimeouts: client session management
    • FileTxnSnapLog snapLog: transaction log

Using ZooKeeper as a registry

  • Service registration: create an ephemeral node
  • Service discovery: query node data
  • Health check: ephemeral nodes, which are removed when the session that created them ends
  • Change subscription: the Watch mechanism
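
How these four pieces fit together can be shown with a small in-memory stand-in. The sketch below does not use the ZooKeeper client API: `MiniTree` and all of its methods are hypothetical, and it models only ephemeral nodes owned by a session, child listing, and one-shot child watches (standard ZooKeeper watches also fire only once).

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical in-memory model of the ZooKeeper features a registry relies on.
class MiniTree {
    private final Map<String, byte[]> nodes = new TreeMap<>();                 // path -> data
    private final Map<Long, Set<String>> ephemerals = new HashMap<>();          // session id -> owned paths
    private final Map<String, List<Runnable>> childWatches = new HashMap<>();   // parent path -> watches

    // Service registration: a provider creates an ephemeral node under the service path.
    synchronized void createEphemeral(long sessionId, String path, byte[] data) {
        nodes.put(path, data);
        ephemerals.computeIfAbsent(sessionId, k -> new HashSet<>()).add(path);
        fireChildWatches(parentOf(path));
    }

    // Service discovery: list direct children and optionally leave a one-shot watch.
    synchronized List<String> getChildren(String parent, Runnable watch) {
        if (watch != null) {
            childWatches.computeIfAbsent(parent, k -> new ArrayList<>()).add(watch);
        }
        String prefix = parent + "/";
        List<String> children = new ArrayList<>();
        for (String path : nodes.keySet()) {
            if (path.startsWith(prefix) && path.indexOf('/', prefix.length()) < 0) {
                children.add(path.substring(prefix.length()));
            }
        }
        return children;
    }

    // Health check: when a session ends, every ephemeral node it owns disappears.
    synchronized void closeSession(long sessionId) {
        Set<String> owned = ephemerals.remove(sessionId);
        if (owned == null) {
            return;
        }
        for (String path : owned) {
            nodes.remove(path);
            fireChildWatches(parentOf(path));
        }
    }

    // Change subscription: watches are removed before they run, so each fires once.
    private void fireChildWatches(String parent) {
        List<Runnable> watches = childWatches.remove(parent);
        if (watches != null) {
            watches.forEach(Runnable::run);
        }
    }

    private static String parentOf(String path) {
        return path.substring(0, path.lastIndexOf('/'));
    }
}
```

Because a watch fires only once, a consumer has to call `getChildren` again with a new watch each time it is notified, otherwise it will miss later changes.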

ZooKeeper's weaknesses

  • ZooKeeper provides sequential consistency, so a read is not guaranteed to return the latest data
  • The service is unavailable during a leader election

Posted on Mon, 11 Oct 2021 18:49:49 -0400 by gordo2dope