Zookeeper cluster and Kafka cluster

1, Zookeeper overview

1. Zookeeper definition

Zookeeper is an open-source Apache project that provides coordination services for distributed frameworks.

2. Zookeeper working mechanism

Understood from a design-pattern perspective, Zookeeper is a distributed service-management framework based on the observer pattern. It stores and manages the data that everyone cares about and accepts registrations from observers; once the state of that data changes, Zookeeper notifies the observers registered on it so they can react accordingly. In short, Zookeeper = file system + notification mechanism.

3. Zookeeper features

(1) Zookeeper: a cluster composed of one Leader and multiple Followers.
(2) As long as more than half of the nodes in the Zookeeper cluster survive, the Zookeeper cluster can serve normally. Therefore, Zookeeper is suitable for installing an odd number of servers.
(3) Global data consistency: each Server saves a copy of the same data. No matter which Server the Client connects to, the data is consistent.
(4) Update requests are executed in sequence. Update requests from the same Client are executed in sequence according to their sending order, that is, first in first out.
(5) Data update is atomic. A data update either succeeds or fails.
(6) Real time. Within a certain time range, the Client can read the latest data.
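Feature (2), the majority rule, is simple arithmetic and can be sketched as follows (a plain illustration, not part of Zookeeper itself). It also shows why an odd number of servers is preferred: 4 nodes tolerate no more failures than 3.

```shell
#!/bin/sh
# Quorum size for an ensemble of N nodes: the cluster can serve
# only while at least (N/2)+1 nodes survive.
quorum() {
    echo $(( $1 / 2 + 1 ))
}
# Number of node failures the ensemble can tolerate.
tolerated_failures() {
    echo $(( $1 - ($1 / 2 + 1) ))
}
for n in 3 4 5; do
    echo "nodes=$n quorum=$(quorum $n) tolerated_failures=$(tolerated_failures $n)"
done
```

Note that 3 and 4 nodes both tolerate only a single failure, which is why the extra fourth server buys nothing.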

4. Zookeeper data structure

The ZooKeeper data model is structured much like the Linux file system: as a whole it can be viewed as a tree, in which each node is called a ZNode. Each ZNode can store up to 1MB of data by default, and each ZNode is uniquely identified by its path.

5. Zookeeper application scenario

The services provided include: unified naming service, unified configuration management, unified cluster management, dynamic uplink and downlink of server nodes, soft load balancing, etc.
● unified naming service
In the distributed environment, it is often necessary to uniformly name applications / services for easy identification. For example, IP is not easy to remember, while domain name is easy to remember.

● unified configuration management
(1) In a distributed environment, configuration file synchronization is very common. Generally, the configuration information of all nodes in a cluster must be consistent, such as in a Kafka cluster; after modifying the configuration file, you want it to be synchronized to each node quickly.
(2) Configuration management can be implemented by ZooKeeper. The configuration information can be written to a Znode on the ZooKeeper. Each client server listens to this Znode. Once the data in Znode is modified, ZooKeeper will notify each client server.
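The watch-and-notify flow above can be sketched with the zkCli.sh client. This is a session sketch, not a runnable script: it assumes a ZooKeeper server reachable on localhost:2181, and the /config znode name is only an example.

```shell
# Connect to a running server
zkCli.sh -server 127.0.0.1:2181

# In session A: create a config znode and set a watch on it
create /config "v1"
get -w /config

# In session B: modify the data. Session A's watch fires with a
# NodeDataChanged event, which is how clients learn of config updates.
set /config "v2"
```

Note that a watch set with `get -w` is one-shot: after it fires, the client must re-read the znode and re-register the watch.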

● unified cluster management
(1) In distributed environment, it is necessary to master the state of each node in real time. Some adjustments can be made according to the real-time status of the node.
(2) ZooKeeper can monitor node status changes in real time. Node information can be written to a ZNode on the ZooKeeper. Monitoring this ZNode can obtain its real-time state changes.

● server dynamic online and offline
The client can gain real-time insight into servers going online and offline.

● soft load balancing
Record the number of accesses of each server in Zookeeper, and let the server with the least number of accesses handle the latest client requests.
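The soft load-balancing idea can be sketched as follows. In a real deployment the access counts would live in znodes; here they are plain shell variables for illustration:

```shell
#!/bin/sh
# Sketch: route the next request to the server with the fewest accesses.
# The server:count pairs stand in for per-server counters kept in znodes.
counts="server1:5 server2:2 server3:7"
pick_least_loaded() {
    echo "$counts" | tr ' ' '\n' | sort -t: -k2 -n | head -1 | cut -d: -f1
}
pick_least_loaded    # -> server2
```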

6. Zookeeper election mechanism

● The election mechanism is launched for the first time (assume a cluster of five servers, so a majority is 3 votes)
(1) Server 1 starts and initiates an election. Server 1 votes for itself. With one vote, less than a majority (3 votes), the election cannot complete, and server 1 stays in the LOOKING state.
(2) Server 2 starts and another election is initiated. Servers 1 and 2 each vote for themselves and exchange vote information. Server 1 finds that server 2's myid is larger than that of its current vote (itself), so it changes its vote to server 2. Now server 1 has 0 votes and server 2 has 2 votes. Still no majority, so the election cannot complete, and servers 1 and 2 remain LOOKING.
(3) Server 3 starts and an election is initiated. Servers 1 and 2 change their votes to server 3. The result: 0 votes for server 1, 0 votes for server 2, and 3 votes for server 3. Server 3 now has a majority and is elected Leader. Servers 1 and 2 change their state to FOLLOWING, and server 3 changes to LEADING.
(4) Server 4 starts and initiates an election. Servers 1, 2 and 3 are no longer LOOKING and will not change their votes. The result of exchanging vote information: 3 votes for server 3, 1 vote for server 4. Server 4 submits to the majority, changes its vote to server 3, and changes its state to FOLLOWING.
(5) Server 5 starts and, like server 4, becomes a Follower.
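The first-launch walkthrough can be sketched as a small script. It assumes five servers started in myid order and simplifies the vote-switching to "every LOOKING server votes for the largest myid started so far":

```shell
#!/bin/sh
# Sketch of the first-launch election among 5 servers (myid 1..5).
total=5
majority=$(( total / 2 + 1 ))
leader=""
for myid in 1 2 3 4 5; do
    if [ -n "$leader" ]; then
        echo "server $myid starts: Leader $leader already elected, state FOLLOWING"
        continue
    fi
    # All started LOOKING servers have switched their vote to the largest
    # myid, so the vote count equals the number of servers started so far.
    votes=$myid
    if [ "$votes" -ge "$majority" ]; then
        leader=$myid
        echo "server $myid starts: $votes votes reach the majority of $majority, state LEADING"
    else
        echo "server $myid starts: $votes votes, below the majority of $majority, state LOOKING"
    fi
done
```

As in the walkthrough, server 3 is the first to collect a majority, so servers 4 and 5 join as Followers no matter how large their myid is.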

● The election mechanism is launched not for the first time
(1) When one of the following two situations occurs to a server in the ZooKeeper cluster, it will start to enter the Leader election:
1) Server initialization started.
2) Unable to maintain connection with the Leader while the server is running.

(2) When a machine enters the Leader election process, the current cluster may also be in the following two states:
① A Leader already exists in the cluster.
When a machine attempts to elect a Leader while one already exists, it will be informed of the current Leader. The machine then only needs to establish a connection with the Leader and synchronize its state.

② The Leader does not exist in the cluster.
Suppose ZooKeeper consists of five servers with SIDs 1, 2, 3, 4 and 5 and ZXIDs 8, 8, 8, 7 and 7, and the server with SID 3 is the Leader. At some point servers 3 and 5 fail, so a Leader election begins.
Election Leader rules:
1. The vote with the larger EPOCH wins outright.
2. If the EPOCH is the same, the vote with the larger transaction ID (ZXID) wins.
3. If the ZXID is also the same, the vote with the larger server ID (SID) wins.

SID: server ID, used to uniquely identify a machine in a ZooKeeper cluster. It must not be duplicated and is identical to myid.
ZXID: transaction ID, used to identify a change of server state. At a given moment the ZXID of each machine in the cluster may not be exactly the same; this depends on how fast each ZooKeeper server processes client "update requests".
Epoch: the code name of each Leader's term. While there is no Leader, the logical clock value is the same within one round of voting, and it is incremented with each round of voting.
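The three election rules can be sketched as a vote comparator. This is a plain illustration; the example tuples reuse the SID/ZXID values from the text, with EPOCH assumed to be 1 for the surviving servers:

```shell
#!/bin/sh
# Compare two votes (EPOCH, ZXID, SID): bigger EPOCH wins; on a tie,
# bigger ZXID wins; on a further tie, bigger SID wins.
# Prints 1 if the first vote wins, 2 otherwise.
compare_votes() {
    e1=$1; z1=$2; s1=$3
    e2=$4; z2=$5; s2=$6
    if [ "$e1" -ne "$e2" ]; then
        if [ "$e1" -gt "$e2" ]; then echo 1; else echo 2; fi
    elif [ "$z1" -ne "$z2" ]; then
        if [ "$z1" -gt "$z2" ]; then echo 1; else echo 2; fi
    else
        if [ "$s1" -gt "$s2" ]; then echo 1; else echo 2; fi
    fi
}
# Surviving servers from the text: SID 1 (ZXID 8), SID 2 (ZXID 8), SID 4 (ZXID 7).
compare_votes 1 8 1  1 8 2    # same EPOCH and ZXID -> larger SID wins: prints 2
compare_votes 1 8 2  1 7 4    # same EPOCH -> larger ZXID wins: prints 1
```

By these rules the election between SIDs 1, 2 and 4 is won by server 2: it ties server 1 on ZXID but has the larger SID, and beats server 4 on ZXID.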

2, Deploy Zookeeper cluster

Prepare 3 servers for Zookeeper cluster
1. Preparation before installation

 #Turn off firewall
systemctl stop firewalld
systemctl disable firewalld
setenforce 0   
#Install JDK
yum install -y java-1.8.0-openjdk java-1.8.0-openjdk-devel
java -version
#Download installation package
cd /opt
wget https://archive.apache.org/dist/zookeeper/zookeeper-3.5.7/apache-zookeeper-3.5.7-bin.tar.gz

#Copy the downloaded installation package to the other two servers
scp /opt/apache-zookeeper-3.5.7-bin.tar.gz
scp /opt/apache-zookeeper-3.5.7-bin.tar.gz

Official download address https://archive.apache.org/dist/zookeeper/
2. Install Zookeeper
① Unzip the package

cd /opt
tar -zxvf apache-zookeeper-3.5.7-bin.tar.gz
mv apache-zookeeper-3.5.7-bin /usr/local/zookeeper-3.5.7

② Modify profile

cd /usr/local/zookeeper-3.5.7/conf/
cp zoo_sample.cfg zoo.cfg

vim zoo.cfg
tickTime=2000   #Communication heartbeat time, heartbeat time between Zookeeper server and client, unit: ms
initLimit=10    #The maximum number of heartbeats that the Leader and Follower can tolerate during initial connection (the number of ticktimes), expressed here as 10*2s
syncLimit=5     #The timeout of synchronous communication between Leader and Follower, which means that if it exceeds 5*2s, Leader thinks Follower is dead and deletes Follower from the server list
dataDir=/usr/local/zookeeper-3.5.7/data      ●Modified: the directory where Zookeeper stores its data; the directory must be created separately
dataLogDir=/usr/local/zookeeper-3.5.7/logs   ●Added: the directory where logs are stored; the directory must be created separately
clientPort=2181   #Client connection port
#Add cluster information

●A is a number representing the server ID. In cluster mode, create a file named myid in the directory specified by dataDir in zoo.cfg; this file contains the value of A. On startup, Zookeeper reads this file and compares its value with the configuration information in zoo.cfg to determine which server it is.
●B is the address of this server.
●C is the port this Follower uses to exchange information with the Leader in the cluster.
●D is the port used to re-elect when the Leader goes down; the servers communicate with each other on this port during the election.
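Putting A, B, C and D together, each cluster line added to zoo.cfg takes the form server.A=B:C:D. A sketch of the section with placeholder hostnames (node1-node3 are assumptions standing in for the real addresses omitted above; 2888 and 3888 are the conventional choices for C and D):

```
server.1=node1:2888:3888
server.2=node2:2888:3888
server.3=node3:2888:3888
```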

③ Copy the configured Zookeeper configuration file to other machines

scp /usr/local/zookeeper-3.5.7/conf/zoo.cfg
scp /usr/local/zookeeper-3.5.7/conf/zoo.cfg

④ Create data directories and log directories on each node

mkdir /usr/local/zookeeper-3.5.7/data
mkdir /usr/local/zookeeper-3.5.7/logs

⑤ Create a myid file in the directory specified by dataDir of each node

echo 1 > /usr/local/zookeeper-3.5.7/data/myid    #on the first server
echo 2 > /usr/local/zookeeper-3.5.7/data/myid    #on the second server
echo 3 > /usr/local/zookeeper-3.5.7/data/myid    #on the third server

⑥ Configure Zookeeper startup script
vim /etc/init.d/zookeeper

#!/bin/bash
#chkconfig:2345 20 90
#description:Zookeeper Service Control Script
ZK_HOME='/usr/local/zookeeper-3.5.7'
case $1 in
start)
	echo "---------- zookeeper start ------------"
	$ZK_HOME/bin/zkServer.sh start
;;
stop)
	echo "---------- zookeeper stop ------------"
	$ZK_HOME/bin/zkServer.sh stop
;;
restart)
	echo "---------- zookeeper restart ------------"
	$ZK_HOME/bin/zkServer.sh restart
;;
status)
	echo "---------- zookeeper status ------------"
	$ZK_HOME/bin/zkServer.sh status
;;
*)
    echo "Usage: $0 {start|stop|restart|status}"
esac

⑦ Set the service to start automatically on boot

chmod +x /etc/init.d/zookeeper
chkconfig --add zookeeper
#Start Zookeeper (run on each node)
service zookeeper start
#View current status
service zookeeper status

All three servers must be started.

3, Kafka

1. Kafka overview

Kafka is a distributed message queue (MQ) based on the publish/subscribe model, mainly used in the field of real-time big data processing.

In a high-concurrency environment, synchronous requests cannot be processed in time and often block. For example, a large number of requests accessing the database concurrently causes row locks and table locks; eventually too many request threads pile up, a "Too many connections" error is triggered, and an avalanche effect follows.
We use message queuing to process requests asynchronously, so as to relieve the pressure of the system. Message queuing is often used in asynchronous processing, traffic peak shaving, application decoupling, message communication and other scenarios.

Common MQ middleware currently includes ActiveMQ, RabbitMQ, RocketMQ, Kafka, etc.

2. Benefits of using message queuing

(1) Decoupling
It allows you to extend or modify the processes on both sides independently, as long as you ensure that they comply with the same interface constraints.

(2) Recoverability
Failure of some components of the system will not affect the whole system. Message queuing reduces the coupling between processes, so even if a process processing messages hangs, the messages added to the queue can still be processed after the system recovers.

(3) Buffer
It helps to control and optimize the speed of data flow through the system and solve the inconsistency between the processing speed of production messages and consumption messages.

(4) Flexibility & peak processing power
In the case of a sharp increase in traffic, applications still need to continue to play a role, but such burst traffic is not common. It would be a huge waste to put resources on standby to handle such peak visits. Using message queuing can make key components withstand the sudden access pressure without completely crashing due to sudden overloaded requests.

(5) Asynchronous communication
Many times, users do not want or need to process messages immediately. Message queuing provides an asynchronous processing mechanism that allows users to put a message on the queue without processing it immediately. Put as many messages into the queue as you want, and then process them when needed.

3. Two modes of message queuing

(1) Point-to-point mode (one-to-one; consumers actively pull data, and the message is cleared after it is received)
The message producer sends the message to the message queue, and then the message consumer takes it out of the message queue and consumes the message. After the message is consumed, there is no storage in the message queue, so it is impossible for the message consumer to consume the consumed message. Message queuing supports the existence of multiple consumers, but for a message, only one consumer can consume.

(2) Publish / subscribe mode (one to many, also known as observer mode. Messages will not be cleared after consumer consumption data)
The message producer (publisher) publishes the message to the topic, and multiple message consumers (subscribers) consume the message at the same time. Unlike the peer-to-peer method, messages published to topic will be consumed by all subscribers.
Publish / subscribe mode defines a one to many dependency between objects, so that whenever the state of an object (target object) changes, all objects (observer object) that depend on it will be notified and updated automatically.

4. Characteristics of Kafka

● high throughput and low latency
Kafka can process hundreds of thousands of messages per second, with latency as low as a few milliseconds. Each topic can be divided into multiple partitions, and a Consumer Group can consume the partitions in parallel to improve load balancing and consumption capacity.

●Scalability
 The Kafka cluster supports hot expansion (nodes can be added without downtime)

●Persistence and reliability
 Messages are persisted to local disk, and data backup is supported to prevent data loss

●Fault tolerance
 Nodes in the cluster are allowed to fail (with multiple replicas: if the replica count is n, then n-1 nodes may fail)

●High concurrency
 Support thousands of clients to read and write at the same time

5. Kafka system architecture

●Broker
A Kafka server is a broker. A cluster consists of multiple brokers, and one broker can hold multiple topics.

●Topic
A topic can be understood as a queue; both producers and consumers are oriented to a topic. It is similar to a database table name or an ES index. Physically, messages of different topics are stored separately.

●Partition
To achieve scalability, a very large topic can be distributed across multiple brokers (i.e. servers). A topic can be divided into one or more partitions, each of which is an ordered queue. Kafka only guarantees that the records within a partition are ordered, not the order across different partitions of a topic.
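The per-partition ordering guarantee follows from routing each key to a fixed partition. As an illustration only (Kafka's default partitioner actually hashes keys with murmur2; cksum is used here just to show the idea that the same key always lands in the same partition):

```shell
#!/bin/sh
# Sketch: map message keys to partitions with a stable hash.
partitions=3
partition_for() {
    key=$1
    # cksum gives a deterministic CRC for the key; mod = partition index
    sum=$(printf '%s' "$key" | cksum | cut -d' ' -f1)
    echo $(( sum % partitions ))
}
for key in user-a user-b user-a; do
    echo "$key -> partition $(partition_for "$key")"
done
```

Both "user-a" messages land in the same partition, so their relative order is preserved; the order between "user-a" and "user-b" messages is not guaranteed.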

4, Deploy Kafka cluster

1. Download the installation package
Official download address: http://kafka.apache.org/downloads.html

cd /opt
wget https://mirrors.tuna.tsinghua.edu.cn/apache/kafka/2.7.1/kafka_2.13-2.7.1.tgz

Install kafka

tar zxvf kafka_2.13-2.7.1.tgz
mv kafka_2.13-2.7.1 /usr/local/kafka

2. Modify profile

cd /usr/local/kafka/config/
cp server.properties{,.bak}

vim server.properties
broker.id=0    ●Line 21: the globally unique ID of each broker; IDs must not repeat, so configure broker.id=1 and broker.id=2 on the other machines
listeners=PLAINTEXT://    ●Line 31: the IP and port to listen on; if each broker's IP needs to be set, modify it separately, otherwise the default configuration can be kept
num.network.threads=3    #Line 42, the number of threads the broker uses to handle network requests; normally no need to modify
num.io.threads=8         #Line 45, the number of threads used for disk I/O; the value should be greater than the number of hard disks
socket.send.buffer.bytes=102400       #Line 48, buffer size of the send socket
socket.receive.buffer.bytes=102400    #Line 51, buffer size of the receive socket
socket.request.max.bytes=104857600    #Line 54, maximum buffer size of a socket request
log.dirs=/usr/local/kafka/logs        #Line 60, the path where Kafka run logs are stored, which is also where the data is stored
num.partitions=1    #Line 65, the default number of partitions for a topic on this broker; overridden by the value specified when the topic is created
num.recovery.threads.per.data.dir=1    #Line 69, the number of threads used to recover and clean up data
log.retention.hours=168    #Line 103, the maximum retention time of a segment (data) file in hours; the default is 7 days, after which it is deleted
log.segment.bytes=1073741824    #Line 110, the maximum size of a segment file, 1G by default; when exceeded, a new segment file is created
zookeeper.connect=,,    ●Line 123: configure the addresses of the Zookeeper cluster

#Copy the configured Kafka to the other machines
scp -r kafka/
scp -r kafka/

3. Modify environment variables

vim /etc/profile
export KAFKA_HOME=/usr/local/kafka
export PATH=$PATH:$KAFKA_HOME/bin

source /etc/profile

4. Configure Kafka startup script
vim /etc/init.d/kafka

#!/bin/bash
#chkconfig:2345 22 88
#description:Kafka Service Control Script
KAFKA_HOME='/usr/local/kafka'
case $1 in
start)
	echo "---------- Kafka start ------------"
	${KAFKA_HOME}/bin/kafka-server-start.sh -daemon ${KAFKA_HOME}/config/server.properties
;;
stop)
	echo "---------- Kafka stop ------------"
	${KAFKA_HOME}/bin/kafka-server-stop.sh
;;
restart)
	$0 stop
	$0 start
;;
status)
	echo "---------- Kafka status ------------"
	count=$(ps -ef | grep kafka | egrep -cv "grep|$$")
	if [ "$count" -eq 0 ];then
        echo "kafka is not running"
    else
        echo "kafka is running"
    fi
;;
*)
    echo "Usage: $0 {start|stop|restart|status}"
esac

5. Set startup and self startup

chmod +x /etc/init.d/kafka
chkconfig --add kafka
#Start Kafka (run on each node)
service kafka start

6. Kafka command line operation

Create a topic
kafka-topics.sh --create --zookeeper,, --replication-factor 2 --partitions 3 --topic test

--zookeeper: defines the Zookeeper cluster server addresses; multiple IP addresses are separated by commas (usually one IP is enough)
--replication-factor: defines the number of partition replicas; 1 means a single replica, 2 is recommended
--partitions: defines the number of partitions
--topic: defines the topic name

View all topics in the current server
kafka-topics.sh --list --zookeeper,, 

View the details of a topic
kafka-topics.sh  --describe --zookeeper,, 

Publish messages
kafka-console-producer.sh --broker-list,,  --topic test

Consume messages
kafka-console-consumer.sh --bootstrap-server,, --topic test --from-beginning

--from-beginning: reads all historical data in the topic from the start

Modify the number of partitions
kafka-topics.sh --zookeeper,, --alter --topic test --partitions 6

Delete a topic
kafka-topics.sh --delete --zookeeper,, --topic test


5, Deploy Zookeeper+Kafka cluster

1. Deploy Filebeat

cd /usr/local/filebeat

vim filebeat.yml
- type: log
  enabled: true
  paths:
    - /etc/httpd/logs/access_log

#Add configuration output to Kafka
output.kafka:
  enabled: true
  hosts: ["","",""]    #Specify the Kafka cluster addresses
  topic: "kafka_test"    #Specify the topic for Kafka

2. Deploy ELK and create a new Logstash configuration file on the node where the Logstash component is located

vim filebeat.conf
input {
    kafka {
        bootstrap_servers => ",,"
        topics  => "kafka_test"
        group_id => "test123"
        auto_offset_reset => "earliest"
    }
}

output {
    elasticsearch {
        hosts => [""]
        index => "filebeat_test-%{+YYYY.MM.dd}"
    }
    stdout {
        codec => rubydebug
    }
}
Open Kibana in a browser and log in, click the "Create Index Pattern" button to add the index "filebeat_test-*", click the "create" button, then click the "Discover" button to view the chart and log information.

Tags: Big Data kafka Zookeeper

Posted on Mon, 04 Oct 2021 20:31:01 -0400 by lttldude9