1, Zookeeper overview
1. Zookeeper definition
Zookeeper is an open-source Apache project that provides coordination services for distributed frameworks.
2. Zookeeper working mechanism
Understood from a design-pattern perspective, Zookeeper is a distributed service-management framework based on the observer pattern. It stores and manages the data that the participating services care about, and accepts registrations from observers. Whenever the state of that data changes, Zookeeper notifies every observer registered with it so they can react accordingly. In short: Zookeeper = file system + notification mechanism.
3. Zookeeper features
(1) A Zookeeper cluster consists of one Leader and multiple Followers.
(2) As long as more than half of the nodes in the Zookeeper cluster are alive, the cluster can serve normally. For this reason, Zookeeper is best deployed on an odd number of servers.
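The majority rule above is simple arithmetic: a quorum is floor(n/2)+1, so adding a fourth node raises the quorum but not the number of tolerable failures. A quick shell check:

```shell
# Quorum arithmetic behind the odd-number advice: a majority of n servers is
# floor(n/2)+1, so a 4th node raises the quorum but not the failure tolerance.
quorum()    { echo $(( $1 / 2 + 1 )); }
tolerated() { echo $(( $1 - ($1 / 2 + 1) )); }   # node failures survivable

for n in 3 4 5 6; do
    echo "n=$n quorum=$(quorum $n) tolerates=$(tolerated $n)"
done
```

Three and four nodes both tolerate exactly one failure, so the fourth server buys nothing but cost.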
(3) Global data consistency: each Server saves a copy of the same data. No matter which Server the Client connects to, the data is consistent.
(4) Update requests are executed in sequence. Update requests from the same Client are executed in sequence according to their sending order, that is, first in first out.
(5) Data update is atomic. A data update either succeeds or fails.
(6) Real-time: within a certain time window, a Client can read the latest data.
4. Zookeeper data structure
The ZooKeeper data model is structured much like the Linux file system: as a whole it is a tree, and each node in it is called a ZNode. Each ZNode can store up to 1 MB of data by default and is uniquely identified by its path.
5. Zookeeper application scenario
The services provided include: unified naming service, unified configuration management, unified cluster management, dynamic online/offline of server nodes, soft load balancing, and so on.
● unified naming service
In the distributed environment, it is often necessary to uniformly name applications / services for easy identification. For example, IP is not easy to remember, while domain name is easy to remember.
● unified configuration management
(1) In a distributed environment, configuration-file synchronization is very common. Generally, the configuration of all nodes in a cluster must be consistent, as in a Kafka cluster; after modifying a configuration file, you want the change to be synchronized to every node quickly.
(2) Configuration management can be implemented by ZooKeeper. The configuration information can be written to a Znode on the ZooKeeper. Each client server listens to this Znode. Once the data in Znode is modified, ZooKeeper will notify each client server.
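The watch flow described above can be sketched as a toy shell simulation. No real ZooKeeper is involved here, and real watches are one-shot (they must be re-registered after firing), which this sketch omits for brevity:

```shell
# Toy simulation of the znode-watch flow; no real ZooKeeper involved. Real
# watches are one-shot and must be re-registered, which this sketch omits.
ZNODE_DATA="kafka.retention=168h"       # the "znode" holding shared config
WATCHERS=()
NOTIFIED=0

register_watch() { WATCHERS+=("$1"); }  # a client server registers a watcher

set_data() {                            # someone updates the config znode...
    ZNODE_DATA="$1"
    for w in "${WATCHERS[@]}"; do       # ...and every watcher is notified
        "$w"
    done
}

on_change() { NOTIFIED=$(( NOTIFIED + 1 )); }   # client reacts to the change

register_watch on_change                # two client servers listening
register_watch on_change
set_data "kafka.retention=72h"
echo "notified=$NOTIFIED data=$ZNODE_DATA"
```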
● unified cluster management
(1) In distributed environment, it is necessary to master the state of each node in real time. Some adjustments can be made according to the real-time status of the node.
(2) ZooKeeper can monitor node status changes in real time. Node information can be written to a ZNode on the ZooKeeper. Monitoring this ZNode can obtain its real-time state changes.
● server dynamic online and offline
Clients can learn in real time when servers come online or go offline.
● soft load balancing
Record the number of accesses of each server in Zookeeper, and let the server with the fewest accesses handle the newest client request.
6. Zookeeper election mechanism
● The election mechanism at first startup
(1) Server 1 starts and initiates an election. Server 1 votes for itself. With one vote, less than a majority (3 votes), the election cannot complete, and server 1 stays in the LOOKING state.
(2) Server 2 starts and initiates another election. Servers 1 and 2 each vote for themselves and exchange vote information. Server 1 finds that server 2's myid is larger than that of its current vote (server 1), so it changes its vote to server 2. Now server 1 has 0 votes and server 2 has 2 votes. Without a majority, the election cannot complete, and servers 1 and 2 both stay LOOKING.
(3) Server 3 starts and initiates an election. Servers 1 and 2 both change their votes to server 3. Result: 0 votes for server 1, 0 votes for server 2, 3 votes for server 3. Server 3 now holds a majority and is elected Leader. Servers 1 and 2 change their state to FOLLOWING, and server 3 changes its state to LEADING.
(4) Server 4 starts and initiates an election. Servers 1, 2 and 3 are no longer in the LOOKING state and will not change their votes. Result of the exchange: 3 votes for server 3, 1 vote for server 4. Server 4 obeys the majority, changes its vote to server 3, and changes its state to FOLLOWING.
(5) Server 5 starts and, like server 4, becomes a Follower.
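The walkthrough above can be replayed as a small shell sketch: every newly started LOOKING server backs the largest myid it has seen, and the first candidate to collect a majority (3 of 5) is locked in as Leader:

```shell
# Replay of the first-start election for a 5-node ensemble: every LOOKING
# server backs the largest myid seen so far, and the first candidate to
# collect a majority (3 of 5) is locked in as Leader.
N=5
MAJORITY=$(( N / 2 + 1 ))
LEADER=""
started=0

for myid in 1 2 3 4 5; do
    started=$(( started + 1 ))
    if [ -z "$LEADER" ]; then
        candidate=$myid                 # all LOOKING servers switch their vote
        if [ "$started" -ge "$MAJORITY" ]; then
            LEADER=$candidate           # majority reached: Leader elected
        fi
    fi
    echo "server $myid up: leader=${LEADER:-none (still LOOKING)}"
done
```

Servers 4 and 5 start after the Leader is fixed, so they simply follow server 3, just as in the text.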
● The election mechanism when it is not the first startup
(1) When one of the following two situations occurs to a server in the ZooKeeper cluster, it will start to enter the Leader election:
1) Server initialization started.
2) Unable to maintain connection with the Leader while the server is running.
(2) When a machine enters the Leader election process, the current cluster may also be in the following two states:
① A Leader already exists in the cluster.
When a machine attempts to elect a Leader while a Leader already exists, it is informed of the current Leader's information. It only needs to establish a connection with the Leader machine and synchronize its state.
② The Leader does not exist in the cluster.
Suppose the ZooKeeper cluster consists of five servers with SIDs 1, 2, 3, 4 and 5 and ZXIDs 8, 8, 8, 7 and 7, and that the server with SID 3 is the Leader. At some point servers 3 and 5 fail, so a Leader election begins.
Leader election rules:
1. The vote with the larger EPOCH wins outright.
2. With equal EPOCHs, the vote with the larger transaction ID (ZXID) wins.
3. With equal transaction IDs, the vote with the larger server ID (SID) wins.
SID: server ID, used to uniquely identify a machine in a ZooKeeper cluster. It must not be duplicated and is identical to myid.
ZXID: transaction ID, used to identify a change of server state. At a given moment the ZXID of each machine in the cluster may not be exactly the same; this depends on how quickly each ZooKeeper server processes client "update requests".
Epoch: the number of each Leader's term of office. When there is no Leader, the logical clock value is the same within a round of voting, and the value increases after each round.
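The three rules amount to a single tuple comparison on (EPOCH, ZXID, SID). Replaying the scenario above, with servers 3 and 5 down the surviving votes are (epoch, zxid, sid) = (1, 8, 1), (1, 8, 2), (1, 7, 4); the shared epoch value of 1 is an assumption for illustration:

```shell
# The three election rules as one tuple comparison: (EPOCH, ZXID, SID).
beats() {   # beats e1 z1 s1 e2 z2 s2 -> success if the first vote wins
    if   [ "$1" -ne "$4" ]; then [ "$1" -gt "$4" ]   # rule 1: larger EPOCH
    elif [ "$2" -ne "$5" ]; then [ "$2" -gt "$5" ]   # rule 2: larger ZXID
    else                         [ "$3" -gt "$6" ]   # rule 3: larger SID
    fi
}

# Scenario from the text: servers 3 and 5 are down, leaving votes
# (1,8,1), (1,8,2), (1,7,4); the epoch value of 1 is assumed for illustration.
best="1 8 1"
for vote in "1 8 2" "1 7 4"; do
    if beats $vote $best; then best=$vote; fi
done
echo "elected SID: ${best##* }"
```

Server 4 loses on ZXID despite its larger SID, and server 2 beats server 1 on the SID tiebreaker, so SID 2 is elected.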
2, Deploy Zookeeper cluster
Prepare 3 servers for Zookeeper cluster
192.168.50.20
192.168.50.37
192.168.50.40
1. Preparation before installation
#Turn off the firewall
systemctl stop firewalld
systemctl disable firewalld
setenforce 0

#Install the JDK
yum install -y java-1.8.0-openjdk java-1.8.0-openjdk-devel
java -version

#Download the installation package
cd /opt
wget https://archive.apache.org/dist/zookeeper/zookeeper-3.5.7/apache-zookeeper-3.5.7-bin.tar.gz
#Copy the downloaded installation package to the other two servers
scp /opt/apache-zookeeper-3.5.7-bin.tar.gz 192.168.50.37:/opt/
scp /opt/apache-zookeeper-3.5.7-bin.tar.gz 192.168.50.40:/opt/
Official download address https://archive.apache.org/dist/zookeeper/
2. Install Zookeeper
① Unzip the package
cd /opt
tar -zxvf apache-zookeeper-3.5.7-bin.tar.gz
mv apache-zookeeper-3.5.7-bin /usr/local/zookeeper-3.5.7
② Modify the configuration file
cd /usr/local/zookeeper-3.5.7/conf/
cp zoo_sample.cfg zoo.cfg
vim zoo.cfg

tickTime=2000    #Heartbeat interval between Zookeeper servers and clients, unit: ms
initLimit=10    #Maximum number of heartbeats (in tickTime counts) the Leader and Followers may take for the initial connection, here 10*2s
syncLimit=5    #Timeout of synchronous communication between Leader and Follower; beyond 5*2s the Leader considers the Follower dead and removes it from the server list
dataDir=/usr/local/zookeeper-3.5.7/data    ●Modify: the directory where Zookeeper data is saved; it must be created separately
dataLogDir=/usr/local/zookeeper-3.5.7/logs    ●Add: the directory where logs are stored; it must be created separately
clientPort=2181    #Client connection port
#Add cluster information
server.1=192.168.50.20:3188:3288
server.2=192.168.50.37:3188:3288
server.3=192.168.50.40:3188:3288
server.A=B:C:D
●A is a number indicating the server ID. In cluster mode, you must create a file named myid in the directory specified by dataDir in zoo.cfg; the file contains the value of A. At startup, Zookeeper reads this file and compares its value with the configuration in zoo.cfg to determine which server it is.
●B is the address of this server.
●C is the port this server, as a Follower, uses to exchange information with the Leader of the cluster.
●D is the port used for mutual communication during a new election when the cluster's Leader goes down.
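Since a mismatch between a node's myid file and its server.N line is a common misconfiguration, here is a small sanity-check sketch (check_myid is a hypothetical helper, not a ZooKeeper tool; it is demonstrated on temporary files, so point it at the real paths on an actual node):

```shell
# Sanity-check sketch (hypothetical helper, not part of ZooKeeper): the number
# in a node's myid file must match one server.N line in zoo.cfg.
check_myid() {   # check_myid <zoo.cfg> <myid-file>
    local id
    id=$(cat "$2")
    if grep -q "^server\.${id}=" "$1"; then
        echo "myid $id: OK"
    else
        echo "myid $id: no matching server.$id line"
    fi
}

# Demonstrated on temporary files instead of the real paths:
cfg=$(mktemp); idf=$(mktemp)
printf 'server.1=192.168.50.20:3188:3288\nserver.2=192.168.50.37:3188:3288\nserver.3=192.168.50.40:3188:3288\n' > "$cfg"
echo 2 > "$idf"
check_myid "$cfg" "$idf"
rm -f "$cfg" "$idf"
```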
③ Copy the configured Zookeeper configuration file to other machines
scp /usr/local/zookeeper-3.5.7/conf/zoo.cfg 192.168.50.37:/usr/local/zookeeper-3.5.7/conf/ scp /usr/local/zookeeper-3.5.7/conf/zoo.cfg 192.168.50.40:/usr/local/zookeeper-3.5.7/conf/
④ Create data directories and log directories on each node
mkdir /usr/local/zookeeper-3.5.7/data mkdir /usr/local/zookeeper-3.5.7/logs
⑤ Create a myid file in the directory specified by dataDir of each node
echo 1 > /usr/local/zookeeper-3.5.7/data/myid    #on 192.168.50.20
echo 2 > /usr/local/zookeeper-3.5.7/data/myid    #on 192.168.50.37
echo 3 > /usr/local/zookeeper-3.5.7/data/myid    #on 192.168.50.40
⑥ Configure Zookeeper startup script
vim /etc/init.d/zookeeper
#!/bin/bash
#chkconfig:2345 20 90
#description:Zookeeper Service Control Script
ZK_HOME='/usr/local/zookeeper-3.5.7'
case $1 in
start)
	echo "---------- zookeeper start ------------"
	$ZK_HOME/bin/zkServer.sh start
;;
stop)
	echo "---------- zookeeper stop ------------"
	$ZK_HOME/bin/zkServer.sh stop
;;
restart)
	echo "---------- zookeeper restart ------------"
	$ZK_HOME/bin/zkServer.sh restart
;;
status)
	echo "---------- zookeeper status ------------"
	$ZK_HOME/bin/zkServer.sh status
;;
*)
	echo "Usage: $0 {start|stop|restart|status}"
esac
⑦ Set up auto-start on boot
chmod +x /etc/init.d/zookeeper
chkconfig --add zookeeper

#Start Zookeeper (all three servers must be started)
service zookeeper start

#View the current status
service zookeeper status
3, Kafka
1. Kafka overview
Kafka is a distributed message queue (MQ) based on the publish/subscribe model, mainly used for real-time processing of big data.
In a high-concurrency environment, synchronous requests often cannot be processed in time and become blocked. For example, a large number of requests accessing the database concurrently causes row locks and table locks; eventually too many request threads pile up, a "too many connections" error is triggered, and an avalanche effect follows.
We use message queuing to process requests asynchronously, so as to relieve the pressure of the system. Message queuing is often used in asynchronous processing, traffic peak shaving, application decoupling, message communication and other scenarios.
Common MQ middleware currently includes ActiveMQ, RabbitMQ, RocketMQ, Kafka, and so on.
2. Benefits of using message queuing
(1) Decoupling
It allows you to extend or modify the processes on both sides independently, as long as you ensure that they comply with the same interface constraints.
(2) Recoverability
Failure of some components of the system will not affect the whole system. Message queuing reduces the coupling between processes, so even if a process processing messages hangs, the messages added to the queue can still be processed after the system recovers.
(3) Buffer
It helps to control and optimize the speed of data flow through the system and solve the inconsistency between the processing speed of production messages and consumption messages.
(4) Flexibility & peak processing power
In the case of a sharp increase in traffic, applications still need to continue to play a role, but such burst traffic is not common. It would be a huge waste to put resources on standby to handle such peak visits. Using message queuing can make key components withstand the sudden access pressure without completely crashing due to sudden overloaded requests.
(5) Asynchronous communication
Many times, users do not want or need to process messages immediately. Message queuing provides an asynchronous processing mechanism that allows users to put a message on the queue without processing it immediately. Put as many messages into the queue as you want, and then process them when needed.
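The buffering and peak-shaving benefits above boil down to one behavior: a burst is absorbed by the queue and drained later at the consumer's own pace. A toy shell illustration, using a plain file as a stand-in for the broker:

```shell
# A burst of 10 requests is absorbed by the queue (a plain file standing in
# for the broker) and drained later at the consumer's own pace; nothing is lost.
QUEUE=$(mktemp)
for i in $(seq 1 10); do
    echo "request-$i" >> "$QUEUE"      # producer: sudden traffic peak
done

consumed=0
while read -r msg; do                  # consumer: drains after the burst
    consumed=$(( consumed + 1 ))
done < "$QUEUE"
echo "consumed=$consumed"
rm -f "$QUEUE"
```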
3. Two modes of message queuing
(1) Point-to-point mode (one-to-one: consumers actively pull data, and the message is cleared after it is received)
The message producer sends a message to the queue; the message consumer then takes it out of the queue and consumes it. Once consumed, a message is no longer stored in the queue, so a consumer cannot consume a message that has already been consumed. A queue supports multiple consumers, but any single message can be consumed by only one of them.
(2) Publish/subscribe mode (one-to-many, also known as the observer pattern; messages are not cleared after consumers read them)
The message producer (publisher) publishes a message to a topic, and multiple message consumers (subscribers) consume the message at the same time. Unlike the point-to-point mode, a message published to a topic is consumed by all of its subscribers.
Publish / subscribe mode defines a one to many dependency between objects, so that whenever the state of an object (target object) changes, all objects (observer object) that depend on it will be notified and updated automatically.
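The two delivery semantics can be contrasted with toy arithmetic: in point-to-point, each message is consumed exactly once in total, while in publish/subscribe every subscriber receives its own copy:

```shell
# Toy contrast of the two modes. Point-to-point: consumers A and B share one
# queue; each message is delivered to exactly one of them and then removed.
msgs=(m1 m2 m3)
a=0; b=0
for i in "${!msgs[@]}"; do
    if [ $(( i % 2 )) -eq 0 ]; then a=$(( a + 1 )); else b=$(( b + 1 )); fi
done
p2p_total=$(( a + b ))                 # every message consumed exactly once

# Publish/subscribe: each of the 2 subscribers receives a copy of all 3 messages.
subs=2
pubsub_deliveries=$(( subs * ${#msgs[@]} ))
echo "p2p: A=$a B=$b total=$p2p_total; pub/sub deliveries=$pubsub_deliveries"
```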
4. Characteristics of Kafka
● high throughput and low latency
Kafka can process hundreds of thousands of messages per second, and its latency is as low as a few milliseconds. Each topic can be divided into multiple partitions. The Consumer Group can consume partitions to improve load balancing and consumption capacity.
●Scalability: the Kafka cluster supports hot expansion.
●Persistence and reliability: messages are persisted to local disk, and data backup is supported to prevent data loss.
●Fault tolerance: nodes in the cluster are allowed to fail (with n replicas, up to n-1 node failures are tolerated).
●High concurrency: thousands of clients can read and write at the same time.
5. Kafka system architecture
(1) Broker
A Kafka server is a broker. A cluster consists of multiple brokers, and one broker can hold multiple topics.
(2) Topic
It can be understood as a queue; both producers and consumers are oriented to a topic.
It is similar to a database table name or an ES index.
Physically, messages of different topics are stored separately.
(3) Partition
To achieve scalability, a very large topic can be distributed across multiple brokers (i.e. servers). A topic can be divided into one or more partitions, and each partition is an ordered queue. Kafka only guarantees that records within a partition are ordered, not the order across different partitions of a topic.
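Per-partition ordering can be illustrated with a toy router. Key length modulo 3 stands in for Kafka's real partitioner here, which is purely an assumption for illustration:

```shell
# Toy partition router: key length % 3 stands in for Kafka's real partitioner
# (an assumption for illustration). Records with the same key land in the same
# partition file and keep their append order; there is no order across files.
dir=$(mktemp -d)
for rec in user:a user:b order:1 order:2 click:x; do
    key=${rec%%:*}
    p=$(( ${#key} % 3 ))                   # choose one of 3 partitions by key
    echo "$rec" >> "$dir/partition-$p"
done
p1=$(paste -sd' ' "$dir/partition-1")      # all "user" records, in send order
echo "partition-1: $p1"
rm -r "$dir"
```

Both "user" records land in the same partition and stay in the order they were sent; the "order" and "click" records sit in another partition with no ordering guarantee relative to them.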
4, Deploy Kafka cluster
1. Download the installation package
Official download address: http://kafka.apache.org/downloads.html
cd /opt
wget https://mirrors.tuna.tsinghua.edu.cn/apache/kafka/2.7.1/kafka_2.13-2.7.1.tgz
Install Kafka

tar zxvf kafka_2.13-2.7.1.tgz
mv kafka_2.13-2.7.1 /usr/local/kafka
2. Modify profile
cd /usr/local/kafka/config/
cp server.properties{,.bak}
vim server.properties

broker.id=0    ●Line 21: the globally unique broker ID; it must not repeat, so configure broker.id=1 and broker.id=2 on the other machines
listeners=PLAINTEXT://192.168.50.20:9092    ●Line 31: the IP and port to listen on; each broker must be set to its own IP (192.168.50.37 and 192.168.50.40 on the others)
num.network.threads=3    #Line 42: number of threads the broker uses to handle network requests; usually no need to modify
num.io.threads=8    #Line 45: number of threads used for disk I/O; the value should be greater than the number of disks
socket.send.buffer.bytes=102400    #Line 48: send buffer size of the socket
socket.receive.buffer.bytes=102400    #Line 51: receive buffer size of the socket
socket.request.max.bytes=104857600    #Line 54: maximum request size accepted by the socket
log.dirs=/usr/local/kafka/logs    #Line 60: the path where Kafka run logs are stored, which is also where the data is stored
num.partitions=1    #Line 65: default number of partitions for a topic on this broker; overridden by the value specified when the topic is created
num.recovery.threads.per.data.dir=1    #Line 69: number of threads used to recover and clean data
log.retention.hours=168    #Line 103: maximum retention time of a segment (data) file in hours; the default is 7 days, after which it is deleted
log.segment.bytes=1073741824    #Line 110: maximum size of a segment file, 1G by default; a new segment file is created when it is exceeded
zookeeper.connect=192.168.50.20:2181,192.168.50.37:2181,192.168.50.40:2181    ●Line 123: configure the addresses for connecting to the Zookeeper cluster
#Copy the configured Kafka directory to the other machines
scp -r kafka/ 192.168.50.37:/usr/local
scp -r kafka/ 192.168.50.40:/usr/local
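After the copy, broker.id and listeners still have to be edited per node. A hedged helper sketch (the function name set_broker is made up, not a Kafka tool; it is demonstrated on a temporary file rather than the real server.properties):

```shell
# Hedged helper sketch (set_broker is a made-up name, not a Kafka tool):
# rewrites broker.id and the listener IP, which must differ on every node.
set_broker() {   # set_broker <server.properties> <broker-id> <ip>
    sed -i "s/^broker\.id=.*/broker.id=$2/" "$1"
    sed -i "s|^listeners=PLAINTEXT://.*|listeners=PLAINTEXT://$3:9092|" "$1"
}

# Demonstrated on a temporary copy rather than the real config file:
f=$(mktemp)
printf 'broker.id=0\nlisteners=PLAINTEXT://192.168.50.20:9092\n' > "$f"
set_broker "$f" 1 192.168.50.37
cat "$f"
```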
3. Modify environment variables
vim /etc/profile

export KAFKA_HOME=/usr/local/kafka
export PATH=$PATH:$KAFKA_HOME/bin

source /etc/profile
4. Configure Kafka startup script
vim /etc/init.d/kafka
#!/bin/bash
#chkconfig:2345 22 88
#description:Kafka Service Control Script
KAFKA_HOME='/usr/local/kafka'
case $1 in
start)
	echo "---------- Kafka start ------------"
	${KAFKA_HOME}/bin/kafka-server-start.sh -daemon ${KAFKA_HOME}/config/server.properties
;;
stop)
	echo "---------- Kafka stop ------------"
	${KAFKA_HOME}/bin/kafka-server-stop.sh
;;
restart)
	$0 stop
	$0 start
;;
status)
	echo "---------- Kafka status ------------"
	count=$(ps -ef | grep kafka | egrep -cv "grep|$$")
	if [ "$count" -eq 0 ];then
		echo "kafka is not running"
	else
		echo "kafka is running"
	fi
;;
*)
	echo "Usage: $0 {start|stop|restart|status}"
esac
5. Set up auto-start on boot
chmod +x /etc/init.d/kafka
chkconfig --add kafka

#Start Kafka on each server
service kafka start
6. Kafka command line operation
#Create a topic
kafka-topics.sh --create --zookeeper 192.168.50.20:2181,192.168.50.37:2181,192.168.50.40:2181 --replication-factor 2 --partitions 3 --topic test
-------------------------------------------------------------------------------------
--zookeeper: defines the Zookeeper cluster server addresses; multiple IPs are separated by commas, and usually one IP is enough
--replication-factor: defines the number of partition replicas; 1 means a single replica, 2 is recommended
--partitions: defines the number of partitions
--topic: defines the topic name
-------------------------------------------------------------------------------------

#View all topics on the current server
kafka-topics.sh --list --zookeeper 192.168.50.20:2181,192.168.50.37:2181,192.168.50.40:2181

#View the details of a topic
kafka-topics.sh --describe --zookeeper 192.168.50.20:2181,192.168.50.37:2181,192.168.50.40:2181 --topic test

#Publish messages
kafka-console-producer.sh --broker-list 192.168.50.20:9092,192.168.50.37:9092,192.168.50.40:9092 --topic test

#Consume messages
kafka-console-consumer.sh --bootstrap-server 192.168.50.20:9092,192.168.50.37:9092,192.168.50.40:9092 --topic test --from-beginning
--from-beginning: reads out all the data previously stored in the topic

#Modify the number of partitions
kafka-topics.sh --zookeeper 192.168.50.20:2181,192.168.50.37:2181,192.168.50.40:2181 --alter --topic test --partitions 6

#Delete a topic
kafka-topics.sh --delete --zookeeper 192.168.50.20:2181,192.168.50.37:2181,192.168.50.40:2181 --topic test
5, Deploy Filebeat+Kafka+ELK
1. Deploy Filebeat
cd /usr/local/filebeat
vim filebeat.yml

filebeat.prospectors:
- type: log
  enabled: true
  paths:
    - /etc/httpd/logs/access_log

#Add configuration for output to Kafka
output.kafka:
  enabled: true
  hosts: ["192.168.50.20:9092","192.168.50.37:9092","192.168.50.40:9092"]    #Specify the Kafka cluster
  topic: "kafka_test"    #Specify the Kafka topic
2. Deploy ELK and create a new Logstash configuration file on the node where the Logstash component is located
vim filebeat.conf

input {
    kafka {
        bootstrap_servers => "192.168.50.20:9092,192.168.50.37:9092,192.168.50.40:9092"
        topics => "kafka_test"
        group_id => "test123"
        auto_offset_reset => "earliest"
    }
}
output {
    elasticsearch {
        hosts => ["192.168.50.33:9200"]
        index => "filebeat_test-%{+YYYY.MM.dd}"
    }
    stdout {
        codec => rubydebug
    }
}
Open http://192.168.50.33:5601 in a browser to log in to Kibana. Click the "Create Index Pattern" button and add the index "filebeat_test-*", click the "Create" button, then click the "Discover" button to view the chart and log information.