I. Fault description
On September 22, one Kafka broker in the national Kafka cluster was shut down due to insufficient disk space. The business was affected: it could neither produce nor consume messages. The program reported:
WARN [Consumer clientId=consumer-1, groupId=console-consumer-55928] 1 partitions have leader brokers without a matching listener, including [baidd-0] (org.apache.kafka.clients.NetworkClient)
II. Fault simulation
2.1 When the topic partition has only one replica
#Produce messages
[root@Centos7-Mode-V8 kafka]# bin/kafka-console-producer.sh --broker-list 192.168.144.247:9193,192.168.144.251:9193,192.168.144.253:9193 --topic baidd
>aa
>bb
#Consume messages:
[root@Centos7-Mode-V8 kafka]# bin/kafka-console-consumer.sh --bootstrap-server 192.168.144.247:9193,192.168.144.251:9193,192.168.144.253:9193 --topic baidd
aa
bb
2.1.1 Simulate shutting down the topic's leader node
#Use the Kafka tool to check which node hosts the leader of the topic partition
/*
You can also check it with the Kafka command:
bin/kafka-topics.sh --zookeeper 192.168.144.247:3292,192.168.144.251:3292,192.168.144.253:3292 --describe
*/
Shut down the leader node; the producer and all consumer processes then keep printing the following warning:
[2021-09-23 17:09:53,495] WARN [Consumer clientId=consumer-1, groupId=console-consumer-55928] 1 partitions have leader brokers without a matching listener, including [baidd-0] (org.apache.kafka.clients.NetworkClient)
Messages can be neither produced nor consumed.
2.1.2 Simulate shutting down a non-leader node
Sometimes the consumer process reports an error:
[2021-09-23 17:21:22,480] WARN [Consumer clientId=consumer-1, groupId=console-consumer-55928] Connection to node 2147483645 (/192.168.144.253:9193) could not be established. Broker may not be available. (org.apache.kafka.clients.NetworkClient)
During the error period, messages can still be produced normally, but the data produced in the meantime cannot be consumed.
2.1.3 Summary
When a partition has only one replica, stopping any node affects the business:
When the node hosting the partition leader goes down, both producing and consuming messages are affected.
When a non-leader node goes down, consuming messages is affected (because, as shown in section 2.2, that node can hold the sole replica of the __consumer_offsets partitions).
2.2 Partitions with multiple replicas
When a partition has no other replicas, the impact on the business is understandable. So we tried configuring multiple replicas for the topic, but found it still affected the business:
#Create a topic with three replicas
bin/kafka-topics.sh --create --zookeeper 192.168.144.247:3292,192.168.144.251:3292,192.168.144.253:3292 --replication-factor 3 --partitions 1 --topic song
#View replica information
[root@Centos7-Mode-V8 kafka]# bin/kafka-topics.sh --zookeeper 192.168.144.247:3292,192.168.144.251:3292,192.168.144.253:3292 --describe --topic song
Topic:song PartitionCount:1 ReplicationFactor:3 Configs:
Topic: song Partition: 0 Leader: 0 Replicas: 0,2,1 Isr: 0,2,1
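For scripting, the describe output can be parsed programmatically. Below is a minimal Python sketch (the parsing function is our own, not part of the Kafka tooling) that extracts the leader, replicas, and ISR from a partition line like the one above:

```python
# Parse one partition line of `kafka-topics.sh --describe` output.
# Assumes the whitespace-separated "Key: value" layout shown above.
def parse_partition_line(line):
    fields = {}
    tokens = line.split()
    i = 0
    while i < len(tokens):
        # Pair each "Key:" token with the value token that follows it.
        if tokens[i].endswith(":") and i + 1 < len(tokens):
            fields[tokens[i][:-1]] = tokens[i + 1]
            i += 2
        else:
            i += 1
    return {
        "topic": fields["Topic"],
        "partition": int(fields["Partition"]),
        "leader": int(fields["Leader"]),
        "replicas": [int(x) for x in fields["Replicas"].split(",")],
        "isr": [int(x) for x in fields["Isr"].split(",")],
    }

info = parse_partition_line(
    "Topic: song Partition: 0 Leader: 0 Replicas: 0,2,1 Isr: 0,2,1"
)
print(info["leader"], info["replicas"], info["isr"])  # → 0 [0, 2, 1] [0, 2, 1]
```

A line parsed this way makes it easy to check, for example, whether the ISR has shrunk below the replica list after a broker is stopped.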
#Send a message
bin/kafka-console-producer.sh --broker-list 192.168.144.247:9193,192.168.144.251:9193,192.168.144.253:9193 --topic song
#Consumer process 1
bin/kafka-console-consumer.sh --bootstrap-server 192.168.144.247:9193,192.168.144.251:9193,192.168.144.253:9193 --topic song --group g1
#Consumer process 2
bin/kafka-console-consumer.sh --bootstrap-server 192.168.144.247:9193,192.168.144.251:9193,192.168.144.253:9193 --topic song --group g2
#Simulate shutting down the topic's leader node
Messages can still be produced, and the "1 partitions have leader brokers without a matching listener" warning is no longer reported. However, consumers sometimes report the following error when they cannot connect to the stopped node:
[2021-09-24 19:01:06,316] WARN [Consumer clientId=consumer-1, groupId=console-consumer-27609] Connection to node 2147483647 (/192.168.144.247:9193) could not be established. Broker may not be available. (org.apache.kafka.clients.NetworkClient)
Sometimes the data produced during this period never arrives: messages produced while the node is down cannot be consumed.
But why are messages still lost when a node goes down, even though the topic has multiple replicas?
Answer: __consumer_offsets has only one replica, which prevents topics that do have multiple replicas from achieving high availability.
Later, after expanding the replicas of Kafka's internal topic (__consumer_offsets), ordinary topics did achieve high availability. Although "Broker may not be available" is still reported after a node is stopped, the business is no longer affected.
III. Fault location
default.replication.factor=3 was not configured in the Kafka configuration file. This parameter defaults to 1, meaning new topics get no additional replicas, so each partition is effectively a single point of failure.
IV. Solutions
4.1 Modify the default.replication.factor parameter
Modify the configuration files of all Kafka nodes to increase the default replication factor for topics (this parameter defaults to 1):
default.replication.factor=3
If default.replication.factor=3 is set, offsets.topic.replication.factor will also default to 3.
Note: do not set default.replication.factor=3 together with offsets.topic.replication.factor=1; for __consumer_offsets, the offsets.topic.replication.factor value overrides default.replication.factor.
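The note above can be checked mechanically before restarting. Here is a small Python sketch (the function name and warning text are our own) that flags the risky combination in a server.properties fragment; it only compares the two settings when both are present explicitly:

```python
# Sketch: flag a server.properties fragment where __consumer_offsets would
# get fewer replicas than ordinary topics (the risky combination noted above).
def check_replication_config(properties_text):
    """Return a warning string if the explicit settings conflict, else None."""
    conf = {}
    for line in properties_text.splitlines():
        line = line.strip()
        if line and not line.startswith("#") and "=" in line:
            key, _, value = line.partition("=")
            conf[key.strip()] = value.strip()
    default_rf = conf.get("default.replication.factor")
    offsets_rf = conf.get("offsets.topic.replication.factor")
    if default_rf and offsets_rf and int(offsets_rf) < int(default_rf):
        return ("offsets.topic.replication.factor=%s is lower than "
                "default.replication.factor=%s; __consumer_offsets remains "
                "a single point of failure" % (offsets_rf, default_rf))
    return None

print(check_replication_config(
    "default.replication.factor=3\noffsets.topic.replication.factor=1\n"
))
```

Running this over each node's configuration before a rolling restart catches the mismatch early.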
#Restart Kafka for the configuration to take effect
systemctl restart kafka
4.2 Expand replicas of existing ordinary topics
See https://blog.csdn.net/yabingshi_tech/article/details/120443647 for the procedure.
4.3 Expand replicas of __consumer_offsets
The method is the same as above; the JSON file is as follows:
{ "version": 1, "partitions": [ { "topic": "__consumer_offsets", "partition": 0, "replicas": [ 0, 1, 2 ] }, { "topic": "__consumer_offsets", "partition": 1, "replicas": [ 0, 1, 2 ] }, { "topic": "__consumer_offsets", "partition": 2, "replicas": [ 0, 1, 2 ] }, { "topic": "__consumer_offsets", "partition": 3, "replicas": [ 1, 2, 0 ] }, { "topic": "__consumer_offsets", "partition": 4, "replicas": [ 1, 2, 0 ] }, { "topic": "__consumer_offsets", "partition": 5, "replicas": [ 1, 2, 0 ] }, { "topic": "__consumer_offsets", "partition": 6, "replicas": [ 2, 0, 1 ] }, { "topic": "__consumer_offsets", "partition": 7, "replicas": [ 2, 0, 1 ] }, { "topic": "__consumer_offsets", "partition": 8, "replicas": [ 2, 0, 1 ] }, { "topic": "__consumer_offsets", "partition": 9, "replicas": [ 0, 1, 2 ] }, { "topic": "__consumer_offsets", "partition": 10, "replicas": [ 0, 1, 2 ] }, { "topic": "__consumer_offsets", "partition": 11, "replicas": [ 0, 1, 2 ] }, { "topic": "__consumer_offsets", "partition": 12, "replicas": [ 1, 2, 0 ] }, { "topic": "__consumer_offsets", "partition": 13, "replicas": [ 1, 2, 0 ] }, { "topic": "__consumer_offsets", "partition": 14, "replicas": [ 1, 2, 0 ] }, { "topic": "__consumer_offsets", "partition": 15, "replicas": [ 2, 0, 1 ] }, { "topic": "__consumer_offsets", "partition": 16, "replicas": [ 2, 0, 1 ] }, { "topic": "__consumer_offsets", "partition": 17, "replicas": [ 2, 0, 1 ] }, { "topic": "__consumer_offsets", "partition": 18, "replicas": [ 0, 1, 2 ] }, { "topic": "__consumer_offsets", "partition": 19, "replicas": [ 0, 1, 2 ] }, { "topic": "__consumer_offsets", "partition": 20, "replicas": [ 0, 1, 2 ] }, { "topic": "__consumer_offsets", "partition": 21, "replicas": [ 1, 2, 0 ] }, { "topic": "__consumer_offsets", "partition": 22, "replicas": [ 1, 2, 0 ] }, { "topic": "__consumer_offsets", "partition": 23, "replicas": [ 1, 2, 0 ] }, { "topic": "__consumer_offsets", "partition": 24, "replicas": [ 2, 0, 1 ] }, { "topic": "__consumer_offsets", "partition": 25, 
"replicas": [ 2, 0, 1 ] }, { "topic": "__consumer_offsets", "partition": 26, "replicas": [ 2, 0, 1 ] }, { "topic": "__consumer_offsets", "partition": 27, "replicas": [ 0, 1, 2 ] }, { "topic": "__consumer_offsets", "partition": 28, "replicas": [ 0, 1, 2 ] }, { "topic": "__consumer_offsets", "partition": 29, "replicas": [ 0, 1, 2 ] }, { "topic": "__consumer_offsets", "partition": 30, "replicas": [ 1, 2, 0 ] }, { "topic": "__consumer_offsets", "partition": 31, "replicas": [ 1, 2, 0 ] }, { "topic": "__consumer_offsets", "partition": 32, "replicas": [ 1, 2, 0 ] }, { "topic": "__consumer_offsets", "partition": 33, "replicas": [ 2, 0, 1 ] }, { "topic": "__consumer_offsets", "partition": 34, "replicas": [ 2, 0, 1 ] }, { "topic": "__consumer_offsets", "partition": 35, "replicas": [ 2, 0, 1 ] }, { "topic": "__consumer_offsets", "partition": 36, "replicas": [ 0, 1, 2 ] }, { "topic": "__consumer_offsets", "partition": 37, "replicas": [ 0, 1, 2 ] }, { "topic": "__consumer_offsets", "partition": 38, "replicas": [ 0, 1, 2 ] }, { "topic": "__consumer_offsets", "partition": 39, "replicas": [ 1, 2, 0 ] }, { "topic": "__consumer_offsets", "partition": 40, "replicas": [ 1, 2, 0 ] }, { "topic": "__consumer_offsets", "partition": 41, "replicas": [ 1, 2, 0 ] }, { "topic": "__consumer_offsets", "partition": 42, "replicas": [ 2, 0, 1 ] }, { "topic": "__consumer_offsets", "partition": 43, "replicas": [ 2, 0, 1 ] }, { "topic": "__consumer_offsets", "partition": 44, "replicas": [ 2, 0, 1 ] }, { "topic": "__consumer_offsets", "partition": 45, "replicas": [ 0, 1, 2 ] }, { "topic": "__consumer_offsets", "partition": 46, "replicas": [ 0, 1, 2 ] }, { "topic": "__consumer_offsets", "partition": 47, "replicas": [ 1, 2, 0 ] }, { "topic": "__consumer_offsets", "partition": 48, "replicas": [ 1, 2, 0 ] }, { "topic": "__consumer_offsets", "partition": 49, "replicas": [ 2, 0, 1 ] } ] }
--This article references: "Kafka suddenly went down? Hold on, don't panic!"