Detailed explanation of Kafka core concepts II


This article follows on from the previous one, which covered Kafka's core concepts: topics, cluster partitioning, the persistent message storage mechanism, and the leader election mechanism. It continues with topics and the message storage mechanism, then explains in detail how producers and consumers work, and finishes with transaction mode.

Message storage mechanism

Message number

Each message is assigned an incrementing sequence number (its offset) within its partition. The offset is recorded in the index file.

Consumer offset

Each consumer tracks its position as an offset while it consumes. Messages do not need to be acknowledged one by one, because the offset alone shows how far the consumer has read.

Time index (.timeindex)

Suppose we want to consume messages starting from a certain moment. That requires two things:
  • Each message must carry a timestamp.
  • A time index: the .timeindex file stores entries of the form { timestamp, offset }.
Storing timestamps matters well beyond Kafka. In most projects, data analysis and inspection need to query along the time dimension, so the need comes up again and again.
To fetch the messages starting at a given time, look up in the time index the offset of the first message whose timestamp is >= that time, then pull data starting from that offset.
This works much like an index in a database.
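The lookup described above can be sketched in plain Java. This is only a toy model: the class name TimeIndexSketch and its methods are made up for illustration, timestamps are assumed to arrive in non-decreasing order, and the real .timeindex file is a sparse on-disk structure rather than an in-memory map.

```java
import java.util.Map;
import java.util.TreeMap;

// Toy model of a time index: sorted entries of timestamp -> offset.
class TimeIndexSketch {
    private final TreeMap<Long, Long> entries = new TreeMap<>();

    // Record that the message at `offset` carries `timestampMs`.
    // Only the first offset seen for a given timestamp is kept.
    void append(long timestampMs, long offset) {
        entries.putIfAbsent(timestampMs, offset);
    }

    // Offset of the first indexed message whose timestamp is >= targetMs,
    // or -1 if every indexed message is older than targetMs.
    long lookup(long targetMs) {
        Map.Entry<Long, Long> entry = entries.ceilingEntry(targetMs);
        return entry == null ? -1L : entry.getValue();
    }

    public static void main(String[] args) {
        TimeIndexSketch index = new TimeIndexSketch();
        index.append(1_000L, 0L);
        index.append(2_000L, 5L);
        index.append(3_000L, 9L);
        System.out.println(index.lookup(1_500L)); // first entry at or after t=1500
        System.out.println(index.lookup(9_999L)); // no entry that recent
    }
}
```

A consumer would then start pulling from the returned offset, which is the second half of the procedure in the text.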
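The broker has a global default configuration parameter for message timestamps (log.message.timestamp.type):

  It can be set to the message creation time or the log append time. The default is the message creation time.

Each topic has a corresponding timestamp parameter (message.timestamp.type) that overrides the broker default.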


A consumer finds the offset that corresponds to a given time in each partition through the consumer API (KafkaConsumer provides offsetsForTimes for this).

  The result contains each partition together with its starting offset. With that, the consumer knows the position from which it can start consuming.


Topic configuration parameters

For the full parameter list, see the topic configuration section of the official Apache Kafka documentation.

Parameter modification

Topic-related configuration has both a server default and an optional per-topic override. If no per-topic value is given, the server default is used. An override can be set at topic creation time by passing one or more --config options. This example creates a topic named my-topic with a custom maximum message size and flush rate:

> bin/ --zookeeper localhost:2181 --create --topic my-topic \
    --partitions 1 --replication-factor 1 --config max.message.bytes=64000 --config flush.messages=1

An override can also be changed or set later with the alter configs command. This example updates the maximum message size for my-topic:

> bin/ --zookeeper localhost:2181 --entity-type topics \
    --entity-name my-topic --alter --add-config max.message.bytes=128000

To check which overrides are set on a topic:

> bin/ --zookeeper localhost:2181 --entity-type topics --entity-name my-topic --describe

To remove an override:

> bin/ --zookeeper localhost:2181 --entity-type topics \
    --entity-name my-topic --alter --delete-config max.message.bytes

Delete Topic

Create a topic:
[root@node4 latest]# bin/ --create --bootstrap-server localhost:9092 --replication-factor 1 --partitions 1 --topic test-1
Send a message:
[root@node4 latest]# bin/ --broker-list localhost:9092 --topic test-1
Delete the topic:
[root@node4 latest]# bin/ --delete --bootstrap-server localhost:9092 --topic test-1
When the topic is newly created and still empty, its metadata and all of its storage directories are deleted.
When the topic contains data, its metadata is deleted from ZooKeeper and its data directory is marked for deletion with a -delete suffix:
drwxr-xr-x. 2 root root 141 8 June 26-20:49 test-1-0.20bb2388c6f04c34b79419411bc1bda6-delete


Producer

Producers publish data to the topics of their choice. The producer is responsible for deciding which partition of the topic each record is assigned to.
import java.util.Properties;
import java.util.concurrent.Future;

import org.apache.kafka.clients.CommonClientConfigs;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;

public class ProducerDemo {
	public static void main(String[] args) throws Exception {
		// Producer configuration parameters
		Properties props = new Properties();
		props.put("bootstrap.servers", "");
		// The client ID lets the broker tell clients apart. It is optional.
		props.put(CommonClientConfigs.CLIENT_ID_CONFIG, "client-1");
		// How the broker acknowledges a publish
		props.put("acks", "all");
		// Number of retries when a send fails (0 means no retry)
		props.put("retries", 0);
		// The producer batches records to improve throughput. This is the batch size in bytes.
		props.put("batch.size", 16384);
		// How long to wait for more records to fill a batch after a record arrives. It complements the batch size above.
		props.put("", 1);
		// Size in bytes of the buffer that holds records waiting to be sent
		props.put("buffer.memory", 33554432);
		// Serializer for the key
		props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
		// Serializer for the message value
		props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

		try (Producer<String, String> producer = new KafkaProducer<>(props)) {

			for (int i = 0; i < 100; i++) {
				String message = "message-" + i;
				// The producer sends asynchronously in batches, so send() returns immediately.
				Future<RecordMetadata> resultFuture = producer.send(new ProducerRecord<String, String>("test", Integer.toString(i), message));

				// To block synchronously, wait for the result.
				RecordMetadata rm = resultFuture.get();

				System.out.println("send out:" + message + " hasOffset: " + rm.hasOffset() + " partition: " + rm.partition() + " offset: " + rm.offset());
			}
		}
	}
}

Producer acknowledgement mechanism

The producer configuration above includes this parameter:
// How the broker acknowledges a publish
props.put("acks", "all");

The meaning of each value is documented in ProducerConfig.

acks sets whether the producer requires an acknowledgement from the server for the data it sends. The accepted values are all, 0, 1 and -1. The default was 1 before Kafka 3.0 and is all from 3.0 onward.
  • 0: the producer does not wait for any acknowledgement from the broker.
  • 1: the leader sends an ack as soon as it has written the message itself, without waiting for the followers.
  • all: the leader waits for the full set of in-sync replicas to acknowledge the record before it sends an ack. The record will not be lost as long as at least one in-sync replica remains alive. This is the strongest available guarantee.
  • -1: equivalent to all.
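To make the difference between the settings concrete, here is a small plain-Java model of when a send counts as acknowledged. It is a sketch, not broker code: the class AcksSketch, the method acknowledged and its parameters are invented for this illustration, and the in-sync replica count is assumed to include the leader.

```java
// Toy model of when a send counts as acknowledged under each acks setting.
class AcksSketch {
    // leaderWrote: the partition leader has appended the record.
    // isrAcked / isrSize: how many in-sync replicas (leader included) have the record.
    static boolean acknowledged(String acks, boolean leaderWrote, int isrAcked, int isrSize) {
        switch (acks) {
            case "0":
                return true; // fire and forget: the producer never waits
            case "1":
                return leaderWrote; // the leader alone is enough
            case "all":
            case "-1":
                return leaderWrote && isrAcked == isrSize; // every in-sync replica
            default:
                throw new IllegalArgumentException("unknown acks value: " + acks);
        }
    }

    public static void main(String[] args) {
        // The leader has the record, but only 2 of 3 in-sync replicas do so far.
        System.out.println(acknowledged("1", true, 2, 3));
        System.out.println(acknowledged("all", true, 2, 3));
    }
}
```

The same situation is already acknowledged under acks=1 but still pending under acks=all, which is the trade-off between latency and durability.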

Handling the acknowledgement asynchronously with a Callback

The acknowledgement result can also be obtained asynchronously.

A send can be completely non-blocking: pass a Callback to process the result of the send. When using transactions a Callback is not necessary, because the results are handled by the transaction methods.
The callback receives a RecordMetadata on success or an exception on failure. Only one of the two is set.


  RecordMetadata does not contain the message data itself, as its properties show.

To know which message a result belongs to, you need to correlate it with the record yourself.

The exception tells you what went wrong, for example a timeout or a server that could not be found.

  Using it in Spring

In Spring, messages are sent directly with KafkaTemplate, and the send result can also be handled through a ProducerListener:

/**
 * Set a {@link ProducerListener} which will be invoked when Kafka acknowledges
 * a send operation. By default a {@link LoggingProducerListener} is configured
 * which logs errors only.
 * @param producerListener the listener; may be {@code null}.
 */
public void setProducerListener(@Nullable ProducerListener<K, V> producerListener) {
    this.producerListener = producerListener;
}

The listener is handed the corresponding record, so the result can be tied back to the message.

Producer interceptor

To add extra, uniform processing logic to the message sending path, provide a ProducerInterceptor implementation. The idea is the same as AOP.

Interceptors have no priority setting. They run in the order in which they were added.

Usage

Specify the implementation class names through interceptor.classes:

// Configure the ProducerInterceptor classes
props.put(ProducerConfig.INTERCEPTOR_CLASSES_CONFIG, ",");
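The ordering rule is easy to see in a small plain-Java sketch. This is not Kafka's ProducerInterceptor interface: the class InterceptorChainSketch is hypothetical and uses UnaryOperator<String> as a stand-in for an interceptor that may rewrite a record before it is sent.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.UnaryOperator;

// Toy interceptor chain: each step may rewrite the value before it is "sent".
class InterceptorChainSketch {
    private final List<UnaryOperator<String>> interceptors = new ArrayList<>();

    // There is no priority field; position in the list is the only ordering.
    InterceptorChainSketch add(UnaryOperator<String> interceptor) {
        interceptors.add(interceptor);
        return this;
    }

    // Run the value through every interceptor in the order they were added.
    String onSend(String value) {
        String current = value;
        for (UnaryOperator<String> interceptor : interceptors) {
            current = interceptor.apply(current);
        }
        return current;
    }

    public static void main(String[] args) {
        InterceptorChainSketch chain = new InterceptorChainSketch()
                .add(v -> v.trim())
                .add(v -> "[traced] " + v);
        System.out.println(chain.onSend("  hello  "));
    }
}
```

Swapping the two add calls changes the result, because the second interceptor always sees the output of the first.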

Idempotent mode

That is, how to make sure a message is written only once, so that retries do not produce duplicates.

Since Kafka 0.11, KafkaProducer supports two additional modes: the idempotent producer and the transactional producer. The idempotent producer strengthens Kafka's delivery semantics from at-least-once to exactly-once. In particular, producer retries no longer introduce duplicates. The transactional producer allows an application to send messages to multiple partitions (and topics!) atomically.
To enable it, set the producer parameter enable.idempotence to true.
Do not configure retries and acks yourself, because they automatically default to Integer.MAX_VALUE and all respectively.
Idempotence only covers the producer's own retries. The application itself must not send the same business data a second time.
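The deduplication behind idempotent mode rests on a producer id plus a per-partition sequence number attached to each send. The following plain-Java sketch models that idea for a single partition. It is a simplification: the class IdempotentLogSketch and its methods are hypothetical, and the real broker tracks sequence numbers per batch rather than per record.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy partition log that drops retried sends by tracking the last sequence
// number accepted from each producer id.
class IdempotentLogSketch {
    private final Map<Long, Integer> lastSequence = new HashMap<>();
    private final List<String> log = new ArrayList<>();

    // Returns true if the record was appended, false if it was a duplicate.
    boolean append(long producerId, int sequence, String value) {
        int last = lastSequence.getOrDefault(producerId, -1);
        if (sequence <= last) {
            return false; // already written: this is a retry of an earlier send
        }
        if (sequence != last + 1) {
            throw new IllegalStateException("out-of-order sequence " + sequence);
        }
        lastSequence.put(producerId, sequence);
        log.add(value);
        return true;
    }

    int size() {
        return log.size();
    }

    public static void main(String[] args) {
        IdempotentLogSketch partition = new IdempotentLogSketch();
        partition.append(7L, 0, "order-created");
        partition.append(7L, 0, "order-created"); // retry after a lost ack
        partition.append(7L, 1, "order-paid");
        System.out.println(partition.size());
    }
}
```

Note that a second send of the same payload with a new sequence number is accepted. That is why the application still has to avoid resending the same business data.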

Transaction mode

Transaction mode makes sending multiple messages to multiple partitions and multiple topics atomic: either all of the sends succeed or all of them fail.
Transaction mode requirements:
  • To use transaction mode and the corresponding API, the transactional.id property must be set. Once it is configured, idempotence is enabled automatically, together with the producer settings that idempotence depends on. The transactional.id is used to enable transaction recovery across multiple sessions of a single producer instance. It should be unique to each producer instance running in a partitioned application.
  • The Kafka cluster needs at least 3 nodes.
  • To get end-to-end transactional guarantees, the consumer must also be configured to read only committed messages.
The producer is thread safe, and the transaction API calls are synchronous and blocking.


On the consumer side, configure the isolation level, for example read committed.


Consumer

In Kafka, consumers fetch messages in pull mode by calling poll. Consuming a message does not remove it: the data stays in the log until it is deleted.

import java.time.Duration;
import java.util.Arrays;
import java.util.Properties;

import org.apache.kafka.clients.CommonClientConfigs;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class ConsumerDemo {
	public static void main(String[] args) {
		// Consumer configuration parameters
		Properties props = new Properties();
		props.put("bootstrap.servers", "localhost:9092");
		props.put(CommonClientConfigs.CLIENT_ID_CONFIG, "client-1");
		// Set the consumer group
		props.put("", "test");
		// Enable automatic offset commit.
		// If this value is set to true, the consumer periodically commits the offset it has consumed up to. When the consumer restarts after a failure, it resumes from the committed offset.
		props.put("", "true");
		// The interval between automatic offset commits
		props.put("", "1000");
		// Deserializer for the key
		props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
		// Deserializer for the message value
		props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

		String topic = "test";

		try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
			// Subscribe to topics
			consumer.subscribe(Arrays.asList(topic));

			while (true) {
				// Kafka uses pull mode. The timeout passed to poll says how long to wait before returning when no data is available yet.
				ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1L));
				for (ConsumerRecord<String, String> record : records)
					System.out.printf("offset = %d, key = %s, value = %s%n", record.offset(), record.key(), record.value());

				// Set where to start pulling messages (if necessary)
				// TopicPartition(topic, 0), 0);
			}
		}
	}
}

The consumer is not thread safe. In a multithreaded program, give each thread its own consumer.

Consumer group

Why have consumer groups? They are how Kafka load-balances consumption.

  • Consumers identify themselves with a consumer group name. Each record published to a topic is delivered to one consumer instance within each subscribing consumer group. Consumer instances can run in separate processes or on separate machines.
  • If all consumer instances are in the same consumer group, records are load-balanced across the instances.
  • If all consumer instances are in different consumer groups, each record is broadcast to all consumer processes.

  • Typically a topic has a small number of consumer groups, one for each "logical subscriber". A consumer group consists of many consumer instances, which makes it easy to scale and fault tolerant. This is the publish-subscribe model, except that the subscriber is a group of consumers rather than a single process.
  • Kafka implements consumption by dividing the partitions of the log among the consumer instances, so that at any time each partition has exactly one consumer within the group. Group membership is maintained dynamically by the Kafka protocol. If new instances join the group, they take over some partitions from other members. If an instance disappears, its partitions are distributed to the remaining instances.
  • Kafka only guarantees the order of records within a partition, not across different partitions of a topic. If all messages must be consumed in order, use a topic with a single partition, which means only one consumer process per consumer group.
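The way a group divides the partitions can be sketched in a few lines of plain Java. This is a simplified round-robin scheme for illustration, not one of Kafka's actual assignors; the class GroupAssignmentSketch and the member names are made up.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy partition assignment for one consumer group.
class GroupAssignmentSketch {
    // Deal partitions 0..partitionCount-1 out to the members in turn, so every
    // partition has exactly one owner within the group.
    static Map<String, List<Integer>> assign(List<String> members, int partitionCount) {
        Map<String, List<Integer>> assignment = new LinkedHashMap<>();
        for (String member : members) {
            assignment.put(member, new ArrayList<>());
        }
        if (members.isEmpty()) {
            return assignment;
        }
        for (int partition = 0; partition < partitionCount; partition++) {
            String owner = members.get(partition % members.size());
            assignment.get(owner).add(partition);
        }
        return assignment;
    }

    public static void main(String[] args) {
        // Two members share four partitions ...
        System.out.println(assign(Arrays.asList("c1", "c2"), 4));
        // ... and after c3 joins, a new assignment hands some partitions to it.
        System.out.println(assign(Arrays.asList("c1", "c2", "c3"), 4));
        // With a single partition only one member receives anything.
        System.out.println(assign(Arrays.asList("c1", "c2"), 1));
    }
}
```

The last call shows the ordering trade-off from the text: with one partition, the second member of the group sits idle.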

Group rebalance

Group rebalancing, including the rebalance delay for a consumer group, is controlled by configuration parameters.

Events that trigger a rebalance:

  • A change in the number of consumers
  • An increase or decrease in the number of partitions

Either of these causes a rebalance to start.

How to specify the consumer group id in Spring

By default, the id we set is used directly as the consumer group id.

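  How does Kafka know that a consumer has gone offline and a rebalance is needed?

It relies on heartbeats and sessions.

Heartbeat and session

The consumer has configuration parameters for the heartbeat interval and the session timeout.

Both have default values. The consumer sends heartbeats so that it can be monitored. If no heartbeat arrives before the session timeout expires, the consumer's status is updated and it is treated as offline.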
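The mechanism can be modelled in plain Java as follows. This is a toy sketch, not coordinator code: the class SessionTrackerSketch is hypothetical, time is passed in explicitly, and the 10 second timeout is just an example value. In the real client the relevant settings are heartbeat.interval.ms and session.timeout.ms.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy group coordinator that tracks the last heartbeat of each member.
class SessionTrackerSketch {
    private final long sessionTimeoutMs;
    private final Map<String, Long> lastHeartbeatMs = new HashMap<>();

    SessionTrackerSketch(long sessionTimeoutMs) {
        this.sessionTimeoutMs = sessionTimeoutMs;
    }

    // Called whenever a member's heartbeat arrives.
    void heartbeat(String memberId, long nowMs) {
        lastHeartbeatMs.put(memberId, nowMs);
    }

    // Remove and return the members whose session has expired at nowMs.
    // A non-empty result is what would trigger a rebalance.
    List<String> expire(long nowMs) {
        List<String> expired = new ArrayList<>();
        for (Map.Entry<String, Long> entry : lastHeartbeatMs.entrySet()) {
            if (nowMs - entry.getValue() > sessionTimeoutMs) {
                expired.add(entry.getKey());
            }
        }
        lastHeartbeatMs.keySet().removeAll(expired);
        return expired;
    }

    public static void main(String[] args) {
        SessionTrackerSketch coordinator = new SessionTrackerSketch(10_000L);
        coordinator.heartbeat("c1", 0L);
        coordinator.heartbeat("c2", 0L);
        coordinator.heartbeat("c1", 9_000L); // c2 goes silent
        System.out.println(coordinator.expire(12_000L));
    }
}
```

Only the silent member is expired, and its partitions would then be handed to the remaining members.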

Consumer offset

When a consumer starts consuming messages, the auto.offset.reset parameter controls where it begins.

It decides what to do when there is no committed initial offset, or when the offset is out of range:

  • earliest (called smallest in the old ZooKeeper-based consumer): automatically reset the offset to the smallest offset;
  • latest (called largest in the old consumer): automatically reset the offset to the largest offset;
  • none: throw an exception to the consumer.
// Set where a newly created consumer starts consuming messages
props.put("auto.offset.reset", "earliest");

Automatic and manual commit

Does a consumer start from scratch every time it starts? How does it resume from where it left off last time?

Automatic commit

Turn on automatic commit:
// Enable automatic offset commit.
// If this value is set to true, the consumer periodically commits the offset it has consumed up to. When the consumer restarts after a failure, it resumes from the committed offset.
props.put("", "true");
// The interval between automatic offset commits
props.put("", "1000");
Automatic commit has two problems:
  • Repeated consumption: if the consumer fails after processing messages but before the next automatic commit, those messages are delivered again after the restart.
  • Lost messages: if an offset is committed before the messages up to it have been fully processed and the consumer then fails, those messages are skipped after the restart.

These problems come with automatic commit. The application can only detect them itself and pull the data again.

Manual commit

In code:

// Turn off automatic commit
props.put("", "false");
// Commit the consumer offset manually and synchronously
consumer.commitSync();

If the commit fails, an exception is thrown, so it can be caught and handled at this point.

Commit methods of the consumer

The consumer offers synchronous and asynchronous commit methods, as well as a method for reading the offset that was committed last.

Committing offsets flexibly, per partition

When a consumer subscribes to multiple topics, a single pull may return messages from multiple topics and multiple partitions. The manual commit above commits all partitions together. If necessary, the commit can be controlled more finely: process the messages partition by partition, and commit a partition's offset as soon as that partition has been processed. [Note] The committed offset is the starting position of the next pull (lastOffset + 1).
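The bookkeeping for a per-partition commit can be sketched without the Kafka client. The class PartitionCommitSketch and its Rec type below are stand-ins invented for this example. With the real consumer, the resulting positions would be wrapped as a Map<TopicPartition, OffsetAndMetadata> and passed to commitSync.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy version of "commit per partition": from one batch of polled records,
// work out the offset to commit for each partition.
class PartitionCommitSketch {
    // Minimal stand-in for a consumed record: which partition, which offset.
    static final class Rec {
        final int partition;
        final long offset;

        Rec(int partition, long offset) {
            this.partition = partition;
            this.offset = offset;
        }
    }

    // For every partition in the batch, the offset to commit is the highest
    // processed offset + 1, i.e. the position of the next record to read.
    static Map<Integer, Long> offsetsToCommit(List<Rec> batch) {
        Map<Integer, Long> commits = new TreeMap<>();
        for (Rec rec : batch) {
            commits.merge(rec.partition, rec.offset + 1, Long::max);
        }
        return commits;
    }

    public static void main(String[] args) {
        List<Rec> batch = Arrays.asList(
                new Rec(0, 41), new Rec(0, 42), new Rec(1, 7));
        System.out.println(offsetsToCommit(batch));
    }
}
```

Committing lastOffset + 1 rather than lastOffset matters: committing the last processed offset itself would cause that record to be delivered again after a restart.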

Controlling the consumption position

There are two situations in which the consumption position needs to be controlled:
  • The consumer program stopped at some point and, after restarting, should continue consuming from that time. The consumer's position is specified by time.
  • The consumer stored the consumed messages locally, but the local storage was damaged, and it needs to fetch another copy.
When controlling the position, the consumer usually consumes explicitly assigned partitions and then specifies the position in each one.
This ties back to Kafka's partitioning.
String topic = "foo";
TopicPartition partition0 = new TopicPartition(topic, 0);
TopicPartition partition1 = new TopicPartition(topic, 1);
consumer.assign(Arrays.asList(partition0, partition1));


Data can also be fetched starting from a certain time, using the time-based offset lookup described earlier.

 subscribe vs assign

The difference between subscribing and assigning in practice:

  • subscribe: the partitions a consumer reads are assigned dynamically by Kafka according to the consumer group, and can be rebalanced dynamically.
  • assign: the user specifies which partitions to consume, and rebalancing does not affect them.
// Do not subscribe to topics
// consumer.subscribe(Arrays.asList("test", "test-group"));
// Instead, assign the consumer directly to specific partitions.
// Only one of subscribe and assign can be used. An assignment is not affected by rebalancing.
TopicPartition partition = new TopicPartition("test", 0);
consumer.assign(Arrays.asList(partition));

A rebalance can be observed through a ConsumerRebalanceListener.

  poll settings

How much data is returned by one pull?

Whether it returns everything available or only part of it is configurable.

Consumer parameters:
  • fetch.min.bytes   The minimum amount of data the server should return for a fetch request. If not enough data is available, the request waits for that much data to accumulate before it is answered. The default is 1 byte, which means a fetch request is answered as soon as a single byte of data is available, or when the request times out while waiting for data. Setting this to a value greater than 1 makes the server wait for more data to accumulate, which can improve server throughput somewhat at the cost of some additional latency.
  • max.partition.fetch.bytes    The maximum amount of data per partition that the server will return. The consumer fetches records in batches. If the first record batch in the first non-empty partition of the fetch is larger than this limit, the batch is still returned so that the consumer can make progress. The maximum record batch size accepted by the broker is defined by message.max.bytes (broker configuration) or max.message.bytes (topic configuration). To limit the size of a consumer request, see fetch.max.bytes.
  • fetch.max.bytes     The maximum amount of data the server should return for a fetch request. The consumer fetches records in batches. If the first record batch in the first non-empty partition of the fetch is larger than this value, the batch is still returned so that the consumer can make progress. It is therefore not an absolute maximum. The maximum record batch size accepted by the broker is defined by message.max.bytes (broker configuration) or max.message.bytes (topic configuration). Note that the consumer performs multiple fetches in parallel.

  • max.poll.records   The maximum number of records returned in a single call to poll().

The full parameter list is in the configuration section of the official documentation.

This parameter is used to prevent repetition and always take the value of a partition.
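As a configuration sketch, the four settings above can be collected into a Properties object like this. The class and method names are hypothetical, and the values are meant to match the documented client defaults (1 byte, 1 MiB, 50 MiB, 500 records); check them against the documentation for the client version in use.

```java
import java.util.Properties;

// Collects the poll-related consumer settings discussed above.
class FetchSettingsSketch {
    static Properties fetchSettings() {
        Properties props = new Properties();
        // Answer a fetch as soon as at least this many bytes are available.
        props.put("fetch.min.bytes", "1");
        // Upper bound on the data returned per partition (1 MiB).
        props.put("max.partition.fetch.bytes", "1048576");
        // Upper bound on the data returned per fetch request (50 MiB).
        props.put("fetch.max.bytes", "52428800");
        // Upper bound on the number of records handed back by one poll().
        props.put("max.poll.records", "500");
        return props;
    }

    public static void main(String[] args) {
        Properties props = fetchSettings();
        for (String name : props.stringPropertyNames()) {
            System.out.println(name + " = " + props.getProperty(name));
        }
    }
}
```

These entries would be merged into the consumer's Properties before the consumer is created.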

Flow control of consumption

When a consumer is assigned multiple partitions, a poll pulls data from all of them at the same time. In some cases, however, we may want to consume a subset of the assigned partitions at full speed and only start consuming the others once that subset has little or no data left.
Such scenarios include:
  • In stream processing, the program joins the data of two topics, and one topic is produced faster than the other. The consumption of the fast topic then has to be slowed down to match the slow one.
  • When the consumer starts, a large backlog of messages has already piled up in the assigned topics. The program wants to handle the topics with the most recent data first and deal with the topics holding old data afterwards.
Kafka's consumer supports dynamic flow control through pause(Collection) and resume(Collection).
  • pause(Collection partitions) suspends fetching from the given partitions
  • resume(Collection partitions) resumes fetching from the given partitions
Both take effect on the next call to poll(Duration).
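The effect of pause and resume can be modelled with an in-memory stand-in. The class PausableConsumerSketch below is a toy, not the Kafka consumer: partitions are plain integers, and produce is a helper invented to fill them with data.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Toy consumer over in-memory partitions, modelling pause/resume.
class PausableConsumerSketch {
    private final Map<Integer, List<String>> pending = new LinkedHashMap<>();
    private final Set<Integer> paused = new HashSet<>();

    // Test helper: put a value into a partition.
    void produce(int partition, String value) {
        pending.computeIfAbsent(partition, p -> new ArrayList<>()).add(value);
    }

    void pause(Collection<Integer> partitions) {
        paused.addAll(partitions);
    }

    void resume(Collection<Integer> partitions) {
        paused.removeAll(partitions);
    }

    // Drain everything pending on partitions that are not paused.
    // Paused partitions keep their records for a later poll.
    List<String> poll() {
        List<String> out = new ArrayList<>();
        for (Map.Entry<Integer, List<String>> entry : pending.entrySet()) {
            if (!paused.contains(entry.getKey())) {
                out.addAll(entry.getValue());
                entry.getValue().clear();
            }
        }
        return out;
    }

    public static void main(String[] args) {
        PausableConsumerSketch consumer = new PausableConsumerSketch();
        consumer.produce(0, "slow-topic-record");
        consumer.produce(1, "fast-topic-record");
        consumer.pause(Arrays.asList(1));
        System.out.println(consumer.poll()); // only partition 0 is drained
        consumer.resume(Arrays.asList(1));
        System.out.println(consumer.poll()); // partition 1 catches up
    }
}
```

Nothing is lost while a partition is paused. Its records simply wait until the partition is resumed, which is what makes the mechanism usable for matching the speed of two streams.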


This relates to the producer's use of transactions. The key point on the consumer side is the isolation level.
For the consumer's configuration parameters, see the Kafka Chinese documentation (ApacheCN).


API in Spring

  • @KafkaListener
  • @TopicPartition
  • @PartitionOffset
  • KafkaListenerErrorHandler



Tags: kafka Distribution

Posted on Sat, 30 Oct 2021 05:00:51 -0400 by SJones