Kafka producer - concept overview | configuration parameters | serialization | partitioning -- Notes on Kafka: The Definitive Guide

Kafka producer - writes data to Kafka

In addition to the built-in Java client, Kafka also exposes a binary wire protocol: we can read messages from or write messages to Kafka by sending the appropriate byte sequences directly to Kafka's network port. As a result, there are Kafka clients implemented in many languages, such as C++, Python, and Go; we are not limited to Java.

Producer overview

In many cases, an application needs to write messages to Kafka: recording user activity, storing log messages, recording information from smart appliances, communicating asynchronously with other applications, buffering data before writing it to a database, and so on.

Message sending process

Although the producer API is simple to use, the process of sending messages is quite complex.

1) We start by creating a ProducerRecord object. The ProducerRecord must contain the target topic and the content to be sent; a key and a partition can optionally be specified as well.

2) When the ProducerRecord is sent, the producer first serializes the key and value objects into byte arrays so they can be transmitted over the network.

3) Next, the data is passed to the partitioner. If a partition was already specified in the ProducerRecord, the partitioner does nothing and simply returns that partition. If not, the partitioner chooses a partition based on the record's key.

4) The record is then added to a record batch; all messages in a batch are destined for the same topic and partition.

5) A separate thread then sends these record batches to the appropriate brokers.

6) When the broker receives the messages, it returns a response. If the messages were successfully written to Kafka, it returns a RecordMetadata object containing the topic and partition information, as well as the record's offset within the partition.
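
To make step 1 concrete, here is a brief sketch of the ProducerRecord constructors described above. The topic, key, and value strings are placeholders used only for illustration.

//Topic and value only: the key is null, and the partitioner picks a partition
ProducerRecord<String, String> noKey =
        new ProducerRecord<>("CustomerCountry", "France");

//Topic, key, and value: the default partitioner hashes the key to choose a partition
ProducerRecord<String, String> withKey =
        new ProducerRecord<>("CustomerCountry", "Precision Products", "France");

//Topic, explicit partition, key, and value: the partitioner is bypassed (step 3)
ProducerRecord<String, String> withPartition =
        new ProducerRecord<>("CustomerCountry", 0, "Precision Products", "France");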

Create Kafka producer

To write messages to Kafka, we first create a producer object and set some properties. A Kafka producer has three required properties:

bootstrap.servers

This property specifies the list of broker addresses, in host:port format. It is recommended to provide at least two brokers, so that if one of them goes down the producer can still connect to the cluster.

key.serializer

The broker expects the keys and values of messages to be byte arrays, so the producer must convert the Java objects used as keys into byte arrays. The Kafka client ships with ByteArraySerializer, StringSerializer, and IntegerSerializer by default.

value.serializer

Like key.serializer, this specifies how the value object is turned into a byte array. If the key is an integer and the value is a string, a different serializer is needed for each.

Properties kafkaProps = new Properties();

kafkaProps.put("bootstrap.servers", "hadoop102:9092,hadoop103:9092");
kafkaProps.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
kafkaProps.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

KafkaProducer<String, String> producer = new KafkaProducer<String, String>(kafkaProps);

Once the producer has been instantiated, you can start sending messages.

There are three ways to send messages.

  • Send and forget (fire and forget)
    • We send the message to the server, but we don't care whether it arrives or not.
  • Synchronous send
    • We use the send() method to send a message. It returns a Future object; calling get() on it waits until we know whether the message was sent successfully.
  • Asynchronous send
    • We call send() and specify a callback function, which gets triggered when a response is received from the broker.
//Example record to send; the topic, key, and value here are placeholders
ProducerRecord<String, String> record =
        new ProducerRecord<>("CustomerCountry", "Precision Products", "France");

//Send and forget
producer.send(record);

//Synchronous send
try {
    RecordMetadata metadata = producer.send(record).get();
    long offset = metadata.offset();
} catch (Exception e) {
    e.printStackTrace();
}

//Asynchronous send
producer.send(record, new Callback() {
    @Override
    public void onCompletion(RecordMetadata recordMetadata, Exception e) {
        if (e != null) {
            e.printStackTrace();
        }
    }
});

Producer configuration

  • acks
    • The acks parameter specifies how many partition replicas must receive the message before the producer considers the write successful. This parameter has a large impact on the likelihood of message loss.
      • If acks=0, the producer does not wait for any response from the server before considering the message written successfully. If the message is lost along the way, the producer will not know. The advantage is that messages can be sent at maximum speed, achieving very high throughput.
      • If acks=1, the producer receives a success response as soon as the cluster's leader replica receives the message. If the message cannot reach the leader (for example, the leader has crashed and a new one has not yet been elected), the producer receives an error response and resends the message. However, if a replica that did not receive the message is elected as the new leader, the message is still lost.
      • If acks=all, the producer receives a success response only after all in-sync replicas have received the message. This is the safest mode, but its latency is higher than acks=1.
  • buffer.memory
    • Sets the size of the producer's memory buffer, which the producer uses to buffer messages waiting to be sent to the server. If the application generates messages faster than they can be sent, the producer runs out of buffer space; the send() call then either blocks or throws an exception, depending on how the block.on.buffer.full parameter is set (in newer client versions this behavior is governed by max.block.ms).
  • compression.type
    • By default, messages are sent uncompressed. This parameter can be set to snappy or gzip, specifying which compression algorithm is applied before the message is sent to the broker.
      • The snappy algorithm was developed by Google. It uses relatively little CPU while providing good performance and a reasonable compression ratio, making it a good choice when both performance and network bandwidth matter.
      • The gzip algorithm generally uses more CPU but provides a higher compression ratio, so it is a good choice when network bandwidth is limited.
    • Using compression reduces network transfer and storage overhead, which are often the bottleneck when sending messages to Kafka.
  • retries
    • The producer may receive transient errors when sending messages to the server (for example, a partition temporarily has no leader). In this case, the retries parameter determines how many times the producer will resend the message. By default, the producer waits 100 ms between retries; this interval can be changed with the retry.backoff.ms parameter.
      • It is worth testing how long it takes to recover from a crashed node and setting the total retry time to be longer than that.
  • batch.size
    • This parameter specifies the amount of memory, in bytes, that one batch can use (not the number of messages). When the batch is full, all the messages in it are sent; the producer does not necessarily wait for the batch to fill before sending, though (see the linger.ms parameter). If batch.size is too large, it simply uses more memory; if it is too small, messages are sent more frequently, adding overhead.
  • linger.ms
    • This parameter specifies how long the producer waits for additional messages to join the batch before sending it. KafkaProducer sends the batch when it is full or when linger.ms is reached.
      • Setting it greater than 0 adds latency but improves throughput (because more messages are sent at once, the per-message overhead is smaller).
  • max.in.flight.requests.per.connection
    • This parameter specifies how many messages the producer can send before receiving the server response. The higher its value, the more memory it will occupy, but it will improve the throughput.
  • timeout.ms, request.timeout.ms, metadata.fetch.timeout.ms
    • request.timeout.ms specifies how long the producer waits for a response from the server when sending data.
    • metadata.fetch.timeout.ms specifies how long the producer waits for a response from the server when fetching metadata (for example, who the leader of a partition is).
    • timeout.ms specifies how long the broker waits for in-sync replicas to acknowledge the message; it works together with the acks configuration. If the acknowledgments are not received within the specified time, the broker returns an error.
  • max.request.size
    • Controls the size of the requests sent by the producer; it caps the maximum size of a single message that can be sent. The broker also has its own limit on the largest message it will accept (message.max.bytes), so the two settings should match so that messages sent by the producer are not rejected by the broker. A combined configuration sketch covering several of the parameters above follows this list.
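
As a rough illustration of how several of the parameters above fit together, here is a hedged configuration sketch. The broker addresses are the placeholder hosts used earlier, and the specific values are arbitrary examples rather than recommendations.

Properties kafkaProps = new Properties();
kafkaProps.put("bootstrap.servers", "hadoop102:9092,hadoop103:9092");
kafkaProps.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
kafkaProps.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

//Reliability vs. latency (see the acks and retries discussion above)
kafkaProps.put("acks", "all");
kafkaProps.put("retries", 3);
kafkaProps.put("retry.backoff.ms", 100);

//Batching and compression
kafkaProps.put("compression.type", "snappy");
kafkaProps.put("batch.size", 16384); //bytes per batch, not number of messages
kafkaProps.put("linger.ms", 5);      //wait up to 5 ms for more messages to join a batch

KafkaProducer<String, String> producer = new KafkaProducer<>(kafkaProps);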

Sequence guarantee

Kafka guarantees that messages within a single partition are ordered.

In some scenarios, such as banking, ordering is very important.

However, if the retries parameter is set to a non-zero integer and max.in.flight.requests.per.connection is set to a number greater than 1, then when writing the first batch fails while writing the second batch succeeds, the producer retries the first batch. If the retry succeeds after the second batch has already been written, the order of the two batches is reversed.

So in scenarios that require messages to be in order, successful writes also matter. Setting retries to 0 is therefore not recommended; instead, max.in.flight.requests.per.connection can be set to 1, so that while the producer is sending a batch of messages, no other messages are sent to the broker (at the cost of throughput).
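
A minimal sketch of an ordering-safe configuration, following the reasoning above; the broker addresses are the same placeholder hosts used earlier.

Properties props = new Properties();
props.put("bootstrap.servers", "hadoop102:9092,hadoop103:9092");
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

props.put("retries", 3);                                //keep retrying on transient errors instead of dropping messages
props.put("max.in.flight.requests.per.connection", 1); //only one batch in flight at a time, so retries cannot reorder batches

KafkaProducer<String, String> producer = new KafkaProducer<>(props);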

Serializer

We mentioned earlier that instantiating a producer requires specifying serializers. Although Kafka provides default serializers for strings, integers, and byte arrays, they are not enough for most scenarios, because the record types we need to serialize keep growing in number and complexity.

Custom serializer

If the objects we send to Kafka are not simple strings or integers, we can either use a serialization framework, such as Avro, Thrift, or Protobuf, or write a custom serializer. Using an off-the-shelf serialization framework is generally the better choice.

Suppose the message we need to send is a Customer object:

/**
 * @Author Juniors Lee
 * @Date 2021/11/16
 */
public class Customer {

    private int customID;

    private String customerName;

    public int getCustomID() {
        return customID;
    }

    public void setCustomID(int customID) {
        this.customID = customID;
    }

    public String getCustomerName() {
        return customerName;
    }

    public void setCustomerName(String customerName) {
        this.customerName = customerName;
    }
}

So we want to customize the serializer.

import org.apache.kafka.common.serialization.Serializer;

import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Map;

/**
 * @Author Juniors Lee
 * @Date 2021/11/16
 */
public class CustomerSerializer implements Serializer<Customer> {

    @Override
    public void configure(Map<String, ?> configs, boolean isKey) {

        //No configuration needed
    }

    /**
     * Customer objects are serialized as:
     * A 4-byte integer representing customerID
     * A 4-byte integer representing the length of customerName in bytes
     * N bytes representing customerName
     */
    @Override
    public byte[] serialize(String topic, Customer customer) {

        byte[] serializedName;
        int stringSize;

        if (customer == null)
            return null;
        else {
            if (customer.getCustomerName() != null) {
                serializedName = customer.getCustomerName().getBytes(StandardCharsets.UTF_8);
                stringSize = serializedName.length;
            } else {
                serializedName = new byte[0];
                stringSize = 0;
            }
        }
        ByteBuffer buffer = ByteBuffer.allocate(4 + 4 + stringSize);
        buffer.putInt(customer.getCustomID());
        buffer.putInt(stringSize);
        buffer.put(serializedName);

        return buffer.array();
    }

    @Override
    public void close() {

        //Nothing to close
    }
}
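
A hedged usage sketch of this serializer: the producer is configured with CustomerSerializer for values, and the topic name "customers" is a placeholder chosen only for illustration.

Properties props = new Properties();
props.put("bootstrap.servers", "hadoop102:9092,hadoop103:9092");
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
props.put("value.serializer", CustomerSerializer.class.getName()); //the serializer defined above

KafkaProducer<String, Customer> producer = new KafkaProducer<>(props);

Customer customer = new Customer();
customer.setCustomID(42);
customer.setCustomerName("Alice");

producer.send(new ProducerRecord<>("customers", "Alice", customer)); //"customers" is a hypothetical topic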

Serialization using Avro

Apache Avro is a programming language independent serialization format. Doug Cutting created this project for the purpose of sharing data files.

Avro data is defined by a language-independent schema. The schema is described in JSON, and the data is serialized into a binary format (or, optionally, JSON).

An interesting feature of Avro is that when the application responsible for writing messages switches to a new schema, the applications responsible for reading them can continue processing messages without any changes, which makes Avro a very good fit for Kafka.
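
A rough, hedged sketch of what producing Avro records can look like. It assumes Confluent's KafkaAvroSerializer and a Schema Registry, which are extra dependencies not covered in these notes; the registry URL, topic name, and field names are placeholders. The Schema, GenericRecord, and GenericData classes come from the Avro library (org.apache.avro).

Properties props = new Properties();
props.put("bootstrap.servers", "hadoop102:9092,hadoop103:9092");
props.put("key.serializer", "io.confluent.kafka.serializers.KafkaAvroSerializer");
props.put("value.serializer", "io.confluent.kafka.serializers.KafkaAvroSerializer");
props.put("schema.registry.url", "http://hadoop102:8081"); //placeholder registry address

//The schema is plain JSON; this one describes a record with an id and a name
String schemaString =
        "{\"namespace\": \"customerManagement.avro\"," +
        " \"type\": \"record\", \"name\": \"Customer\"," +
        " \"fields\": [" +
        "  {\"name\": \"id\", \"type\": \"int\"}," +
        "  {\"name\": \"name\", \"type\": \"string\"}" +
        " ]}";
Schema schema = new Schema.Parser().parse(schemaString);

KafkaProducer<String, GenericRecord> producer = new KafkaProducer<>(props);

GenericRecord customer = new GenericData.Record(schema);
customer.put("id", 1);
customer.put("name", "Alice");

producer.send(new ProducerRecord<>("customerContacts", customer)); //"customerContacts" is a placeholder topic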

Partitioning

The ProducerRecord object includes the target topic, a key, and a value. Kafka messages are key-value pairs; the key can be null, but most applications use keys. Keys serve two purposes: they carry additional information with the message, and they determine which partition the message is written to. Messages with the same key are assigned to the same partition, so if a process reads only a subset of a topic's partitions, it still receives every record for a given key.

If the key is null and the default partitioner is used, the record is sent to one of the topic's available partitions at random; a round-robin algorithm is used to balance messages across the partitions.

If the key is not null and the default partitioner is used, Kafka hashes the key and maps the message to a specific partition based on the hash value. This mapping uses all of the topic's partitions, not just the available ones, which means that if the target partition happens to be unavailable, the write fails; this rarely happens, though.

The mapping between keys and partitions stays stable only as long as the number of partitions in the topic does not change, so it is best to plan the partition count when creating the topic.
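
A small hedged sketch of the claim that equal keys map to the same partition, reusing the producer from the earlier snippets; the topic and key are placeholders.

try {
    RecordMetadata first = producer.send(new ProducerRecord<>("CustomerCountry", "Laurel", "USA")).get();
    RecordMetadata second = producer.send(new ProducerRecord<>("CustomerCountry", "Laurel", "France")).get();

    //With the default partitioner and an unchanged partition count, both records land in the same partition
    System.out.println(first.partition() == second.partition()); //true
} catch (Exception e) {
    e.printStackTrace();
}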

Custom partitioner demo

/**
 * @Author Juniors Lee
 * @Date 2021/11/16
 */
public class BananaPartitioner implements Partitioner {

    @Override
    public int partition(String topic, Object key, byte[] keyBytes, Object value, byte[] valueBytes, Cluster cluster) {

        List<PartitionInfo> partitions = cluster.partitionsForTopic(topic);
        int numPartitions = partitions.size();

        if ((keyBytes == null) || (!(key instanceof String)))
            throw new InvalidRecordException("We expect all messages to have customer name as key");

        if (((String) key).equals("Banana"))
            return numPartitions - 1; //Banana is always assigned to the last partition

        //Other records are hashed into the remaining partitions
        return Math.abs(Utils.murmur2(keyBytes)) % (numPartitions - 1);
    }

    @Override
    public void close() {

    }

    @Override
    public void configure(Map<String, ?> map) {

    }
}
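
To actually use this partitioner, it must be registered on the producer through the partitioner.class property. A minimal hedged sketch, reusing the kafkaProps object from the earlier configuration snippet:

kafkaProps.put("partitioner.class", BananaPartitioner.class.getName());
KafkaProducer<String, String> producer = new KafkaProducer<>(kafkaProps);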
