Implementing a Kafka data production and consumption demo in Java

Introduction to Kafka

Kafka is a high-throughput, distributed publish-subscribe messaging system that can handle all the activity-stream data of a consumer-scale website.
Kafka has the following characteristics:

  • Message persistence with O(1) time complexity, guaranteeing constant-time access performance even for terabytes of data.
  • High throughput: even on cheap commodity machines, a single node can handle more than 100K messages per second.
  • Supports partitioning of messages across Kafka servers and distributed consumption, while guaranteeing message order within each partition.
  • Supports both offline and real-time data processing.
  • Scale out: supports online horizontal scaling.

Kafka terminology

  • Broker: a Kafka cluster contains one or more servers, which are called brokers.
  • Topic: every message published to a Kafka cluster has a category called a topic. (Physically, messages of different topics are stored separately. Logically, although the messages of one topic are stored on one or more brokers, users only need to specify the message's topic to produce or consume data, without caring where the data is stored.)
  • Partition: a physical concept; each topic contains one or more partitions.
  • Producer: responsible for publishing messages to a Kafka broker.
  • Consumer: a message consumer; a client that reads messages from a Kafka broker.
  • Consumer Group: each consumer belongs to a specific consumer group (a group name can be specified for each consumer; if none is specified, the consumer belongs to the default group).

Kafka core APIs

Kafka has four core APIs:

  • The producer API lets an application publish messages to one or more topics.
  • The consumer API lets an application subscribe to one or more topics and process the resulting messages.
  • The streams API lets an application act as a stream processor, consuming an input stream from one or more topics and producing an output stream to one or more topics, effectively transforming input streams into output streams.
  • The connector API lets you build and run reusable producers or consumers that connect topics to existing applications or data systems.

(Diagram of the four APIs omitted; see the official Kafka documentation for the illustration.)
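As a quick taste of the streams API, here is a minimal sketch of my own (not part of the original demo; the application id and topic names are placeholders) that copies one topic into another using the StreamsBuilder API from kafka-streams 1.0.0. StreamsConfig, StreamsBuilder and KafkaStreams come from org.apache.kafka.streams; Serdes from org.apache.kafka.common.serialization:

    Properties props = new Properties();
    props.put(StreamsConfig.APPLICATION_ID_CONFIG, "demo-stream");
    props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "master:9092");
    props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
    props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

    StreamsBuilder builder = new StreamsBuilder();
    // Read every record from "input-topic" and write it unchanged to "output-topic".
    builder.<String, String>stream("input-topic").to("output-topic");

    KafkaStreams streams = new KafkaStreams(builder.build(), props);
    streams.start();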

Kafka application scenarios

  • Building real-time streaming data pipelines that reliably move data between systems or applications.
  • Building real-time streaming applications that transform or react to streams of data.

The introduction above is based on the official Kafka documentation.

Development preparation

So what do we need to do to develop a Kafka program?
First, after setting up the Kafka environment, we should decide whether our program is a producer or a consumer, that is, the sender or the receiver of messages.
In this article, however, we will develop and explain both the producer and the consumer.

Now that we have a general understanding of Kafka, let's develop our first program.
The development language used here is Java and the build tool is Maven.
The Maven dependencies are as follows:

    <dependency>
        <groupId>org.apache.kafka</groupId>
        <artifactId>kafka_2.12</artifactId>
        <version>1.0.0</version>
        <scope>provided</scope>
    </dependency>

    <dependency>
        <groupId>org.apache.kafka</groupId>
        <artifactId>kafka-clients</artifactId>
        <version>1.0.0</version>
    </dependency>

    <dependency>
        <groupId>org.apache.kafka</groupId>
        <artifactId>kafka-streams</artifactId>
        <version>1.0.0</version>
    </dependency>
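Strictly speaking, the kafka-clients artifact alone is enough for the producer and consumer in this demo; kafka_2.12 (the broker jar) and kafka-streams are only needed if you also use server-side classes or the streams API.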

Kafka Producer

Before writing the producer, let's briefly go over the main Kafka producer configuration options:

  • bootstrap.servers: the address of the Kafka cluster.
  • acks: the message acknowledgment mechanism. The default value is 1.
    acks=0: the producer does not wait for any acknowledgment from Kafka.
    acks=1: the leader writes the message to its local log but does not wait for the other machines in the cluster to acknowledge it.
    acks=all: the leader waits until all in-sync followers have replicated the message. This ensures that messages are not lost unless every machine in the Kafka cluster fails; it is the strongest durability guarantee.
  • retries: if set to a value greater than 0, the client resends a message when the send fails.
  • batch.size: when multiple messages are bound for the same partition, the producer batches them into fewer network requests, which improves the efficiency of both the client and the server.
  • key.serializer: the serializer class for keys, e.g. org.apache.kafka.common.serialization.StringSerializer.
  • value.serializer: the serializer class for values, e.g. org.apache.kafka.common.serialization.StringSerializer.

There are many more configuration options; see the official documentation for the rest. Our Kafka producer configuration is then as follows:
        Properties props = new Properties();
        props.put("bootstrap.servers", "master:9092,slave1:9092,slave2:9092");
        props.put("acks", "all");
        props.put("retries", 0);
        props.put("batch.size", 16384);
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());
        KafkaProducer<String, String> producer = new KafkaProducer<String, String>(props);

With the Kafka configuration in place, we can start producing data. Sending a record takes just one line:

producer.send(new ProducerRecord<String, String>(topic,key,value));
  • topic: the name of the message queue. You can create it on the Kafka server beforehand; if the topic does not exist in Kafka and the broker has topic auto-creation enabled (the default), it will be created automatically. (A command sketch follows this list.)
  • key: the key associated with the value, similar to the key of a Map entry.
  • value: the data to send, as a String.
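If you would rather create the topic up front, as mentioned above, a command sketch for a Kafka 1.0.0 server looks roughly like this; the partition and replication-factor values are placeholders for illustration:

    bin/kafka-topics.sh --create --zookeeper master:2181 --replication-factor 1 --partitions 3 --topic KAFKA_TEST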

With the producer program written, let's run it and produce some data.
The message sent here is:

String messageStr = "Hello, this is message number " + messageNo;

The producer exits after sending only 1,000 messages. From the console output (not reproduced here), you can see that the messages were printed and sent successfully.
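Note that send() is asynchronous and returns a Future<RecordMetadata>. If you want the program itself to confirm each send, one option, sketched here as my own illustration rather than part of the original demo, is to pass a Callback (both Callback and RecordMetadata live in org.apache.kafka.clients.producer):

    producer.send(new ProducerRecord<String, String>(topic, key, value),
            new Callback() {
                @Override
                public void onCompletion(RecordMetadata metadata, Exception exception) {
                    if (exception != null) {
                        // The send failed even after any configured retries.
                        exception.printStackTrace();
                    } else {
                        System.out.println("Sent to partition " + metadata.partition()
                                + " at offset " + metadata.offset());
                    }
                }
            });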
Alternatively, if you do not want to rely on the program itself to verify that the messages were sent, you can check delivery by running a command on the Kafka server.
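For example, the console consumer that ships with Kafka can replay the topic from the beginning (the broker address here matches the configuration used above):

    bin/kafka-console-consumer.sh --bootstrap-server master:9092 --topic KAFKA_TEST --from-beginning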

Kafka Consumer

Kafka consumption deserves the most attention; after all, most of the time we are mainly consuming and using data.

The configuration of kafka consumption is as follows:

  • bootstrap.servers: the address of the Kafka cluster.
  • group.id: the consumer group name. Different group names can consume the same data independently. For example, if you consumed 1,000 Kafka records under group name A and want to consume those 1,000 records again without re-producing them, you only need to change the group name.
  • enable.auto.commit: whether offsets are committed automatically. The default value is true. (You can also disable this and commit manually; see the sketch after this list.)
  • auto.commit.interval.ms: the interval, in milliseconds, at which consumed offsets are automatically committed.
  • session.timeout.ms: the session timeout used to detect consumer failures.
  • max.poll.records: the maximum number of records returned by a single poll.
  • auto.offset.reset: where to start consuming when there is no committed offset. The default is latest.
    earliest: when a partition has a committed offset, consume from that offset; when there is no committed offset, consume from the beginning.
    latest: when a partition has a committed offset, consume from that offset; when there is no committed offset, consume only data newly produced to the partition.
    none: when every partition of the topic has a committed offset, consume from after those offsets; if any partition lacks a committed offset, throw an exception.
  • key.deserializer: the deserializer class for keys, e.g. org.apache.kafka.common.serialization.StringDeserializer.
  • value.deserializer: the deserializer class for values, e.g. org.apache.kafka.common.serialization.StringDeserializer.
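As mentioned above, here is a minimal sketch of manual offset committing (my own illustration, not part of the original demo, which keeps auto-commit on). commitSync() is a real KafkaConsumer method; the processing step is a placeholder:

        props.put("enable.auto.commit", "false");
        // ... build the consumer and subscribe as usual ...
        ConsumerRecords<String, String> records = consumer.poll(1000);
        for (ConsumerRecord<String, String> record : records) {
            // Process the record here; commit only after the batch is handled.
        }
        consumer.commitSync(); // commits the offsets returned by the last poll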

Our Kafka consumer configuration is then as follows:

        Properties props = new Properties();
        props.put("bootstrap.servers", "master:9092,slave1:9092,slave2:9092");
        props.put("group.id", GROUPID);
        props.put("enable.auto.commit", "true");
        props.put("auto.commit.interval.ms", "1000");
        props.put("session.timeout.ms", "30000");
        props.put("max.poll.records", 1000);
        props.put("auto.offset.reset", "earliest");
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());
        KafkaConsumer<String, String> consumer = new KafkaConsumer<String, String>(props);

Since auto-commit is enabled here, the consumption code is straightforward.

We first need to subscribe to a topic, that is, specify which topic to consume:

consumer.subscribe(Arrays.asList(topic));

After subscribing, we pull data from Kafka; poll(1000) blocks for up to 1,000 ms waiting for records:

ConsumerRecords<String, String> msgList=consumer.poll(1000);
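(A side note: in Kafka clients 2.0 and later, poll(long) is deprecated in favor of poll(java.time.Duration), so this call needs a small change if you upgrade beyond the 1.0.0 dependency used here.)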

Generally speaking, a consumer runs as a continuous listener. Here we use for(;;) to keep polling, and exit after consuming 1,000 records.

Running the consumer, you can see from the console output (not reproduced here) that the data produced earlier is consumed successfully.

Code

The full code for the producer and the consumer follows.

Producer:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

/**
 * Title: KafkaProducerTest
 * Description: Kafka producer demo
 * Version: 1.0.0
 */
public class KafkaProducerTest implements Runnable {

    private final KafkaProducer<String, String> producer;
    private final String topic;
    public KafkaProducerTest(String topicName) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "master:9092,slave1:9092,slave2:9092");
        props.put("acks", "all");
        props.put("retries", 0);
        props.put("batch.size", 16384);
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());
        this.producer = new KafkaProducer<String, String>(props);
        this.topic = topicName;
    }

    @Override
    public void run() {
        int messageNo = 1;
        try {
            for(;;) {
                String messageStr = "Hello, this is message number " + messageNo;
                producer.send(new ProducerRecord<String, String>(topic, "Message", messageStr));
                // Print every 100 messages produced
                if(messageNo%100==0){
                    System.out.println("Message sent: " + messageStr);
                }
                // Exit after 1,000 messages have been produced
                if(messageNo%1000==0){
                    System.out.println("Successfully sent " + messageNo + " messages");
                    break;
                }
                messageNo++;
            }
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            producer.close();
        }
    }
    
    public static void main(String args[]) {
        KafkaProducerTest test = new KafkaProducerTest("KAFKA_TEST");
        Thread thread = new Thread(test);
        thread.start();
    }
}

Consumer:

import java.util.Arrays;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;


/**
 * Title: KafkaConsumerTest
 * Description: Kafka consumer demo
 * Version: 1.0.0
 */
public class KafkaConsumerTest implements Runnable {

    private final KafkaConsumer<String, String> consumer;
    private ConsumerRecords<String, String> msgList;
    private final String topic;
    private static final String GROUPID = "groupA";

    public KafkaConsumerTest(String topicName) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "master:9092,slave1:9092,slave2:9092");
        props.put("group.id", GROUPID);
        props.put("enable.auto.commit", "true");
        props.put("auto.commit.interval.ms", "1000");
        props.put("session.timeout.ms", "30000");
        props.put("auto.offset.reset", "earliest");
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());
        this.consumer = new KafkaConsumer<String, String>(props);
        this.topic = topicName;
        this.consumer.subscribe(Arrays.asList(topic));
    }

    @Override
    public void run() {
        int messageNo = 1;
        System.out.println("---------Start consumption---------");
        try {
            // Label the outer loop so we can exit it from inside the record loop;
            // a plain break would only leave the inner for-each and keep polling forever.
            outer:
            for (;;) {
                msgList = consumer.poll(1000);
                if (null != msgList && msgList.count() > 0) {
                    for (ConsumerRecord<String, String> record : msgList) {
                        // Print every 100 records consumed (the printed records
                        // may not land exactly on multiples of 100).
                        if (messageNo % 100 == 0) {
                            System.out.println(messageNo + "=======receive: key = " + record.key() + ", value = " + record.value() + " offset===" + record.offset());
                        }
                        // Exit both loops after consuming 1,000 records
                        if (messageNo % 1000 == 0) {
                            break outer;
                        }
                        messageNo++;
                    }
                } else {
                    Thread.sleep(1000);
                }
            }
        } catch (InterruptedException e) {
            e.printStackTrace();
        } finally {
            consumer.close();
        }
    }  
    public static void main(String args[]) {
        KafkaConsumerTest test1 = new KafkaConsumerTest("KAFKA_TEST");
        Thread thread1 = new Thread(test1);
        thread1.start();
    }
}

Note: master, slave1, and slave2 appear because I mapped those hostnames in my own environment; you can replace them with your servers' IP addresses.
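For example, a hosts-file mapping along these lines, where the IP addresses are placeholders:

    192.168.1.101  master
    192.168.1.102  slave1
    192.168.1.103  slave2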

I have also put the project on GitHub; if you are interested, have a look.

Summary

Developing a simple Kafka program requires the following steps:

  1. Set up and successfully start the Kafka server.
  2. Obtain the Kafka service information and configure it in the code.
  3. After configuring, listen for messages on the Kafka message queue.
  4. Apply your business logic to the received data.
