Kafka data source generator

1, Preparations

  • Data source: download it from the Alibaba Cloud Tianchi public dataset, or from GitHub

  • Create the topic: user_behavior

    $ bin/kafka-topics.sh --create --bootstrap-server localhost:9092 --replication-factor 1 --partitions 1 --topic user_behavior
    WARNING: Due to limitations in metric names, topics with a period ('.') or underscore ('_') could collide. To avoid issues it is best to use either, but not both.
    
    $ bin/kafka-topics.sh --list --bootstrap-server localhost:9092
    test
    user_behavior
    

2, Generator code

Reference: SourceGenerator

Java code: MockSourceGenerator

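package cn.rumoss.study.flink;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;

public class MockSourceGenerator {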
    private static final long SPEED = 10; // Default: 10 records per second
    public static void main(String[] args) {
        long speed = SPEED;
        if (args.length > 0) {
            speed = Long.parseLong(args[0]);
        }
        long delay = 1000_000 / speed; // Interval between records, in microseconds (converted to nanoseconds below)

        // Read the dataset above line by line; each line is one behavior record
        try (InputStream inputStream = MockSourceGenerator.class.getClassLoader().getResourceAsStream("user_behavior.log")) {
            BufferedReader reader = new BufferedReader(new InputStreamReader(inputStream));
            long start = System.nanoTime();
            String line;
            while ((line = reader.readLine()) != null) { // readLine() returns null at end of stream
                System.out.println(line);

                long end = System.nanoTime();
                long diff = end - start;
                while (diff < (delay*1000)) {
                    Thread.sleep(1);
                    end = System.nanoTime();
                    diff = end - start;
                }
                start = end;
            }
            reader.close();
        } catch (IOException e) {
            throw new RuntimeException(e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // Restore the interrupt flag and exit
        }
    }

}
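
The generator above sleeps in 1 ms steps and resets its reference time after every record, so any overshoot in one interval is carried into the total and the real rate can end up a little below the requested one. As a rough sketch of an alternative, the pacing could be based on fixed deadlines instead. `RatePacer` and its methods are hypothetical names introduced here for illustration; they are not part of the original code.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical helper for illustration; not part of the original MockSourceGenerator.
class RatePacer {
    private final long intervalNanos; // time budget per record, in nanoseconds
    private long nextDeadline;        // System.nanoTime() value at which the next record is due

    RatePacer(long recordsPerSecond) {
        if (recordsPerSecond <= 0) {
            throw new IllegalArgumentException("recordsPerSecond must be positive");
        }
        this.intervalNanos = TimeUnit.SECONDS.toNanos(1) / recordsPerSecond;
        this.nextDeadline = System.nanoTime() + intervalNanos;
    }

    long intervalNanos() {
        return intervalNanos;
    }

    // Blocks until the next record is due. Returns false if the thread was interrupted.
    boolean awaitNext() {
        try {
            long remaining;
            while ((remaining = nextDeadline - System.nanoTime()) > 0) {
                TimeUnit.NANOSECONDS.sleep(remaining);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // keep the interrupt visible to the caller
            return false;
        }
        // Advance from the previous deadline, not from "now", so overshoot does not accumulate.
        nextDeadline += intervalNanos;
        return true;
    }

    public static void main(String[] args) {
        RatePacer pacer = new RatePacer(5); // roughly five records per second
        for (int i = 0; i < 3 && pacer.awaitNext(); i++) {
            System.out.println("record " + i);
        }
    }
}
```

In the read loop, a call to `pacer.awaitNext()` before each `System.out.println(line)` would take the place of the inner `while (diff < ...)` loop.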

Compile and package the project, then test it from the command line. The argument specifies how many records are output per second:

$ java -cp target/java-flink-1.0-SNAPSHOT.jar cn.rumoss.study.flink.MockSourceGenerator 1
{"user_id": "543462", "item_id":"1715", "category_id": "1464116", "behavior": "pv", "ts": "2017-11-26T01:00:00Z"}
{"user_id": "662867", "item_id":"2244074", "category_id": "1575622", "behavior": "pv", "ts": "2017-11-26T01:00:00Z"}
{"user_id": "561558", "item_id":"3611281", "category_id": "965809", "behavior": "pv", "ts": "2017-11-26T01:00:00Z"}
...
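
The post does not show how user_behavior.log was produced. If you start from a raw CSV export instead, a conversion step along the following lines could generate the same JSON lines. This is only a sketch under assumptions: it supposes five comma-separated columns (user ID, item ID, category ID, behavior, Unix timestamp in seconds), and both the class name `UserBehaviorFormatter` and the input row in `main` are made up for illustration.

```java
import java.time.Instant;

// Hypothetical converter for illustration; the assumed CSV layout is
// user_id,item_id,category_id,behavior,epoch_seconds.
class UserBehaviorFormatter {

    // Turns one CSV row into a JSON line shaped like the generator output shown above.
    static String toJson(String csvLine) {
        String[] f = csvLine.trim().split(",");
        if (f.length != 5) {
            throw new IllegalArgumentException("Expected 5 fields: " + csvLine);
        }
        // Instant.toString() renders ISO-8601 in UTC, e.g. 2017-11-26T01:00:00Z
        String ts = Instant.ofEpochSecond(Long.parseLong(f[4])).toString();
        return String.format(
                "{\"user_id\": \"%s\", \"item_id\":\"%s\", \"category_id\": \"%s\", \"behavior\": \"%s\", \"ts\": \"%s\"}",
                f[0], f[1], f[2], f[3], ts);
    }

    public static void main(String[] args) {
        // Illustrative row: the epoch value was chosen to map to the timestamp in the sample output.
        System.out.println(toJson("543462,1715,1464116,pv,1511658000"));
    }
}
```

Run as is, it prints a line identical to the first record in the sample output above.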

3, Using a pipe to send the data into Kafka

Copy the JAR package built above to the Kafka root directory:

  • Produce data to the topic:

    $ java -cp java-flink-1.0-SNAPSHOT.jar cn.rumoss.study.flink.MockSourceGenerator 1 | bin/kafka-console-producer.sh --broker-list localhost:9092 --topic user_behavior
    >>>...
    
  • Consume from the topic; the records arrive one after another:

    $ bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic user_behavior --from-beginning
    {"user_id": "543462", "item_id":"1715", "category_id": "1464116", "behavior": "pv", "ts": "2017-11-26T01:00:00Z"}
    {"user_id": "662867", "item_id":"2244074", "category_id": "1575622", "behavior": "pv", "ts": "2017-11-26T01:00:00Z"}
    {"user_id": "561558", "item_id":"3611281", "category_id": "965809", "behavior": "pv", "ts": "2017-11-26T01:00:00Z"}
    ...
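
One thing to keep in mind with the pipe approach: the generator only writes to standard output, and `PrintStream` swallows I/O errors instead of throwing them, so the loop will not notice by itself if the process on the other end of the pipe goes away. A minimal sketch of how that could be detected with `checkError()` follows. `PipeAwareWriter` and the simulated failing stream are hypothetical and exist only to make the example self-contained.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.io.PrintStream;
import java.util.Arrays;

// Hypothetical helper for illustration; not part of the original generator.
class PipeAwareWriter {

    // Writes lines until the stream reports an error; returns how many lines went through.
    static int writeAll(PrintStream out, Iterable<String> lines) {
        int written = 0;
        for (String line : lines) {
            out.println(line);
            if (out.checkError()) { // flushes, then reports any I/O error PrintStream has swallowed
                break;
            }
            written++;
        }
        return written;
    }

    // Simulates a pipe whose reader disappears after maxLines complete lines.
    static OutputStream closingAfterLines(int maxLines) {
        return new OutputStream() {
            private int newlines = 0;

            @Override
            public void write(int b) throws IOException {
                if (newlines >= maxLines) {
                    throw new IOException("Broken pipe (simulated)");
                }
                if (b == '\n') {
                    newlines++;
                }
            }
        };
    }

    public static void main(String[] args) {
        PrintStream pipe = new PrintStream(closingAfterLines(2));
        int written = writeAll(pipe, Arrays.asList("a", "b", "c", "d"));
        System.out.println(written); // prints 2
    }
}
```

In the generator itself, the same idea would amount to checking `System.out.checkError()` after each `println` and leaving the loop when it returns true.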
    

Posted on Mon, 20 Jan 2020 11:59:57 -0500 by mattison