Spark Streaming reads the database data that Flume extracts into Kafka and saves it to HBase; Hive maps the HBase table for querying

Recently, the company has been working on real-time stream processing. The specific requirement is to import the relevant tables from relational databases (MySQL and Oracle) into HBase in real time, and to query the data through Hive tables mapped onto HBase. The cluster is a big data cluster built on CDH 6.3.1.

1, Configure Flume to extract data to Kafka

I previously wrote an article, Flume extracting data from Oracle to Kafka, which describes in detail how to configure the connection to an Oracle database. Here I take a MySQL database as an example; the configuration for the two is essentially the same apart from a few connection settings, so I won't repeat the explanation and you can refer to that article. The specific configuration is as follows:

#Declare source, channel, sink
a1.channels=c1
a1.sources=r1
a1.sinks=k1

#Declare source type
a1.sources.r1.type=org.keedio.flume.source.SQLSource

#Declare database links, user names, passwords, etc
a1.sources.r1.hibernate.connection.url=jdbc:mysql://<ip address>:3306/<database name>
a1.sources.r1.hibernate.connection.user=username
a1.sources.r1.hibernate.connection.password=password

#This parameter is important. It defaults to false; if left at false, transactions will not be committed automatically
a1.sources.r1.hibernate.connection.autocommit=true
#Declare the hibernate dialect of MySQL
a1.sources.r1.hibernate.dialect=org.hibernate.dialect.MySQL5Dialect
#Declare MySQL driver
a1.sources.r1.hibernate.connection.driver_class=com.mysql.jdbc.Driver

#Query interval in milliseconds
a1.sources.r1.run.query.delay=5000
#Declare the folder location where Flume status is saved
a1.sources.r1.status.file.path=/root/files/flume
#Declare the file name to save Flume status
a1.sources.r1.status.file.name=kafka_to_hbase
#Start querying from the beginning (initial incremental value 0)
a1.sources.r1.start.from=0
#Custom SQL statement; $@$ is replaced with the last incremental value recorded in the status file
a1.sources.r1.custom.query=select id, name, age, score from student where id > $@$

#Set batch parameters
a1.sources.r1.batch.size=1000
a1.sources.r1.max.rows=1000

#Set c3p0 connection pool parameters
a1.sources.r1.hibernate.connection.provider_class=org.hibernate.connection.C3P0ConnectionProvider
a1.sources.r1.hibernate.c3p0.min_size=1
a1.sources.r1.hibernate.c3p0.max_size=10

#Configure a search-and-replace interceptor to strip the double quotes that the SQL source puts around each field
a1.sources.r1.interceptors=i1
a1.sources.r1.interceptors.i1.type=search_replace
a1.sources.r1.interceptors.i1.searchPattern="
a1.sources.r1.interceptors.i1.replaceString=
 
#Set channel to memory mode
a1.channels.c1.type=memory
a1.channels.c1.capacity=10000
a1.channels.c1.transactionCapacity=10000
a1.channels.c1.byteCapacityBufferPercentage=20
a1.channels.c1.byteCapacity=800000
 
#Set the sink type to Kafka
a1.sinks.k1.type=org.apache.flume.sink.kafka.KafkaSink
a1.sinks.k1.topic=test6
a1.sinks.k1.brokerList=cdh-slave01:9092,cdh-slave02:9092,cdh-slave03:9092
a1.sinks.k1.requiredAcks=1
a1.sinks.k1.batchSize=20

#Connect source, channel, sink
a1.sinks.k1.channel=c1
a1.sources.r1.channels=c1
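
For context on the interceptor above: the SQL source emits each row as a comma-separated line with every field wrapped in double quotes, and the search_replace interceptor strips those quotes so the Spark job later on can simply split on commas. A hypothetical event for the student table (my illustration, not captured from a real run) would change like this:

"1","zs","20","90"     (raw event emitted by the SQL source)
1,zs,20,90             (event after the search_replace interceptor)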

As you can see from the custom query above, there is a table called student in MySQL, defined as follows:

mysql> show create table student;
+---------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Table   | Create Table                                                                                                                                                                                           |
+---------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| student | CREATE TABLE `student` (
  `id` varchar(11) DEFAULT NULL,
  `name` varchar(20) DEFAULT NULL,
  `age` varchar(20) DEFAULT NULL,
  `score` varchar(20) DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8 |
+---------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set (0.00 sec)

2, Spark Streaming consumes Kafka and saves the data to HBase

First, create an empty Maven project; I won't go into the details here. Next, configure the pom.xml. To match the cluster, you need to determine the versions of the relevant components, which you can check on the Cloudera Manager monitoring pages.

Once the version of each component has been determined, you can pull the corresponding dependencies from the Cloudera repository. My configuration is as follows:

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>com.xzw</groupId>
    <artifactId>spark</artifactId>
    <version>1.0</version>

    <repositories>
        <repository>
            <id>cloudera</id>
            <url>https://repository.cloudera.com/artifactory/cloudera-repos/</url>
        </repository>
    </repositories>

    <dependencies>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.11</artifactId>
            <version>2.4.0-cdh6.3.1</version>
        </dependency>

        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-streaming_2.11</artifactId>
            <version>2.4.0-cdh6.3.1</version>
        </dependency>

        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-streaming-kafka-0-10_2.11</artifactId>
            <version>2.4.0-cdh6.3.1</version>
        </dependency>

        <dependency>
            <groupId>org.apache.kafka</groupId>
            <artifactId>kafka-clients</artifactId>
            <version>2.2.1-cdh6.3.1</version>
        </dependency>

        <dependency>
            <groupId>org.apache.hbase</groupId>
            <artifactId>hbase-client</artifactId>
            <version>2.1.0-cdh6.3.1</version>
        </dependency>

        <dependency>
            <groupId>org.apache.hbase</groupId>
            <artifactId>hbase-server</artifactId>
            <version>2.1.0-cdh6.3.1</version>
        </dependency>

        <dependency>
            <groupId>org.apache.kafka</groupId>
            <artifactId>kafka_2.12</artifactId>
            <version>2.2.1-cdh6.3.1</version>
        </dependency>

    </dependencies>

</project>

Next comes the key part: writing the code that streams the data into HBase in real time. The code is as follows:

package com.xzw.sparkstreaming

import org.apache.hadoop.hbase.client.{Connection, ConnectionFactory, Put, Table}
import org.apache.hadoop.hbase.util.Bytes
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.internal.Logging
import org.apache.spark.streaming.kafka010._
import org.apache.spark.streaming.{Seconds, StreamingContext}

/**
 * @Description Read the data in Kafka and store it in HBase in real time
 * @Author wjl
 * @Date Created by 2020/6/10 9:39
 *                             _ooOoo_
 *                            o8888888o
 *                            88" . "88
 *                            (| -_- |)
 *                            O\  =  /O
 *                         ____/`---'\____
 *                       .'  \\|     |//  `.
 *                      /  \\|||  :  |||//  \
 *                     /  _||||| -:- |||||-  \
 *                     |   | \\\  -  /// |   |
 *                     | \_|  ''\---/''  |   |
 *                     \  .-\__  `-`  ___/-. /
 *                   ___`. .'  /--.--\  `. . __
 *                ."" '<  `.___\_<|>_/___.'  >'"".
 *               | | :  `- \`.;`\ _ /`;.`/ - ` : | |
 *               \  \ `-.   \_ __\ /__ _/   .-` /  /
 *          ======`-.____`-.___\_____/___.-`____.-'======
 *                             `=---='
 *          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 *                     Buddha bless never BUG
 */
object KafkaToHBase extends Logging {

  def main(args: Array[String]): Unit = {

    //Create streamingcontext
    val sparkConf = new SparkConf().setAppName("KafkaStreaming2Hbase").setMaster("local[*]")
    val ssc = new StreamingContext(sparkConf, Seconds(5))

    //Defining parameters
    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "cdh-slave01:9092,cdh-slave02:9092,cdh-slave03:9092",
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "use_a_separate_group_id_for_each_stream",
      "auto.offset.reset" -> "latest",
      "enable.auto.commit" -> (false: java.lang.Boolean)
    )

    //topics
    val topics = Array("test6")
    //Create stream
    val kafkaStream = KafkaUtils.createDirectStream(
      ssc,
      LocationStrategies.PreferConsistent,
      ConsumerStrategies.Subscribe[String, String](topics, kafkaParams)
    )

    kafkaStream.foreachRDD(rdd => {
      rdd.foreachPartition(partitions => {
        //Create the connection and table inside each partition, otherwise they would have to be serialized
        val conn = createConnection()
        val table: Table = conn.getTable(TableName.valueOf("ljs:student"))
        try {
          partitions.foreach(row => {
            val data: Array[String] = row.value().split(",")
            println("----------------" + row.value() + "===" + data(0))

            val put = new Put(Bytes.toBytes(data(0))) //rowkey
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes(data(1)))
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("age"), Bytes.toBytes(data(2)))
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("score"), Bytes.toBytes(data(3)))
            table.put(put) //Write the record to HBase
          })
        } catch {
          case e: Exception => logError("Write to HBase failed: " + e.getMessage, e)
        } finally {
          table.close() //Release the table and the connection once the partition has been processed
          conn.close()
        }
      })
    })

    //Start SparkStreaming
    ssc.start()
    ssc.awaitTermination()
  }

  /**
   * Create an HBase connection (called from each partition)
   *
   * @return
   */
  def createConnection(): Connection = {
    val hbaseConf = HBaseConfiguration.create()
    hbaseConf.set("hbase.zookeeper.quorum", "cdh-master,cdh-slave01,cdh-slave02")
    hbaseConf.set("hbase.zookeeper.property.clientPort", "2181")
    hbaseConf.set("hbase.defaults.for.version.skip", "true")
    ConnectionFactory.createConnection(hbaseConf)
  }

}
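
One thing the code above does not handle is offset management: enable.auto.commit is false, but the offsets are never committed manually either, so a restarted job simply picks up from the latest offsets. A minimal sketch of committing offsets back to Kafka after a batch has been written (my addition, using the standard spark-streaming-kafka-0-10 pattern, not code from the original project) would look like this:

import org.apache.spark.streaming.kafka010.{CanCommitOffsets, HasOffsetRanges, OffsetRange}

kafkaStream.foreachRDD(rdd => {
  //Capture the offset ranges on the driver before processing the partitions
  val offsetRanges: Array[OffsetRange] = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

  rdd.foreachPartition(partitions => {
    //...write the partition to HBase exactly as in KafkaToHBase above...
  })

  //Commit the offsets back to Kafka only after the writes have completed
  kafkaStream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
})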

3, Map the HBase table in Hive

1. First, create the table in HBase and insert a few rows of test data

--Create namespace
create_namespace 'ljs'
--Create table
create 'ljs:student','info'
--View table structure
desc 'ljs:student'
--insert data
put 'ljs:student','1','info:name','zs'
put 'ljs:student','1','info:age','20'
put 'ljs:student','1','info:score','90'
put 'ljs:student','2','info:name','ls'
put 'ljs:student','2','info:age','21'
put 'ljs:student','2','info:score','95'
put 'ljs:student','3','info:name','ww'
put 'ljs:student','3','info:age','22'
put 'ljs:student','3','info:score','99'
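
To double-check that the rows are in place, you can scan the table in the HBase shell:

scan 'ljs:student'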

2. Create a Hive external table mapped to the HBase table

--Create the Hive external table
create external table ljs_student(
user_no string,
user_name string,
user_age string,
user_score string
)
stored by 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' 
with serdeproperties ("hbase.columns.mapping" = ":key,info:name,info:age,info:score")
tblproperties("hbase.table.name" = "ljs:student");

3. Query the data in Hive to see if the mapping is successful

hive> select * from ljs_student;
OK
1	zs	20	90
2	ls	21	95
3	ww	22	99
Time taken: 0.079 seconds, Fetched: 3 row(s)

4, Testing

1. Create a new test6 topic.

kafka-topics --zookeeper cdh-master:2181,cdh-slave01:2181,cdh-slave02:2181 --create --replication-factor 1 --partitions 3 --topic test6

2. Start Flume

nohup bin/flume-ng agent --name a1 --conf ./conf --conf-file ./conf/kafka-to-hbase.properties -Dflume.root.logger=INFO,console &
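
Optionally, you can verify that Flume is actually delivering records to the topic before wiring up Spark, using the console consumer (assuming the same CDH wrapper scripts as the kafka-topics command above):

kafka-console-consumer --bootstrap-server cdh-slave01:9092 --topic test6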

3. Start Spark locally for testing

Insert a few rows into MySQL:

insert into student values('5000','test','24','24');
insert into student values('6000','test','23','84');

You should then see the inserted records printed in the Spark console.
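
Based on the output statement in KafkaToHBase, the console output for these two rows would look roughly like this (illustrative, not captured from a real run):

----------------5000,test,24,24===5000
----------------6000,test,23,84===6000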

4. Querying Hive again shows that the new data has been written to HBase in real time:

hive> select * from ljs_student;
OK
1	zs	20	90
2	ls	21	95
3	ww	22	99
5000	test	24	24
6000	test	23	84
Time taken: 0.133 seconds, Fetched: 5 row(s)

5. Once the local test succeeds, the code can be packaged and uploaded to the cluster. The submit command used on the cluster is:

spark-submit --class com.xzw.sparkstreaming.KafkaToHBase \
--master local[*] --num-executors 1 --driver-memory 1g \
--driver-cores 1 --executor-memory 1g --executor-cores 1 \
spark.jar
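
Note that the command above still uses --master local[*]. To actually run on YARN you would typically switch the master and deploy mode, and remove (or make configurable) the hard-coded .setMaster("local[*]") in the code. A generic example, not from the original setup, would be:

spark-submit --class com.xzw.sparkstreaming.KafkaToHBase \
--master yarn --deploy-mode cluster \
--num-executors 1 --driver-memory 1g \
--driver-cores 1 --executor-memory 1g --executor-cores 1 \
spark.jar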

Some readers may ask how to build the jar in IDEA. To be honest, it can be a headache. Let me describe two methods I often use. The first is to build the jar directly with Maven, which is probably what most people do; a possible plugin configuration is sketched below. The second is to build an artifact in IDEA, which is the method I have come to prefer recently; its steps follow the Maven sketch.
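
For the Maven route, one possible setup (my assumption; the original article does not show it) is to add the maven-shade-plugin to the pom.xml above, so that the dependencies are bundled and the main class is set in the manifest:

<build>
    <plugins>
        <!-- Hypothetical example: build a runnable fat jar with the main class set -->
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-shade-plugin</artifactId>
            <version>3.2.1</version>
            <executions>
                <execution>
                    <phase>package</phase>
                    <goals>
                        <goal>shade</goal>
                    </goals>
                    <configuration>
                        <transformers>
                            <transformer implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
                                <mainClass>com.xzw.sparkstreaming.KafkaToHBase</mainClass>
                            </transformer>
                        </transformers>
                    </configuration>
                </execution>
            </executions>
        </plugin>
    </plugins>
</build>

Running mvn clean package then produces the runnable jar under the target directory.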

(1) First, open File -> Project Structure.

(2) Then create a new artifact.

(3) Select the main class.

(4) Click Apply -> OK.

(5) Build the artifact.

The jar package then appears in the output directory in the project tree on the left.

 

That's the end of this article. Feel free to leave a comment and let me know what problems you run into along the way.
