Spark Streaming reads the database data that Flume extracts into Kafka and saves it in HBase; Hive maps the HBase table for querying

Recently, the company has been working on real-time stream processing. The specific requirement is: import the relevant tables from relational databases (MySQL, Oracle) into HBase in real time, and use Hive to map the HBase tables for querying. The company uses a big data cluster built on CDH 6.3.1.

1. Configure Flume to extract data into Kafka

Earlier I wrote an article, Flume extracting data from Oracle to Kafka, which describes in detail how to configure the connection to an Oracle database. Here we take a MySQL database as an example; the configuration for the two is essentially the same, only a few connection strings differ, so I won't repeat the details — refer to that article. The full configuration is as follows (the concrete values are examples; adjust them for your environment):

#Declare source, channel, sink
a1.sources = r1
a1.channels = c1
a1.sinks = k1

#Declare source type
a1.sources.r1.type = org.keedio.flume.source.SQLSource

#Declare database link, user name, password, etc. (example values)
a1.sources.r1.hibernate.connection.url = jdbc:mysql://ip:3306/dbname
a1.sources.r1.hibernate.connection.user = root
a1.sources.r1.hibernate.connection.password = password

#This parameter is very important. It defaults to false; if it stays false, changes are not committed automatically
a1.sources.r1.hibernate.connection.autocommit = true
#Declare the hibernate dialect of MySQL
a1.sources.r1.hibernate.dialect = org.hibernate.dialect.MySQL5Dialect
#Declare the MySQL driver
a1.sources.r1.hibernate.connection.driver_class = com.mysql.jdbc.Driver

#Query interval in milliseconds
a1.sources.r1.run.query.delay = 5000
#Declare the folder where the Flume status is saved
a1.sources.r1.status.file.path = /var/log/flume-ng
#Declare the file name used to save the Flume status
a1.sources.r1.status.file.name = sqlSource.status
#Start querying from the first row
a1.sources.r1.start.from = 0
#Custom sql statement ($@$ is replaced with the last value read, taken from the status file)
a1.sources.r1.custom.query=select id, name, age, score from student where id > $@$

#Set batch parameters
a1.sources.r1.batch.size = 1000
a1.sources.r1.max.rows = 1000

#Set c3p0 connection pool parameters
a1.sources.r1.hibernate.connection.provider_class = org.hibernate.connection.C3P0ConnectionProvider
a1.sources.r1.hibernate.c3p0.min_size = 1
a1.sources.r1.hibernate.c3p0.max_size = 10

#Configure a search_replace interceptor (strips the double quotes the source wraps around each field)
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = search_replace
a1.sources.r1.interceptors.i1.searchPattern = "
a1.sources.r1.interceptors.i1.replaceString =

#Set the channel to memory mode
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000
a1.channels.c1.transactionCapacity = 10000

#Set the sink type to Kafka
a1.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink
a1.sinks.k1.kafka.topic = test6
a1.sinks.k1.kafka.bootstrap.servers = cdh-slave01:9092,cdh-slave02:9092,cdh-slave03:9092

#Connect source, channel, sink
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
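The key to the source's incremental behavior is the status file: flume-ng-sql-source records the last value it read from the first queried column and substitutes it for the `$@$` placeholder on the next poll. A minimal, cluster-free sketch of that loop (purely illustrative; the in-memory "result set" below stands in for the real table):

```scala
// Illustrative only: simulates how flume-ng-sql-source replaces the $@$
// placeholder with the last value recorded in its status file.
val queryTemplate = "select id, name, age, score from student where id > $@$"

// A pretend result set: (id, name) rows already in the table
val rows = Seq(("1", "zs"), ("2", "ls"), ("3", "ww"))

// Substitute the stored offset into the query, as the source does on each poll
def renderQuery(lastIndex: String): String =
  queryTemplate.replace("$@$", lastIndex)

// Rows the next poll would pick up, given the stored offset
def poll(lastIndex: String): Seq[(String, String)] =
  rows.filter { case (id, _) => id > lastIndex }

println(renderQuery("2")) // select id, name, age, score from student where id > 2
println(poll("2"))        // only ("3", "ww") is new
```

Because only rows with a strictly greater id are returned, already-delivered rows are never re-sent after a restart, as long as the status file is preserved.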

From the configuration above, you can see that MySQL has a table named student, as follows:

mysql> show create table student;
| Table   | Create Table |
| student | CREATE TABLE `student` (
  `id` varchar(11) DEFAULT NULL,
  `name` varchar(20) DEFAULT NULL,
  `age` varchar(20) DEFAULT NULL,
  `score` varchar(20) DEFAULT NULL
) |
1 row in set (0.00 sec)

2. Spark Streaming reads from Kafka and saves the data to HBase

First, create an empty Maven project; I won't go into detail here, as this should be familiar. Next, configure the pom.xml. To match the cluster, you need to determine the versions of the relevant components first. The monitoring interface shows them as follows:

You may ask where in the monitoring interface to find the version numbers of these components, and how to reach them in one step. Here is a brief walkthrough:

Now that the version of each component has been determined, you can pull the corresponding dependencies from the central repository. My configuration is as follows (trimmed to the essentials; versions must match your cluster):

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <groupId>com.xzw</groupId>
    <artifactId>sparkstreaming</artifactId>
    <version>1.0-SNAPSHOT</version>

    <!-- Versions below match the Apache releases shipped in CDH 6.3.1; adjust to your cluster -->
    <dependencies>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-streaming_2.11</artifactId>
            <version>2.4.0</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-streaming-kafka-0-10_2.11</artifactId>
            <version>2.4.0</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hbase</groupId>
            <artifactId>hbase-client</artifactId>
            <version>2.1.0</version>
        </dependency>
    </dependencies>
</project>

Next comes the most critical part: the code that streams the data into HBase in real time. The relevant code is as follows:

package com.xzw.sparkstreaming

import org.apache.hadoop.hbase.client.{ConnectionFactory, Put, Table}
import org.apache.hadoop.hbase.util.Bytes
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.internal.Logging
import org.apache.spark.streaming.kafka010._
import org.apache.spark.streaming.{Seconds, StreamingContext}

import scala.util.Try

/**
 * @Description Read the data in Kafka and store it in HBase in real time
 * @Author wjl
 * @Date Created by 2020/6/10 9:39
 *                             _ooOoo_
 *                            o8888888o
 *                            88" . "88
 *                            (| -_- |)
 *                            O\  =  /O
 *                         ____/`---'\____
 *                       .'  \\|     |//  `.
 *                      /  \\|||  :  |||//  \
 *                     /  _||||| -:- |||||-  \
 *                     |   | \\\  -  /// |   |
 *                     | \_|  ''\---/''  |   |
 *                     \  .-\__  `-`  ___/-. /
 *                   ___`. .'  /--.--\  `. . __
 *                ."" '<  `.___\_<|>_/___.'  >'"".
 *               | | :  `- \`.;`\ _ /`;.`/ - ` : | |
 *               \  \ `-.   \_ __\ /__ _/   .-` /  /
 *          ======`-.____`-.___\_____/___.-`____.-'======
 *                             `=---='
 *          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 *                     Buddha bless never BUG
 */
object KafkaToHBase extends Logging {

  def main(args: Array[String]): Unit = {

    //Create the StreamingContext with a 5-second batch interval
    val sparkConf = new SparkConf().setAppName("KafkaStreaming2Hbase").setMaster("local[*]")
    val ssc = new StreamingContext(sparkConf, Seconds(5))

    //Define the Kafka consumer parameters
    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "cdh-slave01:9092,cdh-slave02:9092,cdh-slave03:9092",
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "use_a_separate_group_id_for_each_stream",
      "auto.offset.reset" -> "latest",
      "enable.auto.commit" -> (false: java.lang.Boolean)
    )

    val topics = Array("test6")
    //Create the direct stream
    val kafkaStream = KafkaUtils.createDirectStream[String, String](
      ssc,
      LocationStrategies.PreferConsistent,
      ConsumerStrategies.Subscribe[String, String](topics, kafkaParams)
    )

    kafkaStream.foreachRDD(rdd => {
      rdd.foreachPartition(partition => {
        //Create the table handle inside each partition, otherwise there will be a serialization exception
        val table = createTable()
        try {
          partition.foreach(row => {
            val data: Array[String] = row.value().split(",")
            print("----------------" + row.value() + "===" + data(0))

            val put = new Put(Bytes.toBytes(data(0))) //rowkey
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes(data(1)))
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("age"), Bytes.toBytes(data(2)))
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("score"), Bytes.toBytes(data(3)))
            Try(table.put(put)).getOrElse(table.close()) //Write the row to HBase; close the table if the write fails
          })
          table.close() //Close the table after the partition's data has been written to HBase
        } catch {
          case e: Exception => logError("Write to HBase failed: " + e.getMessage)
        }
      })
    })

    //Start SparkStreaming
    ssc.start()
    ssc.awaitTermination()
  }

  /**
   * User-defined method to get the HBase table handle
   * @return the Table for ljs:student
   */
  def createTable(): Table = {
    val hbaseConf = HBaseConfiguration.create()
    hbaseConf.set("hbase.zookeeper.quorum", "cdh-master,cdh-slave01,cdh-slave02")
    hbaseConf.set("hbase.zookeeper.property.clientPort", "2181")
    hbaseConf.set("hbase.defaults.for.version.skip", "true")
    val conn = ConnectionFactory.createConnection(hbaseConf)
    conn.getTable(TableName.valueOf("ljs:student"))
  }
}

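Since the Put-building logic above is hard to exercise without a running cluster, it helps to isolate the parsing step so it can be checked on its own. A minimal sketch in plain Scala (the `Cell` case class is ours for illustration, not an HBase type):

```scala
// Each Kafka record value is the CSV line emitted by the Flume SQL source,
// e.g. "5000,test,24,24" -> rowkey "5000" plus three columns in family "info".
case class Cell(family: String, qualifier: String, value: String)

// Split the record into the rowkey (first field) and the three info: columns,
// mirroring how the streaming job builds its Put.
def toCells(record: String): (String, Seq[Cell]) = {
  val data = record.split(",")
  val rowkey = data(0)
  val cells = Seq(
    Cell("info", "name",  data(1)),
    Cell("info", "age",   data(2)),
    Cell("info", "score", data(3))
  )
  (rowkey, cells)
}

val (rowkey, cells) = toCells("5000,test,24,24")
println(rowkey) // 5000
```

Note that the first field of the CSV doubles as the HBase rowkey, which is why the MySQL id column must be unique: two rows with the same id would overwrite each other in HBase.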

3. Hive maps the HBase table

1. First, create related tables in HBase and insert several pieces of data

--Create namespace
create_namespace 'ljs'
--Create table
create 'ljs:student','info'
--View table structure
desc 'ljs:student'
--insert data
put 'ljs:student','1','info:name','zs'
put 'ljs:student','1','info:age','20'
put 'ljs:student','1','info:score','90'
put 'ljs:student','2','info:name','ls'
put 'ljs:student','2','info:age','21'
put 'ljs:student','2','info:score','95'
put 'ljs:student','3','info:name','ww'
put 'ljs:student','3','info:age','22'
put 'ljs:student','3','info:score','99'

2. Create a new Hive external table and map HBase table

--Create the Hive external table
create external table ljs_student(
user_no string,
user_name string,
user_age string,
user_score string)
stored by 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
with serdeproperties ("hbase.columns.mapping" = ":key,info:name,info:age,info:score")
tblproperties("" = "ljs:student");
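The entries in `hbase.columns.mapping` must line up one-for-one, in order, with the Hive column list; `:key` maps the HBase rowkey. A tiny sketch of that pairing (plain Scala, illustrative only):

```scala
// Pair each Hive column with its hbase.columns.mapping entry, in declaration
// order; ":key" means the column is served by the HBase rowkey.
val hiveColumns = Seq("user_no", "user_name", "user_age", "user_score")
val mapping = ":key,info:name,info:age,info:score".split(",").toSeq

val pairs = hiveColumns.zip(mapping)
pairs.foreach { case (hive, hbase) => println(s"$hive <- $hbase") }
// user_no is served by the rowkey; the rest come from the info column family
```

Getting this order wrong does not fail the DDL; it silently serves the wrong HBase column for a Hive column, so it's worth double-checking.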

3. Query the data in Hive to see if the mapping is successful

hive> select * from ljs_student;
1	zs	20	90
2	ls	21	95
3	ww	22	99
Time taken: 0.079 seconds, Fetched: 3 row(s)

4. Testing

1. Create a new test6 topic.

kafka-topics --zookeeper cdh-master:2181,cdh-slave01:2181,cdh-slave02:2181 --create --replication-factor 1 --partitions 3 --topic test6

2. Start Flume

nohup bin/flume-ng agent --name a1 --conf ./conf --conf-file ./conf/ -Dflume.root.logger=INFO,console &

3. Start Spark locally for testing

Insert several pieces of data into MySQL:

insert into student values('5000','test','24','24');
insert into student values('6000','test','23','84');

You can see that the console has the following output:

4. By querying Hive, we can find that the new data has been added to HBase in real time

hive> select * from ljs_student;
1	zs	20	90
2	ls	21	95
3	ww	22	99
5000	test	24	24
6000	test	23	84
Time taken: 0.133 seconds, Fetched: 5 row(s)

5. After the local test succeeds, package the code and upload it to the cluster. The start command on the cluster is:

spark-submit --class com.xzw.sparkstreaming.KafkaToHBase \
--master local[*] --num-executors 1 --driver-memory 1g \
--driver-cores 1 --executor-memory 1g --executor-cores 1 \

Someone may ask how to build the jar package in IDEA. To be honest, it can be a headache. There are two commonly used methods: one is to build the jar directly with Maven, which is probably what most people do. The other packaging method, which I have come to prefer recently, is described below.

(1) First, File --> Project Structure.

(2) Then, under Artifacts, click + and choose JAR --> From modules with dependencies.

(3) Select the main class.

(4) Click Apply --> OK, then build the artifact.

After building, you can see the jar package in the directory on the left.


This is the end of the article. Please leave a comment and let me know what problems you encountered along the way~

Tags: HBase Apache Spark kafka

Posted on Wed, 10 Jun 2020 00:55:16 -0400 by jcleary