Spark Streaming reads the database data that Flume extracts into Kafka and saves it to HBase; Hive maps the HBase table for querying

Recently, the company has been working on real-time stream processing. The specific requirement is to import the relevant tables from relational databases (MySQL, Oracle) into HBase in real time and query the data through Hive tables mapped onto HBase. The cluster is a big data cluster built with CDH 6.3.1.

1, Configure Flume to extract data to Kafka

I previously wrote an article, Flume extracting data from Oracle to Kafka, which describes in detail how to configure the connection to an Oracle database. Here I take MySQL as an example. The configuration for the two databases is essentially the same, only a few connection settings differ, so I won't repeat the explanation; refer to that article if needed. The specific configuration is as follows:

#Declare source, channel, sink
a1.channels=c1
a1.sources=r1
a1.sinks=k1
#Declare source type
a1.sources.r1.type=org.keedio.flume.source.SQLSource
#Declare database link, user name, password, etc.
a1.sources.r1.hibernate.connection.url=jdbc:mysql://<ip address>:3306/<database name>
a1.sources.r1.hibernate.connection.user=username
a1.sources.r1.hibernate.connection.password=password
#This parameter is very important. It defaults to false; if it stays false, changes will not be committed automatically
a1.sources.r1.hibernate.connection.autocommit=true
#Declare the Hibernate dialect for MySQL
a1.sources.r1.hibernate.dialect=org.hibernate.dialect.MySQL5Dialect
#Declare the MySQL driver
a1.sources.r1.hibernate.connection.driver_class=com.mysql.jdbc.Driver
#Query interval in milliseconds
a1.sources.r1.run.query.delay=5000
#Declare the folder where the Flume status file is saved
a1.sources.r1.status.file.path=/root/files/flume
#Declare the file name of the Flume status file
a1.sources.r1.status.file.name=kafka_to_hbase
#Start querying from the first row
a1.sources.r1.start.from=0
#Custom SQL statement
a1.sources.r1.custom.query=select id, name, age, score from student where id > $@$
#Set batch parameters
a1.sources.r1.batch.size=1000
a1.sources.r1.max.rows=1000
#Set c3p0 connection pool parameters
a1.sources.r1.hibernate.connection.provider_class=org.hibernate.connection.C3P0ConnectionProvider
a1.sources.r1.hibernate.c3p0.min_size=1
a1.sources.r1.hibernate.c3p0.max_size=10
#Configure interceptor (search and replace)
a1.sources.r1.interceptors=i1
a1.sources.r1.interceptors.i1.type=search_replace
a1.sources.r1.interceptors.i1.searchPattern="
a1.sources.r1.interceptors.i1.replaceString=
#Set the channel to memory mode
a1.channels.c1.type=memory
a1.channels.c1.capacity=10000
a1.channels.c1.transactionCapacity=10000
a1.channels.c1.byteCapacityBufferPercentage=20
a1.channels.c1.byteCapacity=800000
#Set the sink type to Kafka
a1.sinks.k1.type=org.apache.flume.sink.kafka.KafkaSink
a1.sinks.k1.topic=test6
a1.sinks.k1.brokerList=cdh-slave01:9092,cdh-slave02:9092,cdh-slave03:9092
a1.sinks.k1.requiredAcks=1
a1.sinks.k1.batchSize=20
#Connect the source and sink to the channel
a1.sinks.k1.channel=c1
a1.sources.r1.channels=c1

As you can see from the configuration above, there is a table called student in MySQL, defined as follows:

mysql> show create table student;
+---------+------------------------------------------------------------------+
| Table   | Create Table                                                       |
+---------+------------------------------------------------------------------+
| student | CREATE TABLE `student` (
  `id` varchar(11) DEFAULT NULL,
  `name` varchar(20) DEFAULT NULL,
  `age` varchar(20) DEFAULT NULL,
  `score` varchar(20) DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8 |
+---------+------------------------------------------------------------------+
1 row in set (0.00 sec)
2, Spark Streaming reads from Kafka and saves the data to HBase

First, create an empty Maven project; I won't go into the details here, as you should all be able to do that. Next, configure the pom.xml file. To match the cluster, you first need to determine the versions of the relevant cluster components, which can be checked in the cluster's monitoring interface.

You may ask where in the monitoring interface the version numbers of these components can be found, and how to see them all at once. In Cloudera Manager, for example, the detail page of any host has a Components list that shows the version of every component installed on that host.

Now that the version of each component has been determined, you can pull the corresponding dependencies from the Maven repository (here, the Cloudera repository). My configuration is as follows:

<?xml version="1.0" encoding="UTF-8"?> <project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd"> <modelVersion>4.0.0</modelVersion> <groupId>com.xzw</groupId> <artifactId>spark</artifactId> <version>1.0</version> <repositories> <repository> <id>cloudera</id> <url>https://repository.cloudera.com/artifactory/cloudera-repos/</url> </repository> </repositories> <dependencies> <dependency> <groupId>org.apache.spark</groupId> <artifactId>spark-core_2.11</artifactId> <version>2.4.0-cdh6.3.1</version> </dependency> <dependency> <groupId>org.apache.spark</groupId> <artifactId>spark-streaming_2.11</artifactId> <version>2.4.0-cdh6.3.1</version> </dependency> <dependency> <groupId>org.apache.spark</groupId> <artifactId>spark-streaming-kafka-0-10_2.11</artifactId> <version>2.4.0-cdh6.3.1</version> </dependency> <dependency> <groupId>org.apache.kafka</groupId> <artifactId>kafka-clients</artifactId> <version>2.2.1-cdh6.3.1</version> </dependency> <dependency> <groupId>org.apache.hbase</groupId> <artifactId>hbase-client</artifactId> <version>2.1.0-cdh6.3.1</version> </dependency> <dependency> <groupId>org.apache.hbase</groupId> <artifactId>hbase-server</artifactId> <version>2.1.0-cdh6.3.1</version> </dependency> <dependency> <groupId>org.apache.kafka</groupId> <artifactId>kafka_2.12</artifactId> <version>2.2.1-cdh6.3.1</version> </dependency> </dependencies> </project>

Next comes the most critical part: writing the code that streams the data into HBase in real time. The relevant code is as follows:

package com.xzw.sparkstreaming

import org.apache.hadoop.hbase.client._
import org.apache.hadoop.hbase.util.Bytes
import org.apache.hadoop.hbase._
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.internal.Logging
import org.apache.spark.streaming.kafka010._
import org.apache.spark.streaming._

import scala.util.Try

/**
 * @Description Read the data in Kafka and store it in HBase in real time
 * @Author wjl
 * @Date Created by 2020/6/10 9:39
 * Buddha bless, never BUG
 */
object KafkaToHBase extends Logging {
  def main(args: Array[String]): Unit = {
    //Create the StreamingContext
    val sparkConf = new SparkConf().setAppName("KafkaStreaming2Hbase").setMaster("local[*]")
    val ssc = new StreamingContext(sparkConf, Seconds(5))

    //Define the Kafka parameters
    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "cdh-slave01:9092,cdh-slave02:9092,cdh-slave03:9092",
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "use_a_separate_group_id_for_each_stream",
      "auto.offset.reset" -> "latest",
      "enable.auto.commit" -> (false: java.lang.Boolean)
    )

    //Topics
    val topics = Array("test6")

    //Create the direct stream
    val kafkaStream = KafkaUtils.createDirectStream(
      ssc,
      LocationStrategies.PreferConsistent,
      ConsumerStrategies.Subscribe[String, String](topics, kafkaParams)
    )

    kafkaStream.foreachRDD(rdd => {
      rdd.foreachPartition(partitions => {
        //Create the table inside each partition, otherwise a serialization exception is thrown
        val table = createTable()
        try {
          partitions.foreach(row => {
            val data: Array[String] = row.value().split(",")
            print("----------------" + row.value() + "===" + data(0))
            val put = new Put(Bytes.toBytes(data(0))) //rowkey
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes(data(1)))
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("age"), Bytes.toBytes(data(2)))
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("score"), Bytes.toBytes(data(3)))
            Try(table.put(put)).getOrElse(table.close()) //Write the data to HBase; close the table if the put fails
          })
        } catch {
          case e: Exception => logError("Write to HBase failed: " + e.getMessage)
        } finally {
          table.close() //Close the table after the partition's data has been written to HBase
        }
      })
    })

    //Start Spark Streaming
    ssc.start()
    ssc.awaitTermination()
  }

  /**
   * Helper method to get the HBase table
   *
   * @return
   */
  def createTable(): Table = {
    val hbaseConf = HBaseConfiguration.create()
    hbaseConf.set("hbase.zookeeper.quorum", "cdh-master,cdh-slave01,cdh-slave02")
    hbaseConf.set("hbase.zookeeper.property.clientPort", "2181")
    hbaseConf.set("hbase.defaults.for.version.skip", "true")
    val conn = ConnectionFactory.createConnection(hbaseConf)
    conn.getTable(TableName.valueOf("ljs:student"))
  }
}
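One thing the code above never does is commit Kafka offsets: enable.auto.commit is set to false and no manual commit is issued, so after a restart the job simply resumes from "latest". If you need the consumer group's progress to survive restarts, a minimal sketch (my own addition, not part of the original code) using the commitAsync API of the 0-10 Kafka integration would look roughly like this:

import org.apache.spark.streaming.kafka010.{CanCommitOffsets, HasOffsetRanges}

kafkaStream.foreachRDD { rdd =>
  // Capture this batch's offset ranges before any shuffle or repartition
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

  rdd.foreachPartition { partitions =>
    // ... write the partition to HBase as shown above ...
  }

  // Commit the consumed offsets back to Kafka after the writes (asynchronous, at-least-once)
  kafkaStream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
}

Whether at-least-once delivery is acceptable here is an assumption on my part; since the HBase puts are keyed by id, replaying a batch after a failure simply overwrites the same rows.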
3, Hive maps HBase

1. First, create the corresponding table in HBase and insert a few rows of data

--Create the namespace
create_namespace 'ljs'
--Create the table
create 'ljs:student','info'
--View the table structure
desc 'ljs:student'
--Insert data
put 'ljs:student','1','info:name','zs'
put 'ljs:student','1','info:age','20'
put 'ljs:student','1','info:score','90'
put 'ljs:student','2','info:name','ls'
put 'ljs:student','2','info:age','21'
put 'ljs:student','2','info:score','95'
put 'ljs:student','3','info:name','ww'
put 'ljs:student','3','info:age','22'
put 'ljs:student','3','info:score','99'

2. Create a new Hive external table and map it to the HBase table

--Create the Hive external table
create external table ljs_student(
  user_no string,
  user_name string,
  user_age string,
  user_score string
)
stored by 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
with serdeproperties ("hbase.columns.mapping" = ":key,info:name,info:age,info:score")
tblproperties("hbase.table.name" = "ljs:student");

3. Query the data in Hive to see if the mapping is successful

hive> select * from ljs_student;
OK
1       zs      20      90
2       ls      21      95
3       ww      22      99
Time taken: 0.079 seconds, Fetched: 3 row(s)
4, Testing

1. Create a new test6 topic.

kafka-topics --zookeeper cdh-master:2181,cdh-slave01:2181,cdh-slave02:2181 --create --replication-factor 1 --partitions 3 --topic test6

2. Start Flume

nohup bin/flume-ng agent --name a1 --conf ./conf --conf-file ./conf/kafka-to-hbase.properties -Dflume.root.logger=INFO,console &

3. Start Spark locally for testing

Insert several pieces of data into MySQL:

insert into student values('5000','test','24','24');
insert into student values('6000','test','23','84');

You should see the new records printed to the console by the print statement in the code.

4. Query Hive to confirm that the new data has been added to HBase in real time

hive> select * from ljs_student;
OK
1       zs      20      90
2       ls      21      95
3       ww      22      99
5000    test    24      24
6000    test    23      84
Time taken: 0.133 seconds, Fetched: 5 row(s)

5. After the local test succeeds, the code can be packaged and uploaded to the cluster. The command to start it on the cluster is:

spark-submit --class com.xzw.sparkstreaming.KafkaToHBase \
  --master local[*] --num-executors 1 --driver-memory 1g \
  --driver-cores 1 --executor-memory 1g --executor-cores 1 \
  spark.jar
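One caveat about this command: the code hard-codes setMaster("local[*]"), so even when submitted on the cluster the job runs in local mode. If you later want to submit it with --master yarn, a small tweak you could make (my suggestion, not something the original code does) is to set a local master only when spark-submit has not already supplied one:

// Fall back to local[*] only when no master was passed in, so the same jar
// works both for local IDE testing and for spark-submit --master yarn
val sparkConf = new SparkConf().setAppName("KafkaStreaming2Hbase")
if (!sparkConf.contains("spark.master")) {
  sparkConf.setMaster("local[*]")
}
val ssc = new StreamingContext(sparkConf, Seconds(5))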

Someone may ask how to build the jar package in IDEA. To be honest, it can be a bit of a headache. Let me describe the two methods I use most often. One is to build the jar directly with Maven, which is probably what most people do. The other is the one described below, which I have been favoring recently: building an artifact in IDEA.

(1) First, open File -> Project Structure.

(2) In the Artifacts section, add a new JAR artifact built from modules with dependencies.

(3) Select the main class.

(4) Click Apply -> OK.

(5) Build the artifact (Build -> Build Artifacts).

After the build finishes, you can see the jar package in the output directory in the project tree on the left.

This is the end of the article. Feel free to leave a comment and let me know what problems you have run into along the way.
