Several ways of reading and writing HBase (2): Spark

1. Overview of ways to read and write HBase

There are four main approaches:

  1. Reading and writing HBase with the pure Java API;
  2. Reading and writing HBase from Spark;
  3. Reading and writing HBase from Flink;
  4. Reading and writing HBase through Phoenix.

The first is the native and relatively efficient mode of operation provided by HBase itself. The second and third are the ways Spark and Flink integrate with HBase. The last goes through JDBC, provided by the third-party plug-in Phoenix; the Phoenix JDBC route can also be called from Spark and Flink.

Note:

The code below was developed against HBase 2.1.2, Spark 2.4 and Scala 2.12.
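For orientation, a build definition along the following lines should pull in the classes used in this article. It is only a sketch and not the author's build file: the artifact list, the Scala patch version and the Spark patch version (2.4.0) are assumptions, and the Phoenix artifacts needed for section 2.3 are left out because their coordinates depend on the Phoenix release in use.

```scala
// build.sbt (sketch; patch versions are assumptions, adjust them to your cluster)
scalaVersion := "2.12.8"

val sparkVersion = "2.4.0" // assumed patch release of Spark 2.4
val hbaseVersion = "2.1.2" // HBase version stated in this article

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-sql"       % sparkVersion,
  "org.apache.hbase" %  "hbase-client"    % hbaseVersion,
  // In HBase 2.x, TableInputFormat, TableOutputFormat and HFileOutputFormat2 are in hbase-mapreduce
  "org.apache.hbase" %  "hbase-mapreduce" % hbaseVersion,
  // Listed explicitly for LoadIncrementalHFiles (org.apache.hadoop.hbase.tool)
  "org.apache.hbase" %  "hbase-server"    % hbaseVersion
)
```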

2. Reading and writing HBase from Spark

Reading and writing HBase from Spark can be done with either the old or the new Hadoop API; in addition, data can be bulk-inserted with BulkLoad, and HBase can be accessed through Phoenix.

2.1 Reading and writing HBase from Spark with the old and new APIs

2.1.1 Writing data to HBase from Spark

Save data to HBase using the old-API method saveAsHadoopDataset.

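// Imports used by the code in this article (package locations as of HBase 2.1.2)
import java.net.URI
import java.util.Base64

import scala.collection.mutable.ListBuffer

import org.apache.commons.codec.digest.DigestUtils
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.hbase.{HBaseConfiguration, HConstants, KeyValue, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Put, Result, Scan}
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapred.TableOutputFormat
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2
import org.apache.hadoop.hbase.protobuf.ProtobufUtil
import org.apache.hadoop.hbase.tool.LoadIncrementalHFiles
import org.apache.hadoop.hbase.util.Bytes
import org.apache.hadoop.mapred.JobConf
import org.apache.hadoop.mapreduce.Job
import org.apache.log4j.{Level, Logger}
import org.apache.spark.SparkContext
import org.apache.spark.sql.{SaveMode, SparkSession}

/**
 * Write with the old API: saveAsHadoopDataset
 */
def writeToHBase(): Unit ={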
  // Suppress unnecessary log output on the terminal
  Logger.getLogger("org.apache.spark").setLevel(Level.WARN)

  /* Pre-Spark 2.0 style:
  val conf = new SparkConf().setAppName("SparkToHBase").setMaster("local")
  val sc = new SparkContext(conf)
  */
  val sparkSession = SparkSession.builder().appName("SparkToHBase").master("local[4]").getOrCreate()
  val sc = sparkSession.sparkContext

  val tableName = "test"

  // Create the HBase configuration
  val hbaseConf = HBaseConfiguration.create()
  // ZooKeeper quorum. It can also be picked up by putting hbase-site.xml on the classpath,
  // but setting it in the program is recommended here.
  hbaseConf.set(HConstants.ZOOKEEPER_QUORUM, "192.168.187.201")
  // ZooKeeper client port, default 2181
  hbaseConf.set(HConstants.ZOOKEEPER_CLIENT_PORT, "2181")
  hbaseConf.set(TableOutputFormat.OUTPUT_TABLE, tableName)

  // Initialize the job and set the output format; this TableOutputFormat is the one in the org.apache.hadoop.hbase.mapred package
  val jobConf = new JobConf(hbaseConf)
  jobConf.setOutputFormat(classOf[TableOutputFormat])

  val dataRDD = sc.makeRDD(Array("12,jack,16", "11,Lucy,15", "15,mike,17", "13,Lily,14"))

  val data = dataRDD.map{ item =>
      val Array(key, name, age) = item.split(",")
      val rowKey = key.reverse
      val put = new Put(Bytes.toBytes(rowKey))
      /* A Put object represents one row; the row key is given in the constructor.
       * All inserted values must be converted with org.apache.hadoop.hbase.util.Bytes.toBytes.
       * Put.addColumn takes three arguments: column family, column qualifier, value. */
      put.addColumn(Bytes.toBytes("cf1"), Bytes.toBytes("name"), Bytes.toBytes(name))
      put.addColumn(Bytes.toBytes("cf1"), Bytes.toBytes("age"), Bytes.toBytes(age))
      (new ImmutableBytesWritable(), put)
  }
  // Save to the HBase table
  data.saveAsHadoopDataset(jobConf)
  sparkSession.stop()
}
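The function above assumes that the table test with column family cf1 already exists. The article does not show how it was created; one way to do it from code, sketched here with the HBase 2.x Admin API, is shown below. The object and method names are hypothetical, and the ZooKeeper address is simply the one used throughout this article.

```scala
import org.apache.hadoop.hbase.{HBaseConfiguration, HConstants, TableName}
import org.apache.hadoop.hbase.client.{ColumnFamilyDescriptorBuilder, ConnectionFactory, TableDescriptor, TableDescriptorBuilder}

object CreateTestTable {
  // Describes a table with a single column family; building it needs no cluster connection.
  def descriptor(table: String, family: String): TableDescriptor =
    TableDescriptorBuilder
      .newBuilder(TableName.valueOf(table))
      .setColumnFamily(ColumnFamilyDescriptorBuilder.of(family))
      .build()

  def main(args: Array[String]): Unit = {
    val conf = HBaseConfiguration.create()
    conf.set(HConstants.ZOOKEEPER_QUORUM, "192.168.187.201")
    conf.set(HConstants.ZOOKEEPER_CLIENT_PORT, "2181")

    val conn = ConnectionFactory.createConnection(conf)
    try {
      val admin = conn.getAdmin
      val desc = descriptor("test", "cf1")
      // Create the table only if it is not there yet
      if (!admin.tableExists(desc.getTableName)) admin.createTable(desc)
      admin.close()
    } finally conn.close()
  }
}
```

The same result can be had from the HBase shell; the code form is shown only because the rest of the article is Scala.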

Save data to HBase using the new-API method saveAsNewAPIHadoopDataset.

The contents of the a.txt file are:

100,hello,20
101,nice,24
102,beautiful,26

/**
 * saveAsNewAPIHadoopDataset
 */
 def writeToHBaseNewAPI(): Unit ={
   // Suppress unnecessary log output on the terminal
   Logger.getLogger("org.apache.spark").setLevel(Level.WARN)
   val sparkSession = SparkSession.builder().appName("SparkToHBase").master("local[4]").getOrCreate()
   val sc = sparkSession.sparkContext

   val tableName = "test"
   val hbaseConf = HBaseConfiguration.create()
   hbaseConf.set(HConstants.ZOOKEEPER_QUORUM, "192.168.187.201")
   hbaseConf.set(HConstants.ZOOKEEPER_CLIENT_PORT, "2181")
   hbaseConf.set(org.apache.hadoop.hbase.mapreduce.TableOutputFormat.OUTPUT_TABLE, tableName)

   val jobConf = new JobConf(hbaseConf)
   // Set the job's output key/value classes and output format
   val job = Job.getInstance(jobConf)
   job.setOutputKeyClass(classOf[ImmutableBytesWritable])
   job.setOutputValueClass(classOf[Put]) // the RDD values written below are Put objects
   job.setOutputFormatClass(classOf[org.apache.hadoop.hbase.mapreduce.TableOutputFormat[ImmutableBytesWritable]])

   val input = sc.textFile("v2120/a.txt")

   val data = input.map { item =>
     val Array(key, name, age) = item.split(",")
     val rowKey = key.reverse
     val put = new Put(Bytes.toBytes(rowKey))
     put.addColumn(Bytes.toBytes("cf1"), Bytes.toBytes("name"), Bytes.toBytes(name))
     put.addColumn(Bytes.toBytes("cf1"), Bytes.toBytes("age"), Bytes.toBytes(age))
     (new ImmutableBytesWritable, put)
   }
   // Save to the HBase table
   data.saveAsNewAPIHadoopDataset(job.getConfiguration)
   sparkSession.stop()
}
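Both write functions split each input line into key, name and age and use the reversed key as the row key. The article does not say why the key is reversed; a common reason is to keep sequential keys from piling up in one region, but that is an assumption here. The parsing step on its own, without Spark or HBase, can be sketched as follows (the object and method names are hypothetical):

```scala
object RowKeys {
  // Splits "key,name,age" and reverses the key, as both write functions do.
  def toRecord(line: String): (String, String, String) = {
    val Array(key, name, age) = line.split(",")
    (key.reverse, name, age)
  }

  def main(args: Array[String]): Unit =
    Seq("100,hello,20", "101,nice,24", "102,beautiful,26")
      .map(toRecord)
      .foreach(println) // prints (001,hello,20), (101,nice,24), (201,beautiful,26)
}
```

Note that a line with a missing field makes the Array pattern fail with a MatchError, so real input may need validation before this step.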

2.1.2 Reading data from HBase with Spark

Use newAPIHadoopRDD to read data from HBase; the data can be filtered with a Scan.

/**
 * scan
 */
 def readFromHBaseWithHBaseNewAPIScan(): Unit ={
   // Suppress unnecessary log output on the terminal
   Logger.getLogger("org.apache.spark").setLevel(Level.WARN)
   val sparkSession = SparkSession.builder().appName("SparkToHBase").master("local").getOrCreate()
   val sc = sparkSession.sparkContext

   val tableName = "test"
   val hbaseConf = HBaseConfiguration.create()
   hbaseConf.set(HConstants.ZOOKEEPER_QUORUM, "192.168.187.201")
   hbaseConf.set(HConstants.ZOOKEEPER_CLIENT_PORT, "2181")
   hbaseConf.set(org.apache.hadoop.hbase.mapreduce.TableInputFormat.INPUT_TABLE, tableName)

   val scan = new Scan()
   scan.addFamily(Bytes.toBytes("cf1"))
   val proto = ProtobufUtil.toScan(scan)
   val scanToString = new String(Base64.getEncoder.encode(proto.toByteArray))
   hbaseConf.set(org.apache.hadoop.hbase.mapreduce.TableInputFormat.SCAN, scanToString)

   // Read the data as an RDD; this TableInputFormat is the one in the org.apache.hadoop.hbase.mapreduce package
   val hbaseRDD = sc.newAPIHadoopRDD(hbaseConf, classOf[org.apache.hadoop.hbase.mapreduce.TableInputFormat], classOf[ImmutableBytesWritable], classOf[Result])

   val dataRDD = hbaseRDD
     .map(x => x._2)
     .map{result =>
       (result.getRow, result.getValue(Bytes.toBytes("cf1"), Bytes.toBytes("name")), result.getValue(Bytes.toBytes("cf1"), Bytes.toBytes("age")))
     }.map(row => (new String(row._1), new String(row._2), new String(row._3)))
     .collect()
     .foreach(r => (println("rowKey:"+r._1 + ", name:" + r._2 + ", age:" + r._3)))
}
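The Scan in the function above only selects a column family. As a sketch of narrowing it further, the helper below restricts the scan to a row-key range and serializes it with TableMapReduceUtil.convertScanToString, which does the same protobuf and Base64 encoding as the manual steps above. The object name, method name and the example row-key bounds are hypothetical; the ZooKeeper settings are omitted and would be set as in the function above.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Scan
import org.apache.hadoop.hbase.mapreduce.{TableInputFormat, TableMapReduceUtil}
import org.apache.hadoop.hbase.util.Bytes

object ScanConf {
  // Builds an input configuration whose scan covers rows in [startRow, stopRow) of one column family.
  def rangeScanConf(table: String, family: String, startRow: String, stopRow: String): Configuration = {
    val conf = HBaseConfiguration.create()
    conf.set(TableInputFormat.INPUT_TABLE, table)

    val scan = new Scan()
      .addFamily(Bytes.toBytes(family))
      .withStartRow(Bytes.toBytes(startRow)) // inclusive
      .withStopRow(Bytes.toBytes(stopRow))   // exclusive

    conf.set(TableInputFormat.SCAN, TableMapReduceUtil.convertScanToString(scan))
    conf
  }
}
```

The returned configuration can be passed to sc.newAPIHadoopRDD in place of hbaseConf, for example ScanConf.rangeScanConf("test", "cf1", "0", "2").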

2.2 Bulk-inserting data into HBase from Spark with BulkLoad

BulkLoad works by using MapReduce to generate the corresponding HFile files on HDFS and then importing those HFiles into HBase, which gives efficient bulk insertion of data. In the code below, a Spark job takes the place of MapReduce for generating the HFiles.

/**
 * Insert multiple columns in batch
 */
 def insertWithBulkLoadWithMulti(): Unit ={

   val sparkSession = SparkSession.builder().appName("insertWithBulkLoad").master("local[4]").getOrCreate()
   val sc = sparkSession.sparkContext

   val tableName = "test"
   val hbaseConf = HBaseConfiguration.create()
   hbaseConf.set(HConstants.ZOOKEEPER_QUORUM, "192.168.187.201")
   hbaseConf.set(HConstants.ZOOKEEPER_CLIENT_PORT, "2181")
   hbaseConf.set(TableOutputFormat.OUTPUT_TABLE, tableName)

   val conn = ConnectionFactory.createConnection(hbaseConf)
   val admin = conn.getAdmin
   val table = conn.getTable(TableName.valueOf(tableName))

   val job = Job.getInstance(hbaseConf)
   // Set the job's map output key/value classes and output format
   job.setMapOutputKeyClass(classOf[ImmutableBytesWritable])
   job.setMapOutputValueClass(classOf[KeyValue])
   job.setOutputFormatClass(classOf[HFileOutputFormat2])
   HFileOutputFormat2.configureIncrementalLoad(job, table, conn.getRegionLocator(TableName.valueOf(tableName)))

   val rdd = sc.textFile("v2120/a.txt")
     .map(_.split(","))
     .map(x => (DigestUtils.md5Hex(x(0)).substring(0, 3) + x(0), x(1), x(2)))
     .sortBy(_._1)
     .flatMap(x =>
       {
         val listBuffer = new ListBuffer[(ImmutableBytesWritable, KeyValue)]
         val kv1: KeyValue = new KeyValue(Bytes.toBytes(x._1), Bytes.toBytes("cf1"), Bytes.toBytes("name"), Bytes.toBytes(x._2 + ""))
         val kv2: KeyValue = new KeyValue(Bytes.toBytes(x._1), Bytes.toBytes("cf1"), Bytes.toBytes("age"), Bytes.toBytes(x._3 + ""))
         // Within a row, cells must be appended in lexicographic order of the column name: "age" before "name"
         listBuffer.append((new ImmutableBytesWritable, kv2))
         listBuffer.append((new ImmutableBytesWritable, kv1))
         listBuffer
       }
     )

   isFileExist("hdfs://node1:9000/test", sc)

   rdd.saveAsNewAPIHadoopFile("hdfs://node1:9000/test", classOf[ImmutableBytesWritable], classOf[KeyValue], classOf[HFileOutputFormat2], job.getConfiguration)
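   val bulkLoader = new LoadIncrementalHFiles(hbaseConf)
   bulkLoader.doBulkLoad(new Path("hdfs://node1:9000/test"), admin, table, conn.getRegionLocator(TableName.valueOf(tableName)))

   // Release the HBase connection and stop Spark
   table.close()
   admin.close()
   conn.close()
   sparkSession.stop()
}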

/**
 * Check whether the path exists on HDFS and delete it if it does
 */
def isFileExist(filePath: String, sc: SparkContext): Unit ={
  val output = new Path(filePath)
  val hdfs = FileSystem.get(new URI(filePath), new Configuration)
  if (hdfs.exists(output)){
    hdfs.delete(output, true)
  }
}
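Two details of the bulk-load function are easy to miss: the row key is prefixed with the first three hex digits of its MD5 hash, and the cells have to reach HFileOutputFormat2 sorted by row key and then by column name. The sketch below reproduces just that ordering logic in plain Scala so it can be checked without a cluster. The object and method names are hypothetical, and md5Hex is a stand-in for the DigestUtils.md5Hex call used above.

```scala
import java.nio.charset.StandardCharsets
import java.security.MessageDigest

object BulkLoadOrdering {
  // Hex MD5 of the input string; stands in for commons-codec's DigestUtils.md5Hex.
  def md5Hex(s: String): String =
    MessageDigest.getInstance("MD5")
      .digest(s.getBytes(StandardCharsets.UTF_8))
      .map(b => f"${b & 0xff}%02x")
      .mkString

  // Prefixes the key with the first three hex digits of its MD5 hash.
  def salted(key: String): String = md5Hex(key).substring(0, 3) + key

  // Emits (rowKey, qualifier, value) cells ordered by row key, then by qualifier.
  def cells(lines: Seq[String]): Seq[(String, String, String)] =
    lines
      .map(_.split(","))
      .flatMap(x => Seq((salted(x(0)), "name", x(1)), (salted(x(0)), "age", x(2))))
      .sortBy(c => (c._1, c._2))
}
```

String ordering in Scala matches HBase's byte-wise ordering only for ASCII keys such as these; for arbitrary keys the comparison would have to be done on the encoded bytes.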

2.3 Reading and writing HBase from Spark through Phoenix

With Phoenix, as with relational databases such as MySQL, the data is accessed through JDBC.

def readFromHBaseWithPhoenix: Unit ={
   // Suppress unnecessary log output on the terminal
   Logger.getLogger("org.apache.spark").setLevel(Level.WARN)

   val sparkSession = SparkSession.builder().appName("SparkHBaseDataFrame").master("local[4]").getOrCreate()

   // The table name is lowercase, so it must be wrapped in double quotes; otherwise an error is reported
   val dbTable = "\"test\""

   // First way to read from Phoenix into a DataFrame: the generic JDBC source
   val rdf = sparkSession.read
     .format("jdbc")
     .option("driver", "org.apache.phoenix.jdbc.PhoenixDriver")
     .option("url", "jdbc:phoenix:192.168.187.201:2181")
     .option("dbtable", dbTable)
     .load()

   val rdfList = rdf.collect()
   for (i <- rdfList){
     println(i.getString(0) + " " + i.getString(1) + " " + i.getString(2))
   }
   rdf.printSchema()

   // Second way to read from Phoenix into a DataFrame: the phoenix-spark connector
   val df = sparkSession.read
     .format("org.apache.phoenix.spark")
     .options(Map("table" -> dbTable, "zkUrl" -> "192.168.187.201:2181"))
     .load()
   df.printSchema()
   val dfList = df.collect()
   for (i <- dfList){
      println(i.getString(0) + " " + i.getString(1) + " " + i.getString(2))
   }
   // Writing a Spark DataFrame to Phoenix; the target table must be created first
   /*df.write
     .format("org.apache.phoenix.spark")
     .mode(SaveMode.Overwrite)
     .options(Map("table" -> "PHOENIXTESTCOPY", "zkUrl" -> "192.168.187.201:2181"))
     .save()
*/
   sparkSession.stop()
}
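The double quotes around the table name matter because Phoenix folds unquoted identifiers to upper case. The read code also presumes that Phoenix already knows about the HBase table test; the article does not show that step, so the view definition below is only an assumption about how such a mapping is usually declared (a Phoenix view over an existing HBase table with string columns). The object and method names are hypothetical.

```scala
object PhoenixNames {
  // Wraps an identifier in double quotes so Phoenix keeps its case.
  def quoted(name: String): String = "\"" + name + "\""

  // Builds DDL for a Phoenix view over an existing HBase table whose columns are strings in one family.
  def viewDdl(table: String, family: String, columns: Seq[String]): String = {
    val cols = "pk VARCHAR PRIMARY KEY" +:
      columns.map(c => s"${quoted(family)}.${quoted(c)} VARCHAR")
    s"CREATE VIEW ${quoted(table)} (${cols.mkString(", ")})"
  }

  def main(args: Array[String]): Unit =
    println(viewDdl("test", "cf1", Seq("name", "age")))
}
```

The generated statement would be run once through a Phoenix JDBC connection or sqlline before the Spark job reads the table.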

3. Summary

For reading and writing HBase with the pure Java API, see "Several ways of reading and writing HBase (1): Java".

For reading and writing HBase from Flink, see "Several ways of reading and writing HBase (3): Flink".

[github address]

https://github.com/SwordfallYeung/HBaseDemo


Posted on Sat, 09 May 2020 23:05:48 -0400 by israfel