How to integrate Hive and HBase

Version Description:

HDP: 3.0.1.0

Hive: 3.1.0

HBase: 2.0.0

I. Preface

When first learning HBase, a common doubt arises: although HBase can store hundreds of millions or even billions of rows, it is not well suited to data analysis. It only provides fast key-based lookups and cannot handle complex conditional queries.

Integrating Hive with HBase addresses this. Not only that, Hive can also be used to batch-import data into HBase.

<!--more-->

The integration of Hive and HBase is implemented through their external APIs, which communicate with each other. The actual work of reading HBase data is delegated to the hive-hbase-handler-xxx.jar utility class in Hive's lib directory.

II. Applicable Scenarios

Applicable scenarios of Hive and HBase integration:

1. Through the Hive-HBase integration, HBase data can be analyzed with Hive, giving HBase support for SQL syntax such as JOIN and GROUP BY.

2. Batch data can be imported into HBase tables.

III. Dependencies

The following dependencies are required; Ambari has already set all of this up for us:

  • HDFS, MapReduce, Hive, ZooKeeper, and HBase environments are available.
  • hive-hbase-handler-xxx.jar, the ZooKeeper jar, the HBase server jar, and the HBase client jar are present in Hive's lib directory.

IV. Using the Hive-HBase Integration

Note that this differs from HDP 2.x: in Hive 3.1.0 on HDP 3.0, all tables backed by a storage handler must be marked EXTERNAL; there are no longer non-external storage-handler tables. If the corresponding HBase table already exists when the Hive table is created, the table mimics the HDP 2.x semantics of an external table. If the HBase table does not exist when the Hive table is created, it mimics the HDP 2.x semantics of a non-external table.

Summary: in Hive, an external table should always be used to associate with an HBase table, whether or not the HBase table already exists. If the associated HBase table does not exist, Hive will create it automatically.

V. Example

1. The HBase table does not exist

CREATE EXTERNAL TABLE hive_table (key int, value string) 
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf1:val")
TBLPROPERTIES ("hbase.table.name" = "default:hbase_table");

Here is a brief description of the table-creation parameters:

  • hbase.columns.mapping is required and is validated against the column families of the HBase table.
  • The hbase.table.name property is optional; by default, the HBase table name is the same as the Hive table name.

At this point both the Hive table and the HBase table are empty. Let's prepare some data:

insert into hive_table (key, value) values(1, "www.ymq.io");

The INSERT statement triggers a map task, as shown in the following figure:

After the task is completed, data exists in both Hive and HBase tables.

# Hive table data
+-----------------+-------------------+
| hive_table.key  | hive_table.value  |
+-----------------+-------------------+
| 1               | www.ymq.io        |
+-----------------+-------------------+
# HBase table data
hbase(main):002:0> scan 'hbase_table'
ROW                                    COLUMN+CELL                                                                                                  
 1                                     column=cf1:val, timestamp=1558710260266, value=www.ymq.io                                                    
1 row(s)
Took 0.2400 seconds

When the Hive table is dropped, the corresponding HBase table is unaffected and still contains its data. Conversely, if the HBase table is deleted and the Hive table is then queried, an error is reported: `Error: java.io.IOException: org.apache.hadoop.hbase.TableNotFoundException: hbase_table (state=,code=0)`. This is expected behavior.
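For instance, the drop behavior can be reproduced with the following statements (a sketch; the HBase-side commands are shown as comments because they run in the HBase shell, not in Hive):

```sql
-- In Hive: dropping the external table removes only the Hive definition;
-- 'hbase_table' in HBase keeps all of its data.
DROP TABLE hive_table;

-- In the HBase shell (not HiveQL), a table must be disabled before dropping:
--   disable 'hbase_table'
--   drop 'hbase_table'
-- After that, querying the still-defined Hive table raises TableNotFoundException.
```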

Note: in the example above we used the INSERT command to insert data into the Hive table. For bulk inserts, the LOAD command is normally recommended, but LOAD is not supported for Hive external tables backed by a storage handler. Instead, first create a Hive internal table, LOAD the data into it, and finally INSERT the queried data from the internal table into the external table associated with HBase, using the internal table as a staging step.
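The staging approach described above can be sketched as follows; the staging table name, file path, and delimiter are assumptions for illustration:

```sql
-- 1. Create a plain Hive internal table as a staging area.
CREATE TABLE hive_table_staging (key int, value string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

-- 2. Bulk-load the raw file into the staging table (LOAD works here).
LOAD DATA INPATH '/tmp/hive_table_data.csv' INTO TABLE hive_table_staging;

-- 3. Copy everything into the HBase-backed external table.
INSERT INTO TABLE hive_table SELECT key, value FROM hive_table_staging;
```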

2. The HBase table already exists

Create HBase table:

create 'default:people', {NAME=>'basic_info'}, {NAME=>'other_info'}

Insert some data:

put 'people', '00017','basic_info:name','tom'
put 'people', '00017','basic_info:age','17'
put 'people', '00017','basic_info:sex','man'
put 'people', '00017','other_info:telPhone','176xxxxxxxx'
put 'people', '00017','other_info:country','China'

put 'people', '00023','basic_info:name','mary'
put 'people', '00023','basic_info:age','23'
put 'people', '00023','basic_info:sex','woman'
put 'people', '00023','basic_info:edu','college'
put 'people', '00023','other_info:email','cdsvo@163.com'
put 'people', '00023','other_info:country','Japan'

put 'people', '00018','basic_info:name','sam'
put 'people', '00018','basic_info:age','18'
put 'people', '00018','basic_info:sex','man'
put 'people', '00018','basic_info:edu','middle'
put 'people', '00018','other_info:telPhone','132xxxxxxxx'
put 'people', '00018','other_info:country','America'

put 'people', '00026','basic_info:name','Sariel'
put 'people', '00026','basic_info:age','26'
put 'people', '00026','basic_info:edu','college'
put 'people', '00026','other_info:telPhone','178xxxxxxxx'
put 'people', '00026','other_info:email','12345@126.com'
put 'people', '00026','other_info:country','China'

Create a simple Hive external table with the same syntax as before:

create external table people (
  id int,
  name string,
  age string,
  sex string,
  edu string,
  country string,
  telPhone string,
  email string
)
stored by 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
with serdeproperties ("hbase.columns.mapping" =
":key,basic_info:name,basic_info:age,basic_info:sex,basic_info:edu,other_info:country,other_info:telPhone,other_info:email")
tblproperties ("hbase.table.name" = "default:people");

Query all data:

select * from people;

Condition query:

# Query by gender
select * from people where sex = 'man';
# Query by age
select * from people where age > 18;

In this way, we can use Hive to analyze the data in HBase.
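Because the table now behaves like any other Hive table, the JOIN and GROUP BY support mentioned earlier also applies. A small sketch (this particular aggregation is an illustration, not part of the original walkthrough):

```sql
-- Count people and compute the average age per country,
-- casting age because it is mapped as a string column.
SELECT country,
       count(*) AS num_people,
       avg(cast(age AS int)) AS avg_age
FROM people
GROUP BY country;
```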

VI. Summary

  • The hive-hbase-handler-xxx.jar package is what associates Hive with HBase.
  • Hive reads the latest version of the data in the HBase table.
  • An HBase table created through Hive keeps only one VERSION by default; the maximum number of versions can be changed afterwards.
  • Hive only displays the column values that exist in HBase; cells missing in HBase show as NULL in the Hive table.
  • After the Hive table is associated with the HBase table, data can be inserted from either Hive or HBase.
  • Associating a Hive external table with HBase and importing Hive data into it works through the two systems' external APIs. When the data volume is small (below 4 TB), this method is a reasonable choice for importing data.
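For example, the single-VERSION default mentioned above can be raised later in the HBase shell (the value 3, and the family name cf1 from the first example, are illustrative assumptions):

```
alter 'hbase_table', {NAME => 'cf1', VERSIONS => 3}
```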

Follow along so you don't get lost

Well, that's all for this article. Thanks for reading to the end.

Creating content is not easy, so your support and recognition are the biggest motivation for my writing. See you in the next article!

If there are any mistakes in this post, please point them out in the comments. Thank you very much!

This article comes from the WeChat official account [big data actual combat drill]. Follow it to read more articles.

Tags: Big Data HBase hive Apache Hadoop

Posted on Fri, 31 Jan 2020 14:23:36 -0500 by buck2bcr