Hive basics and usage I

Chapter 1 Hive basic concepts

1.1 what is Hive

Hive was open-sourced by Facebook to solve the problem of statistical analysis over massive amounts of structured log data.

Hive is a data warehouse tool built on top of Hadoop. It maps structured data files to tables and provides SQL-like query functionality.

In essence, Hive converts HQL into MapReduce programs (a small illustration follows the list below).

1) The data Hive processes is stored in HDFS

2) Hive's default implementation for analyzing data is MapReduce

3) The execution jobs run on Yarn
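
As a rough illustration of the HQL-to-MapReduce translation, a grouped aggregation such as the following is compiled into a map stage that emits the group key as the shuffle key and a reduce stage that performs the count:

hive> -- emp is a hypothetical table used only for illustration
hive> select deptno, count(*) from emp group by deptno;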

1.2 advantages and disadvantages of hive

1.2.1 advantages

1) The operation interface uses SQL-like syntax, enabling rapid development (simple and easy to use).

2) It avoids having to write MapReduce directly, reducing developers' learning costs.

3) Hive's execution latency is high, so it is typically used for data analysis scenarios that do not require real-time results.

4) Hive is good at processing big data but not small data, again because of its high execution latency.

5) Hive supports user-defined functions (UDFs), so users can implement functions tailored to their own needs (see the sketch below).
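
For example, once a custom function has been packaged into a jar, it can be registered and called from the Hive CLI. A minimal sketch — the jar path, class name, and function name here are hypothetical:

hive> -- jar path, class name, and function name are hypothetical
hive> add jar /opt/module/hive/lib/my_udf.jar;
hive> create temporary function my_lower as 'com.example.hive.MyLower';
hive> select my_lower(name) from student;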

1.2.2 disadvantages

1) Hive's HQL has limited expressiveness
(1) Iterative algorithms cannot be expressed

(2) It is not good at data mining

2) Hive is relatively inefficient
(1) The MapReduce jobs Hive generates automatically are usually not smart enough

(2) Hive tuning is difficult and coarse-grained

1.3 Hive Architecture Principle


1) User interface: Client
CLI (hive shell), JDBC/ODBC(java access hive), WEBUI (browser access hive)

2) Metadata: Metastore
Metadata includes: table names, the database each table belongs to (default is the default database), table owners, column/partition fields, table type (whether it is an external table), the directory where the table's data is stored, etc.

By default this is stored in the embedded Derby database; MySQL is recommended for storing the Metastore.

3)Hadoop
Use HDFS for storage and MapReduce for calculation.

4) Driver
(1) Parser (SQL Parser): converts the SQL string into an abstract syntax tree (AST); this step is usually done with a third-party tool library such as antlr. The AST is then analyzed, e.g. whether the table exists, whether the fields exist, and whether the SQL semantics are wrong.

(2) Compiler (Physical Plan): compiles the AST into a logical execution plan.

(3) Query Optimizer: optimizes the logical execution plan.

(4) Executor (Execution): converts the logical execution plan into a physical plan that can run; for Hive, that means MR/Spark jobs.


Through the interactive interfaces it provides, Hive receives the user's SQL, uses its own Driver together with the metadata in the MetaStore to translate the SQL into MapReduce jobs, submits them to Hadoop for execution, and finally returns the results to the user's interface.
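
You can inspect the plan the Driver produces for any statement with explain; for example (using the student table created in Chapter 2):

hive> explain select count(*) from student;

The output lists the stages of the optimized plan, typically a map-reduce stage followed by a fetch stage.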

1.4 Hive and database comparison

Because Hive uses HQL (Hive Query Language), a query language similar to SQL, it is easy to mistake Hive for a database. In fact, structurally Hive and databases have nothing in common apart from their similar query languages. This section explains the differences between Hive and databases from several angles. Databases can serve online applications, whereas Hive is designed for data warehousing; knowing this helps in understanding Hive's characteristics from an application perspective.

1.4.1 query language

Because SQL is widely used in data warehousing, HQL, an SQL-like query language, was designed around Hive's characteristics. Developers familiar with SQL can easily use Hive for development.

1.4.2 data storage location

Hive is built on Hadoop, and all Hive data is stored in HDFS. A database, by contrast, can keep its data on block devices or in the local file system.

1.4.3 data update

Hive is designed for data warehouse applications, whose content is read often and written rarely. Rewriting data is therefore not recommended in Hive; all data is determined at load time. Data in a database, on the other hand, usually needs frequent modification, so you can use INSERT INTO ... VALUES to add rows and UPDATE ... SET to modify them.
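
For instance, rather than updating rows in place, a Hive user typically rewrites a table's contents with a query. A sketch, using the student table created in Chapter 2 (the predicate is arbitrary):

hive> -- the predicate is arbitrary; overwrite replaces the table's contents
hive> insert overwrite table student select id, name from student where id < 2000;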

1.4.4 index

Hive does not process, or even scan, the data while loading it, and therefore builds no indexes on keys in the data. To access specific values that satisfy a condition, Hive must brute-force scan the entire dataset, so access latency is high. Thanks to MapReduce, however, Hive can access data in parallel, so even without indexes it still shows its strength on large data volumes. In a database, indexes are usually built on one or a few columns, giving high efficiency and low latency for small, condition-specific accesses. Because of this high data-access latency, Hive is not suitable for online data queries.

1.4.5 execution

Most queries in Hive are executed through MapReduce provided by Hadoop. Databases usually have their own execution engine.

1.4.6 execution delay

When Hive queries data it must scan the whole table because there are no indexes, so the latency is high. Another factor behind Hive's high execution latency is the MapReduce framework: since MapReduce itself has high latency, queries executed through it inherit that latency. In contrast, databases have low execution latency; of course, this holds only while the data scale is small. Once the data scale grows beyond what the database can handle, Hive's parallel computing clearly shows its advantage.

1.4.7 scalability

Because Hive is built on Hadoop, its scalability matches Hadoop's (the world's largest Hadoop cluster was at Yahoo!, with roughly 4,000 nodes in 2009). Due to the strict constraints of ACID semantics, databases scale out very little; even Oracle, the most advanced parallel database, has a theoretical scaling limit of only about 100 nodes.

1.4.8 data scale

Since Hive is built on a cluster and can use MapReduce for parallel computation, it can support very large data scales; correspondingly, databases support much smaller scales.

Chapter 2 Hive installation

2.1 Hive installation address

1) Hive official website address

http://hive.apache.org/

2) Document viewing address

https://cwiki.apache.org/confluence/display/Hive/GettingStarted

3) Download address

http://archive.apache.org/dist/hive/

2.2 Hive installation and deployment

1) Hive installation and configuration
(1) Upload apache-hive-1.2.1-bin.tar.gz to the /opt/software directory on Linux

(2) Extract apache-hive-1.2.1-bin.tar.gz into /opt/module/

[atguigu@hadoop102 software]$ tar -zxvf apache-hive-1.2.1-bin.tar.gz -C /opt/module/

(3) Rename the extracted apache-hive-1.2.1-bin/ directory to hive

[atguigu@hadoop102 module]$ mv apache-hive-1.2.1-bin/ hive 

(4) In the /opt/module/hive/conf directory, rename hive-env.sh.template to hive-env.sh

[atguigu@hadoop102 conf]$ mv hive-env.sh.template hive-env.sh

(5) Configure the hive-env.sh file

(a) Configure the HADOOP_HOME path
export HADOOP_HOME=/opt/module/hadoop-2.7.2
(b) Configure the HIVE_CONF_DIR path
export HIVE_CONF_DIR=/opt/module/hive/conf

2) Hadoop cluster configuration
(1) HDFS and YARN must be started

[atguigu@hadoop102 hadoop-2.7.2]$ sbin/start-dfs.sh
[atguigu@hadoop103 hadoop-2.7.2]$ sbin/start-yarn.sh

(2) Create the /tmp and /user/hive/warehouse directories on HDFS and grant their group write permission (if you skip this step, the system creates them automatically)

[atguigu@hadoop102 hadoop-2.7.2]$ bin/hadoop fs -mkdir /tmp
[atguigu@hadoop102 hadoop-2.7.2]$ bin/hadoop fs -mkdir -p /user/hive/warehouse
[atguigu@hadoop102 hadoop-2.7.2]$ bin/hadoop fs -chmod g+w /tmp
[atguigu@hadoop102 hadoop-2.7.2]$ bin/hadoop fs -chmod g+w /user/hive/warehouse

3) Hive basic operation
(1) Start hive

[atguigu@hadoop102 hive]$ bin/hive 

(2) View database

hive> show databases; 

(3) Open default database

hive> use default; 

(4) Displays the tables in the default database

hive> show tables; 

(5) Create a table

hive> create table student(id int, name string); 

(6) Displays how many tables are in the database

hive> show tables; 

(7) View table structure

hive> desc student; 

(8) Insert data into a table

hive> insert into student values(1000,"ss"); 

(9) Data in query table

hive> select * from student; 

(10) Exit hive

hive> quit; 

2.3 Case study: importing a local file into Hive

Import the data in the local file /opt/module/data/student.txt into Hive's student(id int, name string) table.

1) Data preparation
Prepare data in the /opt/module/data directory

(1) Create a data directory under /opt/module/

[atguigu@hadoop102 module]$ mkdir data 

(2) Create a student.txt file in the /opt/module/data/ directory and add data

[atguigu@hadoop102 data]$ touch student.txt

[atguigu@hadoop102 data]$ vi student.txt
1001	zhangshan
1002	lishi
1003	zhaoliu

Note: the fields are separated by a tab character.
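
If you want to verify that the separators really are tabs, cat -A prints each tab as ^I and each line end as $:

[atguigu@hadoop102 data]$ cat -A student.txt
1001^Izhangshan$
1002^Ilishi$
1003^Izhaoliu$
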
2) Hands-on Hive operations
(1) Start hive

[atguigu@hadoop102 hive]$ bin/hive 

(2) Display database

hive> show databases; 

(3) Use default database

hive> use default; 

(4) Displays the tables in the default database

hive> show tables; 

(5) Delete the created student table

hive> drop table student; 

(6) Create the student table and declare '\t' as the field separator

hive> create table student(id int, name string) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

(7) Load the /opt/module/data/student.txt file into the student table.

hive> load data local inpath '/opt/module/data/student.txt' into table student;
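
Note that into table appends to whatever the table already contains; adding the overwrite keyword replaces the existing contents instead:

hive> load data local inpath '/opt/module/data/student.txt' overwrite into table student;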

(8) Hive query results

hive> select * from student;
OK
1001	zhangshan
1002	lishi
1003	zhaoliu
Time taken: 0.266 seconds, Fetched: 3 row(s)

3) Problems encountered
If you open a second client window and start hive while the first is still running, a java.sql.SQLException is thrown.

Exception in thread "main" java.lang.RuntimeException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient
        at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:522)
        at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:677)
        at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:621)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
Caused by: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient
        at org.apache.hadoop.hive.metastore.MetaStoreUtils.newInstance(MetaStoreUtils.java:1523)
        at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.<init>(RetryingMetaStoreClient.java:86)
        at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:132)
        at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:104)
        at org.apache.hadoop.hive.ql.metadata.Hive.createMetaStoreClient(Hive.java:3005)
        at org.apache.hadoop.hive.ql.metadata.Hive.getMSC(Hive.java:3024)
        at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:503)
        ... 8 more

The reason is that the Metastore is stored in the embedded Derby database by default, which supports only one client connection at a time; MySQL is recommended for storing the Metastore.

2.4 MySql installation

2.4.1 preparation of installation package

1) Check whether MySQL is already installed; if so, uninstall it
(1) Check

[root@hadoop102 desktop]# rpm -qa|grep mysql
mysql-libs-5.1.73-7.el6.x86_64

(2) Uninstall

[root@hadoop102 desktop]# rpm -e --nodeps mysql-libs-5.1.73-7.el6.x86_64

2) Unzip the mysql-libs.zip file to the current directory

[root@hadoop102 software]# unzip mysql-libs.zip 
[root@hadoop102 software]# ls
mysql-libs.zip mysql-libs

3) Enter the mysql-libs folder and list its contents

[root@hadoop102 mysql-libs]# ll
total 76048
-rw-r--r--. 1 root root 18509960 Mar 26  2015 MySQL-client-5.6.24-1.el6.x86_64.rpm
-rw-r--r--. 1 root root  3575135 Dec  1  2013 mysql-connector-java-5.1.27.tar.gz
-rw-r--r--. 1 root root 55782196 Mar 26  2015 MySQL-server-5.6.24-1.el6.x86_64.rpm

2.4.2 installing MySql server

1) Install mysql server

[root@hadoop102 mysql-libs]# rpm -ivh MySQL-server-5.6.24-1.el6.x86_64.rpm

2) View the generated random password

[root@hadoop102 mysql-libs]# cat /root/.mysql_secret
OEXaQuS8IWkG19Xs

3) View mysql status

[root@hadoop102 mysql-libs]# service mysql status 

4) Start mysql

[root@hadoop102 mysql-libs]# service mysql start 

2.4.3 installing MySql client

1) Install mysql client

[root@hadoop102 mysql-libs]# rpm -ivh MySQL-client-5.6.24-1.el6.x86_64.rpm

2) Connect to mysql

[root@hadoop102 mysql-libs]# mysql -uroot -pOEXaQuS8IWkG19Xs 

3) Change password

mysql>SET PASSWORD=PASSWORD('000000'); 

4) Exit mysql

mysql>exit 

2.4.4 host configuration in user table in MySQL

Configure the user table so that root, together with the password, can log in to the MySQL database from any host.

1) Enter mysql

[root@hadoop102 mysql-libs]# mysql -uroot -p000000 

2) Display database

mysql>show databases; 

3) Using mysql database

mysql>use mysql; 

4) Show all tables in mysql database

mysql>show tables; 

5) Show the structure of the user table

mysql>desc user; 

6) Query user table

mysql>select User, Host, Password from user; 

7) Modify the user table, setting the Host column to %

mysql>update user set host='%' where host='localhost'; 

8) Delete the root user's other Host entries

mysql>delete from user where Host='hadoop102';
mysql>delete from user where Host='127.0.0.1';
mysql>delete from user where Host='::1';

9) Refresh

mysql>flush privileges; 

10) Exit

mysql>quit; 

2.5 Hive metadata configuration to MySql

2.5.1 driver copy

1) In the /opt/software/mysql-libs directory, untar the mysql-connector-java-5.1.27.tar.gz driver package

[root@hadoop102 mysql-libs]# tar -zxvf mysql-connector-java-5.1.27.tar.gz

2) Copy mysql-connector-java-5.1.27-bin.jar into /opt/module/hive/lib/

[root@hadoop102 mysql-connector-java-5.1.27]# cp /opt/software/mysql-libs/mysql-connector-java-5.1.27/mysql-connector-java-5.1.27-bin.jar /opt/module/hive/lib/

2.5.2 configuring Metastore to MySql

1) Create hive-site.xml in the /opt/module/hive/conf directory

[atguigu@hadoop102 conf]$ touch hive-site.xml
[atguigu@hadoop102 conf]$ vi hive-site.xml

2) Following the configuration parameters in the official documentation, copy the following into the hive-site.xml file

https://cwiki.apache.org/confluence/display/Hive/AdminManual+MetastoreAdmin

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
    <property>
        <name>javax.jdo.option.ConnectionURL</name>
        <value>jdbc:mysql://hadoop102:3306/metastore?createDatabaseIfNotExist=true</value>
        <description>JDBC connect string for a JDBC metastore</description>
    </property>

    <property>
        <name>javax.jdo.option.ConnectionDriverName</name>
        <value>com.mysql.jdbc.Driver</value>
        <description>Driver class name for a JDBC metastore</description>
    </property>

    <property>
        <name>javax.jdo.option.ConnectionUserName</name>
        <value>root</value>
        <description>username to use against metastore database</description>
    </property>

    <property>
        <name>javax.jdo.option.ConnectionPassword</name>
        <value>000000</value>
        <description>password to use against metastore database</description>
    </property>
</configuration>

3) If Hive fails to start after configuration, you can restart the virtual machine. (After restarting, don't forget to start the Hadoop cluster.)

2.5.3 Multi-window Hive startup test

1) Start MySQL first

[atguigu@hadoop102 mysql-libs]$ mysql -uroot -p000000

Check which databases exist:

mysql> show databases;
+--------------------+
| Database           |
+--------------------+
| information_schema |
| mysql              |
| performance_schema |
| test               |
+--------------------+

2) Open several more windows and start hive in each

[atguigu@hadoop102 hive]$ bin/hive

3) After starting hive, go back to the MySQL window and list the databases again; a metastore database has been added

mysql> show databases;
+--------------------+
| Database           |
+--------------------+
| information_schema |
| metastore          |
| mysql              |
| performance_schema |
| test               |
+--------------------+

2.6 Hive JDBC access

2.6.1 start the hiveserver2 service

[atguigu@hadoop102 hive]$ bin/hiveserver2 

2.6.2 start beeline

[atguigu@hadoop102 hive]$ bin/beeline
Beeline version 1.2.1 by Apache Hive
beeline>

2.6.3 connecting the hiveserver2

beeline> !connect jdbc:hive2://hadoop102:10000 (press Enter)

Connecting to jdbc:hive2://hadoop102:10000

Enter username for jdbc:hive2://hadoop102:10000: atguigu (press Enter)

Enter password for jdbc:hive2://hadoop102:10000: (press Enter directly)

Connected to: Apache Hive (version 1.2.1)
Driver: Hive JDBC (version 1.2.1)

Transaction isolation: TRANSACTION_REPEATABLE_READ

0: jdbc:hive2://hadoop102:10000> show databases;
+----------------+--+
| database_name  |
+----------------+--+
| default        |
| hive_db2       |
+----------------+--+
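
The same connection can also be made in a single step from the shell, where -u supplies the JDBC URL and -n the username:

[atguigu@hadoop102 hive]$ bin/beeline -u jdbc:hive2://hadoop102:10000 -n atguigu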

2.7 Hive common interactive commands

[hdfs@bigdata-dw-nd-01 feng.zhao]$ hive -help 
Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=512M; support was removed in 8.0
Java HotSpot(TM) 64-Bit Server VM warning: Using incremental CMS is deprecated and will likely be removed in a future release
Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=512M; support was removed in 8.0
usage: hive
 -d,--define <key=value>          Variable subsitution to apply to hive
                                  commands. e.g. -d A=B or --define A=B
    --database <databasename>     Specify the database to use
 -e <quoted-query-string>         SQL from command line
 -f <filename>                    SQL from files
 -H,--help                        Print help information
    --hiveconf <property=value>   Use value for given property
    --hivevar <key=value>         Variable subsitution to apply to hive
                                  commands. e.g. --hivevar A=B
 -i <filename>                    Initialization SQL file
 -S,--silent                      Silent mode in interactive shell
 -v,--verbose                     Verbose mode (echo executed SQL to the
                                  console)
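
The most commonly used of these options are -e and -f, for example (hivef.sql is a hypothetical file containing SQL statements):

[atguigu@hadoop102 hive]$ bin/hive -e "select id from student;"
[atguigu@hadoop102 hive]$ bin/hive -f /opt/module/data/hivef.sql   # hivef.sql is a hypothetical script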

2.8 Hive other command operations

1) How to view the hdfs file system in the hive cli command window
hive> dfs -ls /;

2) How to view the local file system in the hive cli command window
hive> ! ls /opt/module/datas;

3) View all historical commands entered in hive
(1) Go to the current user's home directory, /root or /home/atguigu

(2) View the .hivehistory file

[atguigu@hadoop102 ~]$ cat .hivehistory

2.9 Hive common attribute configuration

2.9.1 Hive data warehouse location configuration

1) The default location of the data warehouse is the /user/hive/warehouse path on HDFS.

2) No folder is created in the warehouse directory for the default database; if a table belongs to the default database, its folder is created directly under the warehouse directory.

3) To change the default location of the data warehouse, copy the following configuration from hive-default.xml.template into the hive-site.xml file:

<property>
    <name>hive.metastore.warehouse.dir</name>
    <value>/user/hive/warehouse</value>
    <description>location of default database for the warehouse</description>
</property>

Grant users in the same group write permission:

bin/hdfs dfs -chmod g+w /user/hive/warehouse

2.9.2 information display configuration after query

1) Add the following configuration to the hive-site.xml file to display the current database name and the column headers of query results.

<property>
    <name>hive.cli.print.header</name>
    <value>true</value>
</property>

<property>
    <name>hive.cli.print.current.db</name>
    <value>true</value>
</property>

2) Restart hive and compare the differences before and after configuration.

2.9.3 Hive operation log information configuration

1) By default, Hive's log is written to /tmp/atguigu/hive.log (a directory named after the current user)
2) Change Hive's log location so logs are saved to /opt/module/hive/logs

(1) Rename /opt/module/hive/conf/hive-log4j.properties.template to hive-log4j.properties
[atguigu@hadoop102 conf]$ pwd

/opt/module/hive/conf

[atguigu@hadoop102 conf]$ mv hive-log4j.properties.template hive-log4j.properties

(2) Modify the log storage location in hive-log4j.properties file

hive.log.dir=/opt/module/hive/logs
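
After restarting Hive, you can confirm the new location by tailing the log file:

[atguigu@hadoop102 hive]$ tail -f /opt/module/hive/logs/hive.log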

2.9.4 parameter configuration mode

1) View all current configuration information
hive>set;

2) There are three ways to configure parameters
(1) Profile mode

Default configuration file: hive-default.xml
User-defined configuration file: hive-site.xml

Note: user-defined configuration overrides the default configuration. Hive also reads the Hadoop configuration, because Hive starts as a Hadoop client, and Hive's configuration overrides Hadoop's. Settings in the configuration files apply to all Hive processes started on the machine.

(2) Command line parameter mode

When starting Hive, you can add -hiveconf param=value on the command line to set a parameter. For example:

[atguigu@hadoop103 hive]$ bin/hive -hiveconf mapred.reduce.tasks=10

Note: this setting is valid only for the current hive session

View parameter settings:

hive (default)> set mapred.reduce.tasks;

(3) Parameter declaration method

You can use the SET keyword in HQL to SET parameters, for example:

hive (default)> set mapred.reduce.tasks=100;

Note: only valid for this hive startup.

View parameter settings

hive (default)> set mapred.reduce.tasks;

The priority of these three methods increases in order: configuration file < command-line parameter < parameter declaration. Note that some system-level parameters, such as log4j settings, must be set with the first two methods, because they are read before the session is established.
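
A quick way to watch this priority in action (a sketch): start Hive with a command-line value, then override it with a SET statement inside the session:

[atguigu@hadoop103 hive]$ bin/hive -hiveconf mapred.reduce.tasks=10
hive (default)> set mapred.reduce.tasks;
mapred.reduce.tasks=10
hive (default)> set mapred.reduce.tasks=100;
hive (default)> set mapred.reduce.tasks;
mapred.reduce.tasks=100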
