Basic concepts of Hive
What is Hive
Hive: a tool originally developed at Facebook to run statistics over massive structured log data.
Hive is a data warehouse tool built on Hadoop. It maps structured data files to tables and provides SQL-like query capabilities.
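As a minimal sketch of what that looks like (the table name, columns, and HDFS path here are hypothetical):

-- Map a structured text file already sitting on HDFS to a table (hypothetical schema and path)
CREATE EXTERNAL TABLE student (
    id   INT,
    name STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/user/hive/warehouse/student';

-- Query it with SQL-like syntax; no MapReduce code needs to be written
SELECT id, name FROM student WHERE id > 1000;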
The nature of Hive
Hive converts HQL into MapReduce, as the sketch after this list illustrates:
(1) The data Hive processes is stored in HDFS
(2) The underlying implementation Hive uses to analyze data is MapReduce
(3) The execution programs run on Yarn
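As a hedged illustration, reusing the hypothetical student table from above: depending on configuration, a bare SELECT * can be answered by a simple fetch, while an aggregation is compiled into a MapReduce job and submitted to Yarn:

-- May be served by a simple fetch task, without launching MapReduce
SELECT * FROM student;

-- Compiled into a MapReduce job (map: emit name; reduce: count per group) that runs on Yarn
SELECT name, COUNT(*) FROM student GROUP BY name;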
Advantages and disadvantages of Hive
Advantages
(1) The operation interface uses SQL-like syntax, which makes development fast (simple and easy to use).
(2) It avoids writing MapReduce directly, reducing the learning cost for developers.
(3) Hive's execution latency is high, so Hive is typically used for data analysis where real-time results are not required.
(4) Hive has an advantage in processing big data, but no advantage in processing small data, because of its high execution latency.
(5) Hive supports user-defined functions, so users can implement functions for their own needs (a sketch follows this list).
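A minimal sketch of point (5); the jar path and class name are hypothetical, and the UDF class itself would be written in Java against Hive's UDF API:

-- Register and use a custom UDF (hypothetical jar path and class name)
ADD JAR /opt/module/hive-3.1.2/auxlib/my_udf.jar;
CREATE TEMPORARY FUNCTION my_lower AS 'com.example.hive.MyLowerUDF';
SELECT my_lower(name) FROM student;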
Disadvantages
The expressive power of HQL is limited
(1) Iterative algorithms cannot be expressed
(2) It is not good at data mining; due to the limitations of the MapReduce data-processing model, more efficient algorithms cannot be implemented
Hive's efficiency is relatively low
(1) The MapReduce jobs that Hive generates automatically are usually not intelligent enough
(2) Hive is difficult to tune, and the tuning granularity is coarse
Hive architecture principles
- User interface: Client
CLI (command-line interface), JDBC/ODBC (JDBC access to Hive), WebUI (browser access to Hive)
- Metadata: Metastore
Metadata includes the table name, the database the table belongs to (default by default), the table owner, the column/partition fields, the table type (e.g. whether it is an external table), the directory where the table's data resides, and so on.
By default, metadata is stored in Hive's embedded Derby database; MySQL is recommended for storing the Metastore.
- Hadoop
Hive uses HDFS for storage and MapReduce for computation.
- Driver
(1) Parser (SQL Parser): converts the SQL string into an abstract syntax tree (AST), which is usually done with a third-party tool library such as ANTLR, and then checks the AST, e.g. whether tables and fields exist and whether the SQL semantics are correct.
(2) Compiler (Physical Plan): compiles the AST to generate a logical execution plan.
(3) Query Optimizer: optimizes the logical execution plan.
(4) Execution: converts the logical execution plan into a physical plan that can be run; for Hive, this means MR/Spark jobs.
Through the interactive interfaces it provides, Hive receives the user's instructions (SQL), uses its own Driver together with the metadata in the Metastore to translate these instructions into MapReduce jobs, submits them to Hadoop for execution, and finally outputs the returned results to the user's interface.
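You can observe the plan the Driver produces with HQL's EXPLAIN statement; a sketch, again using the hypothetical student table:

-- Print the stages Hive compiles the query into (e.g. a MapReduce stage with map/reduce operator trees)
EXPLAIN SELECT name, COUNT(*) FROM student GROUP BY name;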
Hive vs. databases
Because Hive uses the SQL-like query language HQL (Hive Query Language), it is easy to mistake Hive for a database. In fact, apart from the similar query language, Hive and databases have little in common structurally. This section discusses the differences between Hive and databases from several angles. Databases can serve online applications, but Hive is designed for the data warehouse; keeping this in mind helps in understanding Hive's characteristics from an application perspective.
query language
Because SQL is widely used in data warehouses, the SQL-like query language HQL was designed specifically around Hive's characteristics. Developers familiar with SQL can easily pick up Hive development.
Data Update
Because Hive is designed for data warehouse applications, and the contents of a data warehouse are read often but written rarely, rewriting data is not recommended in Hive; all data is determined when it is loaded. Data in a database, by contrast, usually needs to be modified frequently, so you use INSERT INTO ... VALUES to add data and UPDATE ... SET to modify it.
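A sketch of the contrast (the table and file path are hypothetical; note that newer Hive versions with ACID tables do support INSERT ... VALUES and UPDATE, but the classic warehouse pattern remains load-once):

-- Typical database usage: frequent row-level changes
INSERT INTO student VALUES (1001, 'zhangsan');
UPDATE student SET name = 'lisi' WHERE id = 1001;

-- Typical Hive usage: data is determined when it is loaded
LOAD DATA INPATH '/origin_data/student.txt' INTO TABLE student;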
Execution latency
Hive has high latency when querying data because it has no indexes and must scan the entire table. Another factor behind Hive's high execution latency is the MapReduce framework: since MapReduce itself has high latency, queries executed on it inherit that latency. By contrast, a database's latency is low, but this low latency is conditional on the data size being small; once the data size exceeds the database's processing power, Hive's parallel computation clearly shows its advantage.
Data Size
Because Hive is built on a cluster and can use MapReduce for parallel computation, it can support very large amounts of data; correspondingly, databases support smaller amounts of data.
Hive installation
Hive Installation Address
Official address
http://hive.apache.org/
Documentation address
https://cwiki.apache.org/confluence/display/Hive/GettingStarted
Download Address
http://archive.apache.org/dist/hive/
MySQL Installation
1. Uninstall the preinstalled MySQL/MariaDB
# 1. Query the preinstalled packages (note: it is called mysql in CentOS 6 and mariadb in CentOS 7)
[atguigu@hadoop102 ~]$ rpm -qa | grep mysql
[atguigu@hadoop102 ~]$ rpm -qa | grep mariadb
mariadb-libs-5.5.56-2.el7.x86_64
# Or combine the queries (-i ignores case, -E uses extended regular expressions)
[atguigu@hadoop102 ~]$ rpm -qa | grep -i -E mysql\|mariadb
mariadb-libs-5.5.56-2.el7.x86_64
[atguigu@hadoop102 ~]$
# 2. Uninstall the preinstalled packages
[atguigu@hadoop102 ~]$ sudo rpm -e --nodeps mariadb-libs-5.5.56-2.el7.x86_64
# Or uninstall everything found above with a pipeline
[atguigu@hadoop102 ~]$ rpm -qa | grep -i -E mysql\|mariadb | xargs -n1 sudo rpm -e --nodeps
2. Upload the MySQL installation package to /opt/software/ on the hadoop102 machine
3. Unzip mysql installation package
# Create a mysql-rpm directory first, then extract the bundle into it
[atguigu@hadoop102 software]$ tar -xvf mysql-5.7.28-1.el7.x86_64.rpm-bundle.tar -C ./mysql-rpm/
[atguigu@hadoop102 mysql-rpm]$ pwd
/opt/software/mysql-rpm
[atguigu@hadoop102 mysql-rpm]$ ll
total 595272
# Note: only the following five packages need to be installed, in this order, because of dependencies
mysql-community-common-5.7.28-1.el7.x86_64.rpm
mysql-community-libs-5.7.28-1.el7.x86_64.rpm
mysql-community-libs-compat-5.7.28-1.el7.x86_64.rpm
mysql-community-client-5.7.28-1.el7.x86_64.rpm
mysql-community-server-5.7.28-1.el7.x86_64.rpm
[atguigu@hadoop102 mysql-rpm]$
4. Perform rpm installation under installation directory
# Install the five packages in the order listed above
[atguigu@hadoop102 mysql-rpm]$ sudo rpm -ivh mysql-community-common-5.7.28-1.el7.x86_64.rpm
[atguigu@hadoop102 mysql-rpm]$ sudo rpm -ivh mysql-community-libs-5.7.28-1.el7.x86_64.rpm
[atguigu@hadoop102 mysql-rpm]$ sudo rpm -ivh mysql-community-libs-compat-5.7.28-1.el7.x86_64.rpm
[atguigu@hadoop102 mysql-rpm]$ sudo rpm -ivh mysql-community-client-5.7.28-1.el7.x86_64.rpm
[atguigu@hadoop102 mysql-rpm]$ sudo rpm -ivh mysql-community-server-5.7.28-1.el7.x86_64.rpm
Error installing mysql-server package
Solution
# mysql-community-server-5.7.28-1.el7.x86_64 requires libaio, so install it first
sudo yum install -y libaio
# After that, mysql-community-server-5.7.28-1.el7.x86_64.rpm installs normally
5. Delete all contents in the directory pointed to by datadir in the /etc/my.cnf file
This step applies when a previous MySQL installation exists; skip it if you are installing for the first time
6. Initialize the database
# Initialize the database (this runs as the mysql OS user and creates the data directory)
[atguigu@hadoop102 mysql-rpm]$ sudo mysqld --initialize --user=mysql
7. View the password of the temporarily generated root user
[atguigu@hadoop102 mysql-rpm]$ sudo cat /var/log/mysqld.log | grep password
2021-09-29T16:12:42.271437Z 1 [Note] A temporary password is generated for root@localhost: Pttf1a3=3.LQ
[atguigu@hadoop102 mysql-rpm]$
# The generated temporary password is Pttf1a3=3.LQ
8. Start the MySQL service
[atguigu@hadoop102 mysql-rpm]$ sudo systemctl start mysqld
9. Log in to the MySQL database
[atguigu@hadoop102 mysql-rpm]$ mysql -uroot -p
Enter password:
10. The root user's password must be changed first; otherwise any other operation will result in an error
mysql> set password = password('123456');
Query OK, 0 rows affected, 1 warning (0.00 sec)

mysql> show databases;
+--------------------+
| Database           |
+--------------------+
| information_schema |
| mysql              |
| performance_schema |
| sys                |
+--------------------+
4 rows in set (0.00 sec)

mysql>
11. Modify the root user in the user table under the mysql library to allow connections from any IP
By default, MySQL only allows connections from localhost, so you need to modify it to allow access from other IPs
mysql> use mysql;
mysql> update user set host = '%' where user = 'root';
Next, you can test the connection from another client. I am using Navicat 15 here, but at this point the connection still fails.
# One more statement needs to be executed
mysql> flush privileges;
# Test the connection again: success
Hive Installation and Deployment
1. Upload apache-hive-3.1.2-bin.tar.gz to /opt/software/
2. Unzip apache-hive-3.1.2-bin.tar.gz to /opt/module/
tar -zxvf apache-hive-3.1.2-bin.tar.gz -C /opt/module
3. Rename the decompressed directory apache-hive-3.1.2-bin to hive-3.1.2 (e.g. with mv apache-hive-3.1.2-bin hive-3.1.2)
4. Modify /etc/profile.d/my_env.sh to add environment variables for Hive
# Configure environment variables for the JDK
# Declare the JAVA_HOME variable
JAVA_HOME=/opt/module/jdk1.8.0_212
# Configure environment variables for Hadoop
# Declare the HADOOP_HOME variable
HADOOP_HOME=/opt/module/ha/hadoop-3.1.3
# Declare the HIVE_HOME variable
HIVE_HOME=/opt/module/hive-3.1.2
# Declare the PATH variable, appending $JAVA_HOME/bin, $HADOOP_HOME/bin, $HADOOP_HOME/sbin and $HIVE_HOME/bin to the system's original PATH
PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$HIVE_HOME/bin
# Export JAVA_HOME, HADOOP_HOME, HIVE_HOME and PATH as global variables
export JAVA_HOME HADOOP_HOME HIVE_HOME PATH
Run source /etc/profile for the environment variables to take effect
5. Resolve Log Jar Package Conflicts
Hive ships with its own log4j-slf4j binding jar, and Hadoop has its own, so the two conflict
# Solution: disable the jar that ships with Hive and use Hadoop's uniformly (renaming it is enough)
[atguigu@hadoop102 software]$ mv $HIVE_HOME/lib/log4j-slf4j-impl-2.10.0.jar $HIVE_HOME/lib/log4j-slf4j-impl-2.10.0.bak
Hive is now installed
Hive Metadata Configuration to MySQL
Copy the driver
Copy MySQL's JDBC driver package to Hive's lib directory
cp /opt/software/mysql-connector-java-5.1.37.jar $HIVE_HOME/lib
Configure Metastore to MySQL
Create a new hive-site.xml file in the $HIVE_HOME/conf directory, adding the following
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
    <!-- JDBC connection URL -->
    <property>
        <name>javax.jdo.option.ConnectionURL</name>
        <value>jdbc:mysql://hadoop102:3306/metastore?useSSL=false</value>
    </property>
    <!-- JDBC connection driver -->
    <property>
        <name>javax.jdo.option.ConnectionDriverName</name>
        <value>com.mysql.jdbc.Driver</value>
    </property>
    <!-- JDBC connection username -->
    <property>
        <name>javax.jdo.option.ConnectionUserName</name>
        <value>root</value>
    </property>
    <!-- JDBC connection password -->
    <property>
        <name>javax.jdo.option.ConnectionPassword</name>
        <value>123456</value>
    </property>
    <!-- Hive's default working directory on HDFS -->
    <property>
        <name>hive.metastore.warehouse.dir</name>
        <value>/user/hive/warehouse</value>
    </property>
    <!-- Metastore schema verification -->
    <property>
        <name>hive.metastore.schema.verification</name>
        <value>false</value>
    </property>
    <!-- Metastore event notification API authorization -->
    <property>
        <name>hive.metastore.event.db.notification.api.auth</name>
        <value>false</value>
    </property>
</configuration>
Start Hive
First, create the metastore database that Hive needs in MySQL, for example:
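From the MySQL client configured earlier (password as set in step 10):

[atguigu@hadoop102 hive-3.1.2]$ mysql -uroot -p123456
mysql> create database metastore;
mysql> quit;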
Initialize the Hive metadata database
# This process has Hive create the metastore schema (74 tables in this version) in our MySQL database
[atguigu@hadoop102 hive-3.1.2]$ schematool -initSchema -dbType mysql -verbose
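If you are curious, you can peek at what schematool just created; a sketch using the MySQL login from earlier (DBS is one of the real metastore tables, holding the databases Hive knows about):

mysql> use metastore;
mysql> show tables;   -- lists the metadata tables created by schematool
mysql> select DB_ID, NAME, DB_LOCATION_URI from DBS;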
OK, Hive is now installed and ready to use