Basic concepts of Hive
What is Hive
Hive: a tool originally developed at Facebook to run statistics over massive structured log data.
Hive is a data warehouse tool built on Hadoop. It maps structured data files to tables and provides SQL-like query capabilities.
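As a minimal sketch of what that looks like (the table name, columns, and HDFS path here are hypothetical):

-- Map a structured text file already sitting on HDFS to a table (hypothetical schema and path)
CREATE EXTERNAL TABLE student (
    id   INT,
    name STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/user/hive/warehouse/student';

-- Query it with SQL-like syntax; no MapReduce code needs to be written
SELECT id, name FROM student WHERE id > 1000;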
The nature of Hive
Hive converts HQL into MapReduce, as the sketch after this list illustrates:
(1) The data Hive processes is stored in HDFS
(2) The underlying implementation Hive uses to analyze data is MapReduce
(3) The execution programs run on Yarn
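As a hedged illustration, reusing the hypothetical student table from above: depending on configuration, a bare SELECT * can be answered by a simple fetch, while an aggregation is compiled into a MapReduce job and submitted to Yarn:

-- May be served by a simple fetch task, without launching MapReduce
SELECT * FROM student;

-- Compiled into a MapReduce job (map: emit name; reduce: count per group) that runs on Yarn
SELECT name, COUNT(*) FROM student GROUP BY name;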
Advantages and disadvantages of Hive
Advantages
(1) The operation interface uses SQL-like syntax, which makes development fast (simple and easy to use).
(2) It avoids writing MapReduce directly, reducing the learning cost for developers.
(3) Hive's execution latency is high, so Hive is typically used for data analysis where real-time results are not required.
(4) Hive has an advantage in processing big data, but no advantage in processing small data, because of its high execution latency.
(5) Hive supports user-defined functions, so users can implement functions for their own needs (a sketch follows this list).
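A minimal sketch of point (5); the jar path and class name are hypothetical, and the UDF class itself would be written in Java against Hive's UDF API:

-- Register and use a custom UDF (hypothetical jar path and class name)
ADD JAR /opt/module/hive-3.1.2/auxlib/my_udf.jar;
CREATE TEMPORARY FUNCTION my_lower AS 'com.example.hive.MyLowerUDF';
SELECT my_lower(name) FROM student;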
Disadvantages
The expressive power of HQL is limited
(1) Iterative algorithms cannot be expressed
(2) It is not good at data mining; due to the limitations of the MapReduce data-processing model, more efficient algorithms cannot be implemented
Hive's efficiency is relatively low
(1) The MapReduce jobs that Hive generates automatically are usually not intelligent enough
(2) Hive is difficult to tune, and the tuning granularity is coarse
Hive architecture principles
- User interface: Client
CLI (command-line interface), JDBC/ODBC (JDBC access to Hive), WebUI (browser access to Hive)
- Metadata: Metastore
Metadata includes the table name, the database the table belongs to (default by default), the table owner, the column/partition fields, the table type (e.g. whether it is an external table), the directory where the table's data resides, and so on.
By default, metadata is stored in Hive's embedded Derby database; MySQL is recommended for storing the Metastore.
- Hadoop
Hive uses HDFS for storage and MapReduce for computation.
- Driver
(1) Parser (SQL Parser): converts the SQL string into an abstract syntax tree (AST), which is usually done with a third-party tool library such as ANTLR, and then checks the AST, e.g. whether tables and fields exist and whether the SQL semantics are correct.
(2) Compiler (Physical Plan): compiles the AST to generate a logical execution plan.
(3) Query Optimizer: optimizes the logical execution plan.
(4) Execution: converts the logical execution plan into a physical plan that can be run; for Hive, this means MR/Spark jobs.
Through the interactive interfaces it provides, Hive receives the user's instructions (SQL), uses its own Driver together with the metadata in the Metastore to translate these instructions into MapReduce jobs, submits them to Hadoop for execution, and finally outputs the returned results to the user's interface.
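You can observe the plan the Driver produces with HQL's EXPLAIN statement; a sketch, again using the hypothetical student table:

-- Print the stages Hive compiles the query into (e.g. a MapReduce stage with map/reduce operator trees)
EXPLAIN SELECT name, COUNT(*) FROM student GROUP BY name;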
Hive vs. databases
Because Hive uses the SQL-like query language HQL (Hive Query Language), it is easy to mistake Hive for a database. In fact, apart from the similar query language, Hive and databases have little in common structurally. This section discusses the differences between Hive and databases from several angles. Databases can serve online applications, but Hive is designed for the data warehouse; keeping this in mind helps in understanding Hive's characteristics from an application perspective.
query language
Because SQL is widely used in data warehouses, the SQL-like query language HQL was designed specifically around Hive's characteristics. Developers familiar with SQL can easily pick up Hive development.
Data Update
Because Hive is designed for data warehouse applications, and the contents of a data warehouse are read often but written rarely, rewriting data is not recommended in Hive; all data is determined when it is loaded. Data in a database, by contrast, usually needs to be modified frequently, so you use INSERT INTO ... VALUES to add data and UPDATE ... SET to modify it.
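A sketch of the contrast (the table and file path are hypothetical; note that newer Hive versions with ACID tables do support INSERT ... VALUES and UPDATE, but the classic warehouse pattern remains load-once):

-- Typical database usage: frequent row-level changes
INSERT INTO student VALUES (1001, 'zhangsan');
UPDATE student SET name = 'lisi' WHERE id = 1001;

-- Typical Hive usage: data is determined when it is loaded
LOAD DATA INPATH '/origin_data/student.txt' INTO TABLE student;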
Execution latency
Hive has high latency when querying data because it has no indexes and must scan the entire table. Another factor behind Hive's high execution latency is the MapReduce framework: since MapReduce itself has high latency, queries executed on it inherit that latency. By contrast, a database's latency is low, but this low latency is conditional on the data size being small; once the data size exceeds the database's processing power, Hive's parallel computation clearly shows its advantage.
Data Size
Because Hive is built on a cluster and can use MapReduce for parallel computation, it can support very large amounts of data; correspondingly, databases support smaller amounts of data.
Hive installation
Hive Installation Address
Official address
http://hive.apache.org/
Documentation address
https://cwiki.apache.org/confluence/display/Hive/GettingStarted
Download Address
http://archive.apache.org/dist/hive/
MySQL Installation
1. Uninstall the preinstalled MySQL/MariaDB
# 1. Query the preinstalled packages (note: it is called mysql in CentOS 6 and mariadb in CentOS 7)
[atguigu@hadoop102 ~]$ rpm -qa | grep mysql
[atguigu@hadoop102 ~]$ rpm -qa | grep mariadb
mariadb-libs-5.5.56-2.el7.x86_64
# Or combine the queries (-i ignores case, -E uses extended regular expressions)
[atguigu@hadoop102 ~]$ rpm -qa | grep -i -E mysql\|mariadb
mariadb-libs-5.5.56-2.el7.x86_64
[atguigu@hadoop102 ~]$
# 2. Uninstall the preinstalled packages
[atguigu@hadoop102 ~]$ sudo rpm -e --nodeps mariadb-libs-5.5.56-2.el7.x86_64
# Or uninstall everything found above with a pipeline
[atguigu@hadoop102 ~]$ rpm -qa | grep -i -E mysql\|mariadb | xargs -n1 sudo rpm -e --nodeps
2. Upload the MySQL installation package to /opt/software/ on the hadoop102 machine
3. Unzip mysql installation package
# Create a mysql-rpm directory first, then extract the bundle into it
[atguigu@hadoop102 software]$ tar -xvf mysql-5.7.28-1.el7.x86_64.rpm-bundle.tar -C ./mysql-rpm/
[atguigu@hadoop102 mysql-rpm]$ pwd
/opt/software/mysql-rpm
[atguigu@hadoop102 mysql-rpm]$ ll
total 595272
# Note: only the following five packages need to be installed, in this order, because of dependencies
mysql-community-common-5.7.28-1.el7.x86_64.rpm
mysql-community-libs-5.7.28-1.el7.x86_64.rpm
mysql-community-libs-compat-5.7.28-1.el7.x86_64.rpm
mysql-community-client-5.7.28-1.el7.x86_64.rpm
mysql-community-server-5.7.28-1.el7.x86_64.rpm
[atguigu@hadoop102 mysql-rpm]$
4. Perform rpm installation under installation directory
# Install the five packages in the order listed above
[atguigu@hadoop102 mysql-rpm]$ sudo rpm -ivh mysql-community-common-5.7.28-1.el7.x86_64.rpm
[atguigu@hadoop102 mysql-rpm]$ sudo rpm -ivh mysql-community-libs-5.7.28-1.el7.x86_64.rpm
[atguigu@hadoop102 mysql-rpm]$ sudo rpm -ivh mysql-community-libs-compat-5.7.28-1.el7.x86_64.rpm
[atguigu@hadoop102 mysql-rpm]$ sudo rpm -ivh mysql-community-client-5.7.28-1.el7.x86_64.rpm
[atguigu@hadoop102 mysql-rpm]$ sudo rpm -ivh mysql-community-server-5.7.28-1.el7.x86_64.rpm
Error installing mysql-server package
Solution
# mysql-community-server-5.7.28-1.el7.x86_64 requires libaio, so install it first
sudo yum install -y libaio
# After that, mysql-community-server-5.7.28-1.el7.x86_64.rpm installs normally
5. Delete all contents in the directory pointed to by datadir in the /etc/my.cnf file
This step applies when a previous MySQL installation exists; skip it if you are installing for the first time
6. Initialize the database
# Initialize the database (this runs as the mysql OS user and creates the data directory)
[atguigu@hadoop102 mysql-rpm]$ sudo mysqld --initialize --user=mysql
7. View the password of the temporarily generated root user
[atguigu@hadoop102 mysql-rpm]$ sudo cat /var/log/mysqld.log | grep password
2021-09-29T16:12:42.271437Z 1 [Note] A temporary password is generated for root@localhost: Pttf1a3=3.LQ
[atguigu@hadoop102 mysql-rpm]$
# The generated temporary password is Pttf1a3=3.LQ
8. Start the MySQL service
[atguigu@hadoop102 mysql-rpm]$ sudo systemctl start mysqld
9. Log in to the MySQL database
[atguigu@hadoop102 mysql-rpm]$ mysql -uroot -p
Enter password:
10. The root user's password must be changed first; otherwise any other operation will result in an error
mysql> set password = password('123456');
Query OK, 0 rows affected, 1 warning (0.00 sec)

mysql> show databases;
+--------------------+
| Database           |
+--------------------+
| information_schema |
| mysql              |
| performance_schema |
| sys                |
+--------------------+
4 rows in set (0.00 sec)

mysql>
11. Modify the root user in the user table under the mysql library to allow connections from any IP
By default, MySQL only allows connections from localhost, so you need to modify it to allow access from other IPs
mysql> use mysql;
mysql> update user set host = '%' where user = 'root';
Next, you can test the connection from another client. I am using Navicat 15 here, but at this point the connection still fails.
# One more statement needs to be executed
mysql> flush privileges;
# Test the connection again: success
Hive Installation and Deployment
1. Upload apache-hive-3.1.2-bin.tar.gz to /opt/software/
2. Unzip apache-hive-3.1.2-bin.tar.gz to /opt/module/
tar -zxvf apache-hive-3.1.2-bin.tar.gz -C /opt/module
3. Rename the decompressed directory apache-hive-3.1.2-bin to hive-3.1.2 (e.g. with mv apache-hive-3.1.2-bin hive-3.1.2)
4. Modify /etc/profile.d/my_env.sh to add environment variables for Hive
# Configure environment variables for the JDK
# Declare the JAVA_HOME variable
JAVA_HOME=/opt/module/jdk1.8.0_212
# Configure environment variables for Hadoop
# Declare the HADOOP_HOME variable
HADOOP_HOME=/opt/module/ha/hadoop-3.1.3
# Declare the HIVE_HOME variable
HIVE_HOME=/opt/module/hive-3.1.2
# Declare the PATH variable, appending $JAVA_HOME/bin, $HADOOP_HOME/bin, $HADOOP_HOME/sbin and $HIVE_HOME/bin to the system's original PATH
PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$HIVE_HOME/bin
# Export JAVA_HOME, HADOOP_HOME, HIVE_HOME and PATH as global variables
export JAVA_HOME HADOOP_HOME HIVE_HOME PATH
Run source /etc/profile for the environment variables to take effect
5. Resolve Log Jar Package Conflicts
Hive ships with its own log4j-slf4j binding jar, and Hadoop has its own, so the two conflict
# Solution: disable the jar that ships with Hive and use Hadoop's uniformly (renaming it is enough)
[atguigu@hadoop102 software]$ mv $HIVE_HOME/lib/log4j-slf4j-impl-2.10.0.jar $HIVE_HOME/lib/log4j-slf4j-impl-2.10.0.bak
Hive is now installed
Hive Metadata Configuration to MySQL
Copy the driver
Copy MySQL's JDBC driver package to Hive's lib directory
cp /opt/software/mysql-connector-java-5.1.37.jar $HIVE_HOME/lib
Configure Metastore to MySQL
Create a new hive-site.xml file in the $HIVE_HOME/conf directory, adding the following
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
    <!-- JDBC connection URL -->
    <property>
        <name>javax.jdo.option.ConnectionURL</name>
        <value>jdbc:mysql://hadoop102:3306/metastore?useSSL=false</value>
    </property>
    <!-- JDBC connection driver -->
    <property>
        <name>javax.jdo.option.ConnectionDriverName</name>
        <value>com.mysql.jdbc.Driver</value>
    </property>
    <!-- JDBC connection username -->
    <property>
        <name>javax.jdo.option.ConnectionUserName</name>
        <value>root</value>
    </property>
    <!-- JDBC connection password -->
    <property>
        <name>javax.jdo.option.ConnectionPassword</name>
        <value>123456</value>
    </property>
    <!-- Hive's default working directory on HDFS -->
    <property>
        <name>hive.metastore.warehouse.dir</name>
        <value>/user/hive/warehouse</value>
    </property>
    <!-- Metastore schema verification -->
    <property>
        <name>hive.metastore.schema.verification</name>
        <value>false</value>
    </property>
    <!-- Metastore event notification API authorization -->
    <property>
        <name>hive.metastore.event.db.notification.api.auth</name>
        <value>false</value>
    </property>
</configuration>
Start Hive
First, create the metastore database that Hive needs in MySQL, for example:
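From the MySQL client configured earlier (password as set in step 10):

[atguigu@hadoop102 hive-3.1.2]$ mysql -uroot -p123456
mysql> create database metastore;
mysql> quit;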
Initialize the Hive metadata database
# This process has Hive create the metastore schema (74 tables in this version) in our MySQL database
[atguigu@hadoop102 hive-3.1.2]$ schematool -initSchema -dbType mysql -verbose
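If you are curious, you can peek at what schematool just created; a sketch using the MySQL login from earlier (DBS is one of the real metastore tables, holding the databases Hive knows about):

mysql> use metastore;
mysql> show tables;   -- lists the metadata tables created by schematool
mysql> select DB_ID, NAME, DB_LOCATION_URI from DBS;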
OK, Hive is now installed and ready to use