Learning notes Hive - query optimization

1, View

1.1 Hive's view

  • View is a pseudo table created based on the basic table of the database. The definition of view is stored in the database without data items. Data items still exist in the basic table. It can be used as an abstraction layer to publish data to downstream users.
  • At present, Hive version supports logical view and does not support physical view. Therefore, Hive's data warehouse directory cannot find the view, but it can be found in the Mysql metabase.
  • The view can only be queried without data insertion and modification, which can improve the security of data.
  • The view is fixed when the view is created, and subsequent changes to the base table (such as adding columns) will not be reflected in the view.
  • If the view definition contains an ORDER BY/LIMIT statement, the ORDER BY/LIMIT statement operation is also performed when querying the view, and the priority defined in the view is higher.

1.2 advantages of reference view

  • Use views to reduce query complexity
  • Use views to limit data filtered based on criteria

1.3 view creation and Application

1.3.1 understand customer needs

1.3.2. Create view

CREATE VIEW [IF NOT EXISTS] [db_name.]view_name   -- View name  
[(column_name [COMMENT column_comment], ...) ]    --Listing 
 [COMMENT view_comment]  --View notes  
[TBLPROPERTIES (property_name = property_value, ...)]  --Additional information  

1.3.3 view and delete view

1. View a view

desc view_name;

2. View a view's details

desc formatted view_name;

3. Delete view

DROP VIEW [IF EXISTS] [db_name.]view_name;

Task 1

Turn nested subqueries in the following nested queries into views

select t.types,t.goods,t.t_g_count from (select c.types,c.goods,count(1) t_g_count,row_number() over(partition by c.types order by count(1) desc) rank from(select a.*,b.types from goodsorder a left outer join goodstypes b on a.goods=b.goods) c group by c.types,c.goods) t where rank<=10

Results (partial):

2, Index

2.1 Hive index

  • Hive does not have the concept of primary key, but can establish an index. The design goal of the index is to improve the query speed of some columns of the table.
  • Creating an index on the specified column will generate an index table. The fields in the index table include: the value of the index column, the HDFS file path corresponding to the value, and the offset of the value in the file.
  • When the query involves index fields, first look up the HDFS file path and offset corresponding to the index column value in the index table, so as to avoid full table scanning.

Index table:

2.2 advantages of index

  • It can avoid full table scanning and resource waste
  • It can speed up the query of statements containing group by

2.3 index creation and Application

2.3.1. Create index

CREATE INDEX index_name     --Index name  
ON TABLE base_table_name (col_name, ...)  --Indexed columns  
AS index_type    --Index type  
[WITH DEFERRED REBUILD]    --Rebuild index 
 [IDXPROPERTIES (property_name=property_value, ...)]  --Index additional properties 
 [IN TABLE index_table_name]    --The name of the index table 
 [     [ ROW FORMAT ...] STORED AS ...      
 | STORED BY ...  ]   --Index table row separator, storage format 
 [LOCATION hdfs_path]  --Index table storage location 
 [TBLPROPERTIES (...)]   --Index table properties  
[COMMENT "index comment"];  --Indexing Notes 

2.3.2 automatic use of index

(set before index creation)

SET hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;
SET hive.optimize.index.filter=true;
SET hive.optimize.index.filter.compact.minsize=0;

2.3.3. View and delete index

1. Displays the indexes of all columns on the table


2. Delete index

DROP INDEX [IF EXISTS] index_name ON table_name;

Task 2

1. Create an index for the goodorders ID field

2. Query the order of the customer with id=10

3. Count the quantity of goods purchased by each customer according to the id group

3, Storage format

3.1 file storage format

  • Refers to the format of Hive table data storage
  • The default is text file format
  • There are row storage and column storage

3.2. Storage format setting

STORED AS (TextFile|RCFile|SequenceFile|ORC|Parquet)
Storage formatStorage modeCompression modecharacteristic
textFileStore by rowGzip,Bzip2The consumption of storage space is relatively large, and the compressed text cannot be divided and merged. The efficiency of query is the lowest. It can be stored directly, and the speed of loading data is the highest
SequenceFileStore by rowNONE,RECORD,BLOCK. The compression ratio of record is low, and block compression is generally recommendedThe storage space consumption is the largest. The compressed files can be divided and merged, and the query efficiency is high. It needs to be loaded through text file conversion
RCFileStore by columnThe storage space is small and the query efficiency is high. It needs to be loaded through text file conversion, and the loading speed is low. Compressed fast column access. The operation performance of reading full data may not have obvious advantages over sequence file
ORCFileStore by columnzlib(default),snappyFast compression, fast column access and higher efficiency than rcfile. It is an improved version of rcfile and does not support other query engines, such as impala
parquetStore by columnParquet has low compression and query efficiency, and does not support update, insert and ACID. However, parquet supports Impala query engine

The default is the TextFile storage format

Impala is CloudParquet, which has many similarities with ORC, but parquet is more intended to become a general storage format on hadoop. It can be used in combination with impala, Spark, Pig and other engines. It can specify the compression method of each column to achieve more efficient compression. Parquet is designed to support the storage of complex nested data, such as json

Task 3

Data (person.parquet):


1. Create the table person and set the data storage format of the table to parquet

2. Save the user.parquet file into the table

3. Query data

Tags: Database Big Data hive

Posted on Wed, 06 Oct 2021 08:36:41 -0400 by shawon22