hiveSQL execution plan (explain the most detailed in the whole network!!)

0 - Preface

Hive SQL's execution plan describes the overall outline of the actual execution of SQL. Through the execution plan, you can understand the execution logic of the SQL program when it is converted into the corresponding computing engine. If you master the execution logic, you can better grasp the bottleneck of the program, so as to achieve more targeted optimization. It can be said that the execution plan is a key to open the door of SQL optimization.
To learn the SQL execution plan, you need to learn the command to view the execution plan: explain. Adding the keyword explain in front of the SQL of the query statement is the basic method to view the execution plan.
Learning to explain can bring great convenience to the use of hive in our work!

1 - view SQL execution plan

The execution plan provided by Hive can view the following information:

explain: view the basic information of the execution plan;
explain dependency: the use of dependency in an explain statement produces additional information about the inputs in the plan. It shows the various properties entered
explain authorization: view information about permissions related to SQL operations;
explain vectorization: view the vectorization description information of SQL to show why Map and Reduce are not vectorized. Support from Hive 2.3.0;
explain analyze: annotate the plan with the actual number of lines. Support from Hive 2.2.0;
explain cbo: output the plan generated by the compute optimizer. CBO is supported from hive version 4.0.0;
explain locks: This is useful for understanding which LOCKS the system will acquire to run the specified query. LOCKS is supported from Hive 3.2.0;
explain ast: output the abstract syntax tree of the query. AST was deleted in hive version 2.1.0. There is a bug. Dumping AST may lead to OOM errors, which will be repaired in version 4.0.0;
explain extended: add extended to output additional information about the plan. This is usually physical information, such as file names, which is of little use to us;

1.1-explain usage

Hive provides the explain command to show the execution plan of a query. This execution plan is very helpful for us to understand the underlying principle, hive tuning, troubleshooting data skew, etc.

The syntax used is as follows:

explain query;

In hive cli, enter the following command (hive 2.3.7):

explain select sum(id) from test1;

Results obtained:

STAGE DEPENDENCIES: Stage-1 is a root stage Stage-0 depends on stages: Stage-1 STAGE PLANS: Stage: Stage-1 Map Reduce Map Operator Tree: TableScan alias: test1 Statistics: Num rows: 6 Data size: 75 Basic stats: COMPLETE Column stats: NONE Select Operator expressions: id (type: int) outputColumnNames: id Statistics: Num rows: 6 Data size: 75 Basic stats: COMPLETE Column stats: NONE Group By Operator aggregations: sum(id) mode: hash outputColumnNames: _col0 Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE Reduce Output Operator sort order: Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE value expressions: _col0 (type: bigint) Reduce Operator Tree: Group By Operator aggregations: sum(VALUE._col0) mode: mergepartial outputColumnNames: _col0 Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE File Output Operator compressed: false Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE table: input format: org.apache.hadoop.mapred.SequenceFileInputFormat output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe Stage: Stage-0 Fetch Operator limit: -1 Processor Tree: ListSink

How do you feel after reading the above contents? Don't worry. Each parameter will be explained in detail below. I believe you will be able to read the explain query results after learning the following contents.

A HIVE query is transformed into a sequence of one or more stages (directed acyclic graph DAG). These stages can be MapReduce stage s, stages responsible for metadata storage, or stages responsible for file system operations (such as moving and renaming).

Let's split the above results into two parts, starting from the outermost layer:

stage dependencies: dependencies between stages
stage plan: the execution plan of each stage

Let's first look at the stage dependencies in the first part, which contains two stages. Stage-1 is the root stage, indicating that this is the starting stage. Stage-0 depends on stage-1. After stage-1 is executed, stage-0 is executed.

Let's look at the second part of the stage plan. There is a Map Reduce and an MR implementation plan is divided into two parts

Map Operator Tree: the execution plan tree on the MAP side
Reduce Operator Tree: the execution plan tree on the reduce side

The two execution plan trees contain the operator of the sql statement

TableScan: table scanning operation. The first operation on the map side must be to load a table, so it is a table scanning operation. Common attributes:

alias: table name
Statistics: table statistics, including the number and size of data in the table

Select Operator: select operation. Common properties:

expressions: required field name and field type
outputColumnNames: the name of the output column
Statistics: table statistics, including the number and size of data in the table

Group By Operator: group aggregation operation. Common attributes:

aggregations: displays aggregate function information
Mode: aggregation mode. The values are hash: random aggregation, which is hash partition; partial: local aggregation; Final: final aggregation
keys: grouped field. If there is no grouping, there is no such field
outputColumnNames: output column names after aggregation
Statistics: table statistics, including the number and size of data after grouping and aggregation

Reduce Output Operator: output to reduce operation. Common attributes:

sort order: do not sort if the value is empty; The value is + positive sort, and the value is - reverse sort; The columns with a value of ± are sorted in two columns. The first column is in positive order and the second column is in reverse order

Filter Operator: filter operation, common attributes:

Predict: filter condition. If where id > = 1 in sql statement, it will be displayed here (ID > = 1)

Map Join Operator: join operation, common attributes:

condition map: join mode, such as inner join 0 to 1 left outer join 0 to 2
Keys: condition field of join
outputColumnNames: the fields output after the join is completed
Statistics: the number and size of data pieces generated after the join is completed

File Output Operator: file output operation, common attributes:

compressed: whether to compress
Table: table information, including input / output file format, serialization, etc

Fetch Operator client obtains data. Common attributes:

Limit. A value of - 1 means no limit on the number of entries. Other values are the number of entries

2-explain usage scenario

This section introduces the convenience and confusion that explain can bring to our production practice

Will the 2.1-join statement filter null values?

Now, enter the following query plan statement in hive cli

select a.id,b.user_name from test1 a join test2 b on a.id=b.id;

Q: will the above join statement filter the value with null id?
Execute the following statement:

explain select a.id,b.user_name from test1 a join test2 b on a.id=b.id;

Let's look at the results (only part of the output information is intercepted to adapt to the page display):

TableScan alias: a Statistics: Num rows: 6 Data size: 75 Basic stats: COMPLETE Column stats: NONE Filter Operator predicate: id is not null (type: boolean) Statistics: Num rows: 6 Data size: 75 Basic stats: COMPLETE Column stats: NONE Select Operator expressions: id (type: int) outputColumnNames: _col0 Statistics: Num rows: 6 Data size: 75 Basic stats: COMPLETE Column stats: NONE HashTable Sink Operator keys: 0 _col0 (type: int) 1 _col0 (type: int) ...

From the above results, we can see that the line predict: ID is not null indicates that the null value of the associated field will be automatically filtered during join, but the null value will not be automatically filtered by left join or full join. You can try it yourself.

Will the 2.2-group by statement sort?

Look at the following sql:

select id,max(user_name) from test1 group by id;

Q: does the group by statement sort?
Take a direct look at the results after explain (only part of the output information is intercepted to adapt to the page display)

TableScan alias: test1 Statistics: Num rows: 9 Data size: 108 Basic stats: COMPLETE Column stats: NONE Select Operator expressions: id (type: int), user_name (type: string) outputColumnNames: id, user_name Statistics: Num rows: 9 Data size: 108 Basic stats: COMPLETE Column stats: NONE Group By Operator aggregations: max(user_name) keys: id (type: int) mode: hash outputColumnNames: _col0, _col1 Statistics: Num rows: 9 Data size: 108 Basic stats: COMPLETE Column stats: NONE Reduce Output Operator key expressions: _col0 (type: int) sort order: + Map-reduce partition columns: _col0 (type: int) Statistics: Num rows: 9 Data size: 108 Basic stats: COMPLETE Column stats: NONE value expressions: _col1 (type: string) ...

Let's look at the Group By Operator. There are keys: id (type: int) instructions for grouping by id, and then look down, and sort order: + instructions for sorting by id field in positive order.

2.3 - which sql is efficient?

Observe the following two sql statements:

SELECT a.id, b.user_name FROM test1 a JOIN test2 b ON a.id = b.id WHERE a.id > 2;

SELECT a.id, b.user_name FROM (SELECT * FROM test1 WHERE id > 2) a JOIN test2 b ON a.id = b.id;

The output results of these two sql statements are the same, but which sql is efficient?

Some people say that the first sql has high execution efficiency, because the second sql has sub queries, which will affect the performance;
Some people say that the second sql has high execution efficiency, because after filtering, the number of entries in the join is reduced, so the execution efficiency is high.
Which sql is more efficient? Let's add explain in front of the sql statement and see the execution plan!

Add explain before the first sql statement to get the following results:

hive (default)> explain select a.id,b.user_name from test1 a join test2 b on a.id=b.id where a.id >2; OK Explain STAGE DEPENDENCIES: Stage-4 is a root stage Stage-3 depends on stages: Stage-4 Stage-0 depends on stages: Stage-3 STAGE PLANS: Stage: Stage-4 Map Reduce Local Work Alias -> Map Local Tables: $hdt$_0:a Fetch Operator limit: -1 Alias -> Map Local Operator Tree: $hdt$_0:a TableScan alias: a Statistics: Num rows: 6 Data size: 75 Basic stats: COMPLETE Column stats: NONE Filter Operator predicate: (id > 2) (type: boolean) Statistics: Num rows: 2 Data size: 25 Basic stats: COMPLETE Column stats: NONE Select Operator expressions: id (type: int) outputColumnNames: _col0 Statistics: Num rows: 2 Data size: 25 Basic stats: COMPLETE Column stats: NONE HashTable Sink Operator keys: 0 _col0 (type: int) 1 _col0 (type: int) Stage: Stage-3 Map Reduce Map Operator Tree: TableScan alias: b Statistics: Num rows: 6 Data size: 75 Basic stats: COMPLETE Column stats: NONE Filter Operator predicate: (id > 2) (type: boolean) Statistics: Num rows: 2 Data size: 25 Basic stats: COMPLETE Column stats: NONE Select Operator expressions: id (type: int), user_name (type: string) outputColumnNames: _col0, _col1 Statistics: Num rows: 2 Data size: 25 Basic stats: COMPLETE Column stats: NONE Map Join Operator condition map: Inner Join 0 to 1 keys: 0 _col0 (type: int) 1 _col0 (type: int) outputColumnNames: _col0, _col2 Statistics: Num rows: 2 Data size: 27 Basic stats: COMPLETE Column stats: NONE Select Operator expressions: _col0 (type: int), _col2 (type: string) outputColumnNames: _col0, _col1 Statistics: Num rows: 2 Data size: 27 Basic stats: COMPLETE Column stats: NONE File Output Operator compressed: false Statistics: Num rows: 2 Data size: 27 Basic stats: COMPLETE Column stats: NONE table: input format: org.apache.hadoop.mapred.SequenceFileInputFormat output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe Local Work: Map Reduce Local Work Stage: Stage-0 Fetch Operator limit: -1 Processor Tree: ListSink

Add explain before the second sql statement to get the following results:

hive (default)> explain select a.id,b.user_name from(select * from test1 where id>2 ) a join test2 b on a.id=b.id; OK Explain STAGE DEPENDENCIES: Stage-4 is a root stage Stage-3 depends on stages: Stage-4 Stage-0 depends on stages: Stage-3 STAGE PLANS: Stage: Stage-4 Map Reduce Local Work Alias -> Map Local Tables: $hdt$_0:test1 Fetch Operator limit: -1 Alias -> Map Local Operator Tree: $hdt$_0:test1 TableScan alias: test1 Statistics: Num rows: 6 Data size: 75 Basic stats: COMPLETE Column stats: NONE Filter Operator predicate: (id > 2) (type: boolean) Statistics: Num rows: 2 Data size: 25 Basic stats: COMPLETE Column stats: NONE Select Operator expressions: id (type: int) outputColumnNames: _col0 Statistics: Num rows: 2 Data size: 25 Basic stats: COMPLETE Column stats: NONE HashTable Sink Operator keys: 0 _col0 (type: int) 1 _col0 (type: int) Stage: Stage-3 Map Reduce Map Operator Tree: TableScan alias: b Statistics: Num rows: 6 Data size: 75 Basic stats: COMPLETE Column stats: NONE Filter Operator predicate: (id > 2) (type: boolean) Statistics: Num rows: 2 Data size: 25 Basic stats: COMPLETE Column stats: NONE Select Operator expressions: id (type: int), user_name (type: string) outputColumnNames: _col0, _col1 Statistics: Num rows: 2 Data size: 25 Basic stats: COMPLETE Column stats: NONE Map Join Operator condition map: Inner Join 0 to 1 keys: 0 _col0 (type: int) 1 _col0 (type: int) outputColumnNames: _col0, _col2 Statistics: Num rows: 2 Data size: 27 Basic stats: COMPLETE Column stats: NONE Select Operator expressions: _col0 (type: int), _col2 (type: string) outputColumnNames: _col0, _col1 Statistics: Num rows: 2 Data size: 27 Basic stats: COMPLETE Column stats: NONE File Output Operator compressed: false Statistics: Num rows: 2 Data size: 27 Basic stats: COMPLETE Column stats: NONE table: input format: org.apache.hadoop.mapred.SequenceFileInputFormat output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe Local Work: Map Reduce Local Work Stage: Stage-0 Fetch Operator limit: -1 Processor Tree: ListSink

What do you find? Except for different table aliases, other execution plans are exactly the same. They are filtered by where conditions first and associated by join conditions. It shows that hive bottom layer will automatically help us optimize, so the execution efficiency of these two sql statements is the same.
The above only lists three familiar and somewhat confused examples in our production. explain has many other uses, such as checking the dependence of stage, checking data skew, hive optimization, etc. small partners can try it by themselves.

Usage of 3-explain dependency

explain dependency is used to describe the data source required for a piece of SQL. The output is a json format data, which contains the following two parts:

input_partitions: describes the partition of the data source table on which a segment of SQL depends. The partition name list is stored in it. If all tables contained in the whole segment of SQL are non partitioned tables, it will be empty.
input_tables: describes a SQL dependent data source table, which stores a list of Hive table names.

Use explain dependency to view the non partitioned ordinary table of SQL query, and enter the following command in hive cli:

explain dependency select s_age,count(1) num from student_orc;

The results are as follows:

{"input_partitions":[],"input_tables":[{"tablename":"default@student_tb _orc","tabletype":"MANAGED_TABLE"}]}

Use explain dependency to view the SQL query partition table, and enter the following command in hive cli:

explain dependency select s_age,count(1) num from student_orc_partition;

Results obtained:

{"input_partitions":[{"partitionName":"default@student_orc_partition@ part=0"}, {"partitionName":"default@student_orc_partition@part=1"}, {"partitionName":"default@student_orc_partition@part=2"}, {"partitionName":"default@student_orc_partition@part=3"}, {"partitionName":"default@student_orc_partition@part=4"}, {"partitionName":"default@student_orc_partition@part=5"}, {"partitionName":"default@student_orc_partition@part=6"}, {"partitionName":"default@student_orc_partition@part=7"}, {"partitionName":"default@student_orc_partition@part=8"}, {"partitionName":"default@student_orc_partition@part=9"}], "input_tables":[{"tablename":"default@student_orc_partition", "tabletype":"MANAGED_TABLE"}]

There are two usage scenarios for explain dependency:

Scenario 1: quick exclusion. Quickly eliminate the exception of task data output caused by failure to read the data of the corresponding partition. For example, in a task divided by days, the upstream task is abnormal or runs away due to uncontrollable factors in the production process, resulting in exceptions caused by the downstream task. In this way, you can quickly check whether there are exceptions in the partition read by SQL.
Scenario 2: sort out the table input to help understand the operation of the program, especially the dependent input with multiple subqueries and multi table connections.

Here are two cases to see the practical application of explain dependency:

3.1 - identify seemingly equivalent codes

There are two seemingly equal sql statements:

Code 1:

select a.s_no from student_orc_partition a inner join student_orc_partition_only b on a.s_no=b.s_no and a.part=b.part and a.part>=1 and a.part<=2;

Code 2:

select a.s_no from student_orc_partition a inner join student_orc_partition_only b on a.s_no=b.s_no and a.part=b.part where a.part>=1 and a.part<=2;

Let's take a look at the output of the above two code explain dependency:

explain dependency result of code 1:

{"input_partitions": [{"partitionName":"default@student_orc_partition@part=0"}, {"partitionName":"default@student_orc_partition@part=1"}, {"partitionName":"default@student_orc_partition@part=2"}, {"partitionName":"default@student_orc_partition_only@part=1"}, {"partitionName":"default@student_orc_partition_only@part=2"}], "input_tables": [{"tablename":"default@student_orc_partition","tabletype":"MANAGED_TABLE"}, {"tablename":"default@student_orc_partition_only","tabletype":"MANAGED_TABLE"}]}

explain dependency result of code 2:

{"input_partitions": [{"partitionName":"default@student_orc_partition@part=1"}, {"partitionName" : "default@student_orc_partition@part=2"}, {"partitionName" :"default@student_orc_partition_only@part=1"}, {"partitionName":"default@student_orc_partition_only@part=2"}], "input_tables": [{"tablename":"default@student_orc_partition","tabletype":"MANAGED_TABLE"}, {"tablename":"default@student_orc_partition_only","tabletype":"MANAGED_TABLE"}]}

It can be seen from the above output results that the above two SQL statements are not equivalent. After adding non equivalent filter conditions to the connection condition (on) in the inner join, code 1 does not filter the left and right tables of the inner join according to the filter conditions, and the inner join will read more partition data with part=0 during execution. In code 2, unqualified partitions are filtered out.

3.2 - identify differences in the range of data read by SQL

There are two codes as follows:
Code 1:

explain dependency select a.s_no from student_orc_partition a left join student_orc_partition_only b on a.s_no=b.s_no and a.part=b.part and b.part>=1 and b.part<=2;

Code 2:

explain dependency select a.s_no from student_orc_partition a left join student_orc_partition_only b on a.s_no=b.s_no and a.part=b.part and a.part>=1 and a.part<=2;

Is the data reading range of the above two codes the same? The answer is different. Let's look at it through explain dependency:

explain dependency result of code 1:

{"input_partitions": [{"partitionName": "default@student_orc_partition@part=0"}, {"partitionName":"default@student_orc_partition@part=1"}, ...Omit 7 partitions in the middle {"partitionName":"default@student_orc_partition@part=9"}, {"partitionName":"default@student_orc_partition_only@part=1"}, {"partitionName":"default@student_orc_partition_only@part=2"}], "input_tables": [{"tablename":"default@student_orc_partition","tabletype":"MANAGED_TABLE"}, {"tablename":"default@student_orc_partition_only","tabletype":"MANAGED_TABLE"}]}

explain dependency result of code 2:

{"input_partitions": [{"partitionName":"default@student_orc_partition@part=0"}, {"partitionName":"default@student_orc_partition@part=1"}, ...Omit 7 partitions in the middle {"partitionName":"default@student_orc_partition@part=9"}, {"partitionName":"default@student_orc_partition_only@part=0"}, {"partitionName":"default@student_orc_partition_only@part=1"}, ...Omit 7 partitions in the middle {"partitionName":"default@student_orc_partition_only@part=9"}], "input_tables": [{"tablename":"default@student_orc_partition","tabletype":"MANAGED_TABLE"}, {"tablename":"default@student_orc_partition_only","tabletype":"MANAGED_TABLE"}]}

It can be seen that non equivalent filtering conditions are added to the connection conditions for the left external connection. If the filtering conditions are applied to the right table (table b), the right table can only scan two partitions, but the left table (table a) will scan the whole table. If the filtering condition is for the left table, it does not play the role of filtering at all, and the two tables will be scanned for the whole table. In this case, as with all external connections, it is necessary to scan the two data tables.

In the process of use, it is easy to think that code fragment 2 can filter data like code fragment 1. By viewing the output result of explain dependency, you can know that this is not the case.

4-explain authorization usage

Through explain authorization, you can know the data sources (INPUTS) and data OUTPUTS of the current SQL access, as well as the CURRENT_USER and OPERATION of the current Hive.

In hive cli, enter the following command:

explain authorization select variance(s_score) from student_tb_orc;

The results are as follows:

INPUTS: default@student_tb_orc OUTPUTS: hdfs://node01:8020/tmp/hive/hdfs/cbf182a5-8258-4157-9194- 90f1475a3ed5/-mr-10000 CURRENT_USER: hdfs OPERATION: QUERY AUTHORIZATION_FAILURES: No privilege 'Select' found for inputs { database:default, table:student_ tb_orc, columnName:s_score}

From the above information:

The data source of the above case is the student in the defalut database_ tb_ Orc table;
The output path of the data is hdfs://node01:8020/tmp/hive/hdfs/cbf182a5-8258-4157-9194-90f1475a3ed5/-mr-10000；
The current operation user is hdfs, and the operation is query;
Looking at the above information, we will also see authorization_ The failures message indicates that you do not have query permission for the current input, but it can run normally if you run the above SQL. Why does this happen? Hive does not perform permission verification when permission management is not configured by default. All users in hive are super administrators. Even if specific users are not empowered, they can query normally.

5 - Summary

Through the above introduction to explain, we can find that there are many contents worthy of our study in explain. Reading the explain execution plan will help us optimize Hive SQL and improve our control over SQL.