The running-in period of Flink and Hive

There has been a lot of feedback from readers: after deploying Flink and integrating Hive as described in the previous article, "Hive is finally waiting for Flink", they ran into a number of bugs and compatibility problems. The wait is over, yet it still doesn't work. So I wrote this article as a companion piece.

Review

In the previous article, the author used CDH 5.16.2, whose Hive version is 1.1.0 (the CDH 5.x series never ships a Hive higher than 1.1.0 — baffling, isn't it?). The Flink source code itself is not well compatible with Hive 1.1.0 and has many problems. To make the two work together, the author modified the Flink code against the CDH 5.16.2 environment, then repackaged and deployed it.

In fact, for many open source projects such as Apache Atlas and Apache Spark, Hive 1.2.x and Hive 1.1.x are close enough that in most cases replacing a few Jar packages solves the compatibility problem. In my environment, some of the Hive 1.2.1 Jar packages can be used in place of those from Hive 1.1.0. At the beginning of this article the author will solve this problem, and then fill in the content missing from the previous one.

A tangle of problems

Based on readers' feedback, the author groups all the questions into three categories:

  1. How does Flink connect to Hive? Besides the API, is there a command-line tool similar to spark-sql?
  2. The Hadoop environment is not recognized or the configuration file is not found
  3. Dependent package, class or method not found

1. How Flink connects to Hive

Some readers don't know how to configure Flink to connect to Hive's Catalog. Here is a complete conf/sql-client-hive.yaml example:

catalogs:
  - name: staginghive
    type: hive
    hive-conf-dir: /etc/hive/conf
    hive-version: 1.2.1

execution:
  planner: blink
  type: batch
  time-characteristic: event-time
  periodic-watermarks-interval: 200
  result-mode: table
  max-table-result-rows: 1000000
  parallelism: 1
  max-parallelism: 128
  min-idle-state-retention: 0
  max-idle-state-retention: 0
  current-catalog: staginghive
  current-database: ssb
  restart-strategy:
    type: fallback

deployment:
  response-timeout: 5000
  gateway-address: ""
  gateway-port: 0
  m: yarn-cluster
  yn: 2
  ys: 5
  yjm: 1024
  ytm: 2048

The sql-client-hive.yaml configuration file contains:

  1. catalogs: the Hive configuration file path and Hive version.
  2. deployment: the Yarn deployment information.
  3. execution: the execution engine settings. Here the blink planner is configured in batch mode. Batch mode is relatively stable, suits traditional batch processing jobs, and is fault-tolerant; it is also recommended to enable compression when intermediate data spills to disk. Besides batch, Flink also supports streaming mode.
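For the spill compression mentioned above, one relevant knob (assuming Flink 1.10 — verify the key name against your version's configuration reference) is the blocking-shuffle compression option in flink-conf.yaml:

```yaml
# flink-conf.yaml: compress intermediate data of batch (blocking) shuffles
taskmanager.network.blocking-shuffle.compression.enabled: true
```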

■ Flink SQL CLI tools

Similar to the spark-sql command, Flink provides a SQL CLI tool, namely the sql-client.sh script. In Flink 1.10 the Flink SQL CLI gained many new features, which I will explain later.

sql-client.sh is used as follows:

$ bin/sql-client.sh embedded -d conf/sql-client-hive.yaml

2. The Hadoop environment cannot be identified or the configuration file cannot be found

As mentioned in the previous article, a CDH gateway (including the Hadoop and Hive clients) should be deployed on the machine where Flink is deployed. In addition, some environment variables need to be configured, as follows:

export HADOOP_CONF_DIR=/etc/hadoop/conf
export YARN_CONF_DIR=/etc/hadoop/conf
export HIVE_HOME=/opt/cloudera/parcels/CDH/lib/hive
export HIVE_CONF_DIR=/etc/hive/conf
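A quick sanity check of this setup can be done from the shell. The sketch below only probes for the standard hadoop and hive client commands and echoes the conf-dir variables, so it is safe to run anywhere:

```shell
# Check that the CDH gateway clients are on PATH before starting the Flink SQL CLI.
for tool in hadoop hive; do
  if command -v "$tool" >/dev/null 2>&1; then
    echo "$tool found at $(command -v "$tool")"
  else
    echo "$tool NOT on PATH - deploy the CDH gateway first"
  fi
done
# The conf dirs must point at the cluster's client configuration.
echo "HADOOP_CONF_DIR=${HADOOP_CONF_DIR:-<unset>}"
echo "HIVE_CONF_DIR=${HIVE_CONF_DIR:-<unset>}"
```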

3. Dependent package, class or method not found

First, check the lib directory in the Flink home directory:

$ tree  lib
lib
├── flink-connector-hive_2.11-1.10.0.jar
├── flink-dist_2.11-1.10.0.jar
├── flink-hadoop-compatibility_2.11-1.10.0.jar
├── flink-shaded-hadoop-2-2.6.0-cdh5.16.2-9.0.jar
├── flink-table_2.11-1.10.0.jar
├── flink-table-blink_2.11-1.10.0.jar
├── hive-exec-1.1.0-cdh5.16.2.jar
├── hive-metastore-1.1.0-cdh5.16.2.jar
├── libfb303-0.9.3.jar
├── log4j-1.2.17.jar
└── slf4j-log4j12-1.7.15.jar

If the first two problems are solved, execute the following command:

$ bin/sql-client.sh embedded -d conf/sql-client-hive.yaml

Errors, errors, and more errors:

Caused by: java.lang.ClassNotFoundException: org.apache.commons.logging.LogFactory

Actually, before running the sql-client.sh script, you need to make the Hadoop dependencies available on the classpath. Rather than adding the Jar packages one by one (unless some readers enjoy that), the author suggests a convenient way: set the HADOOP_CLASSPATH environment variable (it can be added to ~/.bash_profile):

export HADOOP_CLASSPATH=`hadoop classpath`

Try again:

$ bin/sql-client.sh embedded -d conf/sql-client-hive.yaml

Sorry, it still fails:

Caused by: org.apache.flink.table.client.gateway.SqlExecutionException: Could not create execution context.
    at org.apache.flink.table.client.gateway.local.ExecutionContext$Builder.build(ExecutionContext.java:753)
    at org.apache.flink.table.client.gateway.local.LocalExecutor.openSession(LocalExecutor.java:228)
    at org.apache.flink.table.client.SqlClient.start(SqlClient.java:98)
    at org.apache.flink.table.client.SqlClient.main(SqlClient.java:178)
Caused by: org.apache.flink.table.catalog.exceptions.CatalogException: Failed to create Hive Metastore client

This is a version incompatibility between the Hive 1.1.0 Jar packages and Flink. The solution is as follows:

  1. Download apache-hive-1.2.1.
  2. Replace the Hive Jar packages in the Flink lib directory: delete hive-exec-1.1.0-cdh5.16.2.jar, hive-metastore-1.1.0-cdh5.16.2.jar and libfb303-0.9.3.jar, then add hive-exec-1.2.1.jar, hive-metastore-1.2.1.jar and libfb303-0.9.2.jar. Check the lib directory again:
$ tree lib
lib
├── flink-connector-hive_2.11-1.10.0.jar
├── flink-dist_2.11-1.10.0.jar
├── flink-hadoop-compatibility_2.11-1.10.0.jar
├── flink-shaded-hadoop-2-2.6.0-cdh5.16.2-9.0.jar
├── flink-table_2.11-1.10.0.jar
├── flink-table-blink_2.11-1.10.0.jar
├── hive-exec-1.2.1.jar
├── hive-metastore-1.2.1.jar
├── libfb303-0.9.2.jar
├── log4j-1.2.17.jar
└── slf4j-log4j12-1.7.15.jar
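The replacement in step 2 can be scripted roughly as follows. FLINK_HOME and HIVE121_HOME are assumed names here — adjust them to wherever your Flink 1.10 install and the unpacked apache-hive-1.2.1 download actually live:

```shell
# Swap the CDH Hive 1.1.0 jars in Flink's lib/ for their Hive 1.2.1 equivalents.
FLINK_LIB="${FLINK_HOME:-/opt/flink-1.10.0}/lib"
HIVE121_LIB="${HIVE121_HOME:-/opt/apache-hive-1.2.1-bin}/lib"

# Remove the incompatible 1.1.0 jars (rm -f: no error if already gone).
for jar in hive-exec-1.1.0-cdh5.16.2.jar \
           hive-metastore-1.1.0-cdh5.16.2.jar \
           libfb303-0.9.3.jar; do
  rm -f "$FLINK_LIB/$jar"
done

# Copy in the 1.2.1 jars if the download is present at the assumed path.
for jar in hive-exec-1.2.1.jar hive-metastore-1.2.1.jar libfb303-0.9.2.jar; do
  if [ -f "$HIVE121_LIB/$jar" ]; then
    cp "$HIVE121_LIB/$jar" "$FLINK_LIB/"
  fi
done
```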

Finally:

$ bin/sql-client.sh embedded -d conf/sql-client-hive.yaml

This time, readers can see the cute little squirrel holding a chestnut — the Flink SQL CLI has started successfully.

Flink SQL CLI practice

In Flink 1.10 (currently at the RC1 stage), the community has made many improvements to the SQL CLI, such as support for views, more data types and DDL statements, partitioned reads and writes, and INSERT OVERWRITE. It now exposes more of the TableEnvironment API and is more convenient to use.

Next, the author explains Flink SQL CLI in detail.

0. Help

Log in to the Flink SQL client by executing the following command:

$ bin/sql-client.sh embedded -d conf/sql-client-hive.yaml
Flink SQL>

Execute HELP to view the commands supported by Flink SQL, as follows:

  • CREATE TABLE
  • DROP TABLE
  • CREATE VIEW
  • DESCRIBE
  • DROP VIEW
  • EXPLAIN
  • INSERT INTO
  • INSERT OVERWRITE
  • SELECT
  • SHOW FUNCTIONS
  • USE CATALOG
  • SHOW TABLES
  • SHOW DATABASES
  • SOURCE
  • USE
  • SHOW CATALOGS

1. Hive operation

■ 1.1 creating tables and importing data

To make the experiment easy to reproduce, the author uses the SSB dbgen tool to generate test data; readers can also experiment with existing data in their own test environment.

For details on how to create the tables and load data in Hive, please refer to the author's earlier project: https://github.com/MLikeWater/ssb-kylin.
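For reference, a minimal sketch of the Hive DDL for one of these tables — the customer table, with columns matching the SSB schema used in this article. The delimiter is an assumption based on dbgen's pipe-separated output, so check the repository's scripts for the authoritative version:

```sql
-- Sketch: SSB customer table in Hive (see ssb-kylin for the real DDL)
CREATE TABLE IF NOT EXISTS customer (
  c_custkey     int,
  c_name        string,
  c_address     string,
  c_city        string,
  c_nation      string,
  c_region      string,
  c_phone       string,
  c_mktsegment  string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
STORED AS TEXTFILE;
```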

■ 1.2 Hive table

View the Hive tables created in the previous step:

0: jdbc:hive2://xx.xxx.xxx.xxx:10000> show tables;
+--------------+--+
|   tab_name   |
+--------------+--+
| customer     |
| dates        |
| lineorder    |
| p_lineorder  |
| part         |
| supplier     |
+--------------+--+

Readers can run queries in Hive and compare the results with those of Flink SQL.

2. Flink operation

■ 2.1 accessing Hive database through HiveCatalog

Log in to Flink SQL CLI and query catalogs:

$ bin/sql-client.sh embedded -d conf/sql-client-hive.yaml
Flink SQL> show catalogs;
default_catalog
staginghive

Flink SQL> use catalog staginghive;

show catalogs lists all configured catalogs. Because the author set staginghive as the default catalog in the sql-client-hive.yaml file, the CLI starts in it. To switch to another catalog, use USE CATALOG xxx.

■ 2.2 querying Hive metadata

Query the Hive database and table through Flink SQL:

# Query databases
Flink SQL> show databases;
...
ssb
tmp
...
Flink SQL> use ssb;

# Query table
Flink SQL> show tables;
customer
dates
lineorder
p_lineorder
part
supplier

# Query table structure
Flink SQL> DESCRIBE customer;
root
|-- c_custkey: INT
|-- c_name: STRING
|-- c_address: STRING
|-- c_city: STRING
|-- c_nation: STRING
|-- c_region: STRING
|-- c_phone: STRING
|-- c_mktsegment: STRING

Note that Hive's metadata appears in lower case in the Flink catalog.

■ 2.3 query

Next, run some SQL statements in the Flink SQL CLI. For the complete SQL, refer to the README of https://github.com/MLikeWater/ssb-kylin.

Currently, Flink SQL hits some bugs when parsing the metadata of a Hive view, for example when executing the Q1.1 SQL:

Flink SQL> select sum(v_revenue) as revenue
> from p_lineorder
> left join dates on lo_orderdate = d_datekey
> where d_year = 1993
> and lo_discount between 1 and 3
> and lo_quantity < 25;

[ERROR] Could not execute SQL statement. Reason:
org.apache.calcite.sql.validate.SqlValidatorException: Table 'lineorder' not found; did you mean 'LINEORDER'?

Flink SQL cannot find the underlying table of the view.

The p_lineorder table is a view in Hive, created with the following statement:

CREATE VIEW P_LINEORDER AS
SELECT LO_ORDERKEY,
LO_LINENUMBER,
LO_CUSTKEY,
LO_PARTKEY,
LO_SUPPKEY,
LO_ORDERDATE,
LO_ORDERPRIOTITY,
LO_SHIPPRIOTITY,
LO_QUANTITY,
LO_EXTENDEDPRICE,
LO_ORDTOTALPRICE,
LO_DISCOUNT,
LO_REVENUE,
LO_SUPPLYCOST,
LO_TAX,
LO_COMMITDATE,
LO_SHIPMODE,
LO_EXTENDEDPRICE*LO_DISCOUNT AS V_REVENUE
FROM ssb.LINEORDER;

But Flink SQL does not handle the metadata of Hive view definitions very well. To make the SQL run smoothly, the author drops and recreates this view in Hive, this time in lower case:

0: jdbc:hive2://xx.xxx.xxx.xxx:10000> create view p_lineorder as
select lo_orderkey,
lo_linenumber,
lo_custkey,
lo_partkey,
lo_suppkey,
lo_orderdate,
lo_orderpriotity,
lo_shippriotity,
lo_quantity,
lo_extendedprice,
lo_ordtotalprice,
lo_discount,
lo_revenue,
lo_supplycost,
lo_tax,
lo_commitdate,
lo_shipmode,
lo_extendedprice*lo_discount as v_revenue
from ssb.lineorder;

Then continue to query Q1.1 SQL in Flink SQL CLI:

Flink SQL> select sum(v_revenue) as revenue
> from p_lineorder
> left join dates on lo_orderdate = d_datekey
> where d_year = 1993
> and lo_discount between 1 and 3
> and lo_quantity < 25;

revenue
894280292647

Continue to query Q2.1 SQL:

Flink SQL> select sum(lo_revenue) as lo_revenue, d_year, p_brand
> from p_lineorder
> left join dates on lo_orderdate = d_datekey
> left join part on lo_partkey = p_partkey
> left join supplier on lo_suppkey = s_suppkey
> where p_category = 'MFGR#12' and s_region = 'AMERICA'
> group by d_year, p_brand
> order by d_year, p_brand;

lo_revenue  d_year p_brand
819634128   1998   MFGR#1206
877651232   1998   MFGR#1207
754489428   1998   MFGR#1208
816369488   1998   MFGR#1209
668482306   1998   MFGR#1210
660366608   1998   MFGR#1211
862902570   1998   MFGR#1212
...

Finally, query a Q4.3 SQL:

Flink SQL> select d_year, s_city, p_brand, sum(lo_revenue) - sum(lo_supplycost) as profit
> from p_lineorder
> left join dates on lo_orderdate = d_datekey
> left join customer on lo_custkey = c_custkey
> left join supplier on lo_suppkey = s_suppkey
> left join part on lo_partkey = p_partkey
> where c_region = 'AMERICA'and s_nation = 'UNITED STATES'
> and (d_year = 1997 or d_year = 1998)
> and p_category = 'MFGR#14'
> group by d_year, s_city, p_brand
> order by d_year, s_city, p_brand;

d_year  s_city       p_brand       profit
1998    UNITED ST9   MFGR#1440     6665681

If readers are interested, they can run the remaining SQL statements and compare the results with Spark SQL. In addition, Flink SQL supports EXPLAIN to inspect the execution plan of a query.
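For example, the plan for Q1.1 can be inspected by prefixing the query with EXPLAIN (the plan text itself depends on the planner and version, so none is reproduced here):

```sql
-- Show the execution plan instead of running the query
EXPLAIN
select sum(v_revenue) as revenue
from p_lineorder
left join dates on lo_orderdate = d_datekey
where d_year = 1993
and lo_discount between 1 and 3
and lo_quantity < 25;
```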

■ 2.4 creating views

Similarly, you can create and delete views in Flink SQL CLI as follows:

Flink SQL> create view p_lineorder2 as
> select lo_orderkey,
> lo_linenumber,
> lo_custkey,
> lo_partkey,
> lo_suppkey,
> lo_orderdate,
> lo_orderpriotity,
> lo_shippriotity,
> lo_quantity,
> lo_extendedprice,
> lo_ordtotalprice,
> lo_discount,
> lo_revenue,
> lo_supplycost,
> lo_tax,
> lo_commitdate,
> lo_shipmode,
> lo_extendedprice * lo_discount as v_revenue
> from ssb.lineorder;
[INFO] View has been created.

One thing to emphasize here: currently Flink cannot drop a view that lives in Hive:

Flink SQL> drop view p_lineorder;
[ERROR] Could not execute SQL statement. Reason:
The given view does not exist in the current CLI session. Only views created with a CREATE VIEW statement can be accessed.
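Until that is supported, a view created through Flink into the Hive catalog, such as p_lineorder2 above, has to be dropped from the Hive side instead, e.g. via beeline:

```sql
-- Run in Hive (beeline), not in the Flink SQL CLI
DROP VIEW IF EXISTS ssb.p_lineorder2;
```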

■ 2.5 partition operations

Create a partitioned table in the Hive database:

CREATE TABLE IF NOT EXISTS flink_partition_test (
  id  int,
  name string
) PARTITIONED BY (day string, type string)
stored as textfile;

Next, insert and query data through Flink SQL:

# Insert data for static partition
Flink SQL> INSERT INTO flink_partition_test PARTITION (type='Flink', `day`='2020-02-01') SELECT 100001, 'Flink001';

# query
Flink SQL> select * from flink_partition_test;

id       name       day            type
100001   Flink001   2020-02-01     Flink


# Insert dynamic partition
Flink SQL> INSERT INTO flink_partition_test SELECT 100002, 'Spark', '2020-02-02', 'SparkSQL';


# query
Flink SQL> select * from flink_partition_test;

id        name          day           type
100002    Spark         2020-02-02    SparkSQL
100001    Flink001      2020-02-01    Flink


# The combination of dynamic and static partitions is similar, no more demonstration
# Overwrite insert data
Flink SQL> INSERT OVERWRITE flink_partition_test PARTITION (type='Flink') SELECT 100002, 'Spark', '2020-02-08', 'SparkSQL-2.4';

id      name        day             type
100002  Spark       2020-02-02      SparkSQL
100001  Flink001    2020-02-01      Flink

The field day is a reserved keyword in Flink SQL and must be escaped with backticks.
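Concretely, any reference to the day column must be back-quoted, as in this query sketch:

```sql
-- `day` must be escaped; an unquoted day fails to parse in Flink SQL
SELECT id, name, `day`, type
FROM flink_partition_test
WHERE `day` = '2020-02-01';
```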

■ 2.6 other functions

  • 2.6.1 functions

Flink SQL supports built-in and user-defined functions. Built-in functions can be listed with SHOW FUNCTIONS; how to create user-defined functions will be introduced separately later.

  • 2.6.2 setting parameters

Flink SQL supports setting environment parameters. Use the SET command to view and change them:

Flink SQL> set;
deployment.gateway-address=
deployment.gateway-port=0
deployment.m=yarn-cluster
deployment.response-timeout=5000
deployment.yjm=1024
deployment.yn=2
deployment.ys=5
deployment.ytm=2048
execution.current-catalog=staginghive
execution.current-database=ssb
execution.max-idle-state-retention=0
execution.max-parallelism=128
execution.max-table-result-rows=1000000
execution.min-idle-state-retention=0
execution.parallelism=1
execution.periodic-watermarks-interval=200
execution.planner=blink
execution.restart-strategy.type=fallback
execution.result-mode=table
execution.time-characteristic=event-time
execution.type=batch

Flink SQL> set deployment.yjm = 2048;

Summary

In this article, the author has walked through operating a Hive database with Flink SQL in detail, along with some of the features Flink SQL provides.

Of course, Flink SQL still has some limitations when operating Hive databases:

  • Currently, only the TextFile storage format is supported, and no other storage format can be specified
    Only Hive tables stored as TextFile, with the row format serde org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, are supported. Although RCFile, ORC, Parquet, Sequence and other storage formats are implemented, the storage format of a Hive table cannot be recognized automatically; to use the other formats you would need to modify the source code and recompile. However, the community has been testing these formats, and they should be usable in Flink SQL in the near future.
  • OpenCSVSerde support is incomplete
    If a TextFile table uses org.apache.hadoop.hive.serde2.OpenCSVSerde as its row format serde, field types are not recognized correctly: all fields of the Hive table are mapped to the String type.
  • Bucket table is not supported temporarily
  • ACID table is not supported temporarily
  • Flink SQL still offers relatively few optimization features
  • Permission control
    Similar to Spark SQL in this respect: access is currently controlled via HDFS ACLs, and Sentry- or Ranger-based permission control is not implemented yet. Cloudera is, however, developing a Ranger-based strategy for shared access permissions between Spark SQL and Hive, to achieve row/column-level control and audit information.

The Flink community is developing rapidly; all of these problems are temporary and will be solved one by one as new versions are released.

Tags: Mobile SQL hive Hadoop Apache

Posted on Mon, 25 May 2020 03:57:37 -0400 by soccerstar_23