The running-in period of Flink and Hive

There has been a lot of feedback from readers: after deploying Flink and integrating Hive as described in the previous article, "Hive is finally waiting for Flink", they ran into a number of bugs and compatibility problems. The wait is over, yet it still doesn't work. So I wrote this article as a companion piece.

Review

In the previous article, the author used CDH 5.16.2, whose Hive version is 1.1.0 (the CDH 5.x series never ships a Hive higher than 1.1.0 — baffling, isn't it?). The Flink source code itself is not well compatible with Hive 1.1.0 and has many problems. To make the two work together, the author modified the Flink code against the CDH 5.16.2 environment, then repackaged and deployed it.

In fact, for many open source projects such as Apache Atlas and Apache Spark, Hive 1.2.x and Hive 1.1.x are close enough that in most cases replacing a few Jar packages solves the compatibility problem. In my environment, some of the Hive 1.2.1 Jar packages can be used in place of those from Hive 1.1.0. At the beginning of this article the author will solve this problem, and then fill in the content missing from the previous one.

A tangle of problems

Based on readers' feedback, the author groups all the questions into three categories:

  1. How does Flink connect to Hive? Besides the API, is there a command-line tool similar to spark-sql?
  2. The Hadoop environment is not recognized or the configuration file is not found
  3. Dependent package, class or method not found

1. How Flink connects to Hive

Some readers don't know how to configure Flink to connect to Hive's Catalog. Here is a complete conf/sql-client-hive.yaml example:

catalogs:
  - name: staginghive
    type: hive
    hive-conf-dir: /etc/hive/conf
    hive-version: 1.2.1

execution:
  planner: blink
  type: batch
  time-characteristic: event-time
  periodic-watermarks-interval: 200
  result-mode: table
  max-table-result-rows: 1000000
  parallelism: 1
  max-parallelism: 128
  min-idle-state-retention: 0
  max-idle-state-retention: 0
  current-catalog: staginghive
  current-database: ssb
  restart-strategy:
    type: fallback

deployment:
  response-timeout: 5000
  gateway-address: ""
  gateway-port: 0
  m: yarn-cluster
  yn: 2
  ys: 5
  yjm: 1024
  ytm: 2048

The sql-client-hive.yaml configuration file contains:

  1. catalogs: the Hive configuration file path and Hive version.
  2. deployment: the Yarn deployment information.
  3. execution: the execution engine settings. Here the blink planner is configured in batch mode. Batch mode is relatively stable, suits traditional batch processing jobs, and is fault-tolerant; it is also recommended to enable compression when intermediate data spills to disk. Besides batch, Flink also supports streaming mode.
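For the spill compression mentioned above, one relevant knob (assuming Flink 1.10 — verify the key name against your version's configuration reference) is the blocking-shuffle compression option in flink-conf.yaml:

```yaml
# flink-conf.yaml: compress intermediate data of batch (blocking) shuffles
taskmanager.network.blocking-shuffle.compression.enabled: true
```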

■ Flink SQL CLI tools

Similar to the spark-sql command, Flink provides a SQL CLI tool, namely the sql-client.sh script. In Flink 1.10 the Flink SQL CLI gained many new features, which I will explain later.

sql-client.sh is used as follows:

$ bin/sql-client.sh embedded -d conf/sql-client-hive.yaml

2. The Hadoop environment cannot be identified or the configuration file cannot be found

As mentioned in the previous article, a CDH gateway (including the Hadoop and Hive clients) should be deployed on the machine where Flink is deployed. In addition, some environment variables need to be configured, as follows:

export HADOOP_CONF_DIR=/etc/hadoop/conf
export YARN_CONF_DIR=/etc/hadoop/conf
export HIVE_HOME=/opt/cloudera/parcels/CDH/lib/hive
export HIVE_CONF_DIR=/etc/hive/conf
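A quick sanity check of this setup can be done from the shell. The sketch below only probes for the standard hadoop and hive client commands and echoes the conf-dir variables, so it is safe to run anywhere:

```shell
# Check that the CDH gateway clients are on PATH before starting the Flink SQL CLI.
for tool in hadoop hive; do
  if command -v "$tool" >/dev/null 2>&1; then
    echo "$tool found at $(command -v "$tool")"
  else
    echo "$tool NOT on PATH - deploy the CDH gateway first"
  fi
done
# The conf dirs must point at the cluster's client configuration.
echo "HADOOP_CONF_DIR=${HADOOP_CONF_DIR:-<unset>}"
echo "HIVE_CONF_DIR=${HIVE_CONF_DIR:-<unset>}"
```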

3. Dependent package, class or method not found

First, check the lib directory in the Flink home directory:

$ tree  lib
lib
├── flink-connector-hive_2.11-1.10.0.jar
├── flink-dist_2.11-1.10.0.jar
├── flink-hadoop-compatibility_2.11-1.10.0.jar
├── flink-shaded-hadoop-2-2.6.0-cdh5.16.2-9.0.jar
├── flink-table_2.11-1.10.0.jar
├── flink-table-blink_2.11-1.10.0.jar
├── hive-exec-1.1.0-cdh5.16.2.jar
├── hive-metastore-1.1.0-cdh5.16.2.jar
├── libfb303-0.9.3.jar
├── log4j-1.2.17.jar
└── slf4j-log4j12-1.7.15.jar

If the first two problems are solved, execute the following command:

$ bin/sql-client.sh embedded -d conf/sql-client-hive.yaml

Errors, errors, and more errors:

Caused by: java.lang.ClassNotFoundException: org.apache.commons.logging.LogFactory

Actually, before running the sql-client.sh script, you need to make the Hadoop dependencies available on the classpath. Rather than adding the Jar packages one by one (unless some readers enjoy that), the author suggests a convenient way: set the HADOOP_CLASSPATH environment variable (it can be added to ~/.bash_profile):

export HADOOP_CLASSPATH=`hadoop classpath`

Try again:

$ bin/sql-client.sh embedded -d conf/sql-client-hive.yaml

Sorry, it still fails:

Caused by: org.apache.flink.table.client.gateway.SqlExecutionException: Could not create execution context.
    at org.apache.flink.table.client.gateway.local.ExecutionContext$Builder.build(ExecutionContext.java:753)
    at org.apache.flink.table.client.gateway.local.LocalExecutor.openSession(LocalExecutor.java:228)
    at org.apache.flink.table.client.SqlClient.start(SqlClient.java:98)
    at org.apache.flink.table.client.SqlClient.main(SqlClient.java:178)
Caused by: org.apache.flink.table.catalog.exceptions.CatalogException: Failed to create Hive Metastore client

This is a version incompatibility between the Hive 1.1.0 Jar packages and Flink. The solution is as follows:

  1. Download apache-hive-1.2.1.
  2. Replace the Hive Jar packages in the Flink lib directory: delete hive-exec-1.1.0-cdh5.16.2.jar, hive-metastore-1.1.0-cdh5.16.2.jar and libfb303-0.9.3.jar, then add hive-exec-1.2.1.jar, hive-metastore-1.2.1.jar and libfb303-0.9.2.jar. Check the lib directory again:
$ tree lib
lib
├── flink-connector-hive_2.11-1.10.0.jar
├── flink-dist_2.11-1.10.0.jar
├── flink-hadoop-compatibility_2.11-1.10.0.jar
├── flink-shaded-hadoop-2-2.6.0-cdh5.16.2-9.0.jar
├── flink-table_2.11-1.10.0.jar
├── flink-table-blink_2.11-1.10.0.jar
├── hive-exec-1.2.1.jar
├── hive-metastore-1.2.1.jar
├── libfb303-0.9.2.jar
├── log4j-1.2.17.jar
└── slf4j-log4j12-1.7.15.jar
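The replacement in step 2 can be scripted roughly as follows. FLINK_HOME and HIVE121_HOME are assumed names here — adjust them to wherever your Flink 1.10 install and the unpacked apache-hive-1.2.1 download actually live:

```shell
# Swap the CDH Hive 1.1.0 jars in Flink's lib/ for their Hive 1.2.1 equivalents.
FLINK_LIB="${FLINK_HOME:-/opt/flink-1.10.0}/lib"
HIVE121_LIB="${HIVE121_HOME:-/opt/apache-hive-1.2.1-bin}/lib"

# Remove the incompatible 1.1.0 jars (rm -f: no error if already gone).
for jar in hive-exec-1.1.0-cdh5.16.2.jar \
           hive-metastore-1.1.0-cdh5.16.2.jar \
           libfb303-0.9.3.jar; do
  rm -f "$FLINK_LIB/$jar"
done

# Copy in the 1.2.1 jars if the download is present at the assumed path.
for jar in hive-exec-1.2.1.jar hive-metastore-1.2.1.jar libfb303-0.9.2.jar; do
  if [ -f "$HIVE121_LIB/$jar" ]; then
    cp "$HIVE121_LIB/$jar" "$FLINK_LIB/"
  fi
done
```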

Finally:

$ bin/sql-client.sh embedded -d conf/sql-client-hive.yaml

This time, readers can see the cute little squirrel holding a chestnut — the Flink SQL CLI has started successfully.

Flink SQL CLI practice

In Flink 1.10 (currently at the RC1 stage), the community has made many improvements to the SQL CLI, such as support for views, more data types and DDL statements, partitioned reads and writes, and INSERT OVERWRITE. It now exposes more of the TableEnvironment API and is more convenient to use.

Next, the author explains Flink SQL CLI in detail.

0. Help

Log in to the Flink SQL client by executing the following command:

$ bin/sql-client.sh embedded -d conf/sql-client-hive.yaml
Flink SQL>

Execute HELP to view the commands supported by Flink SQL, as follows:

  • CREATE TABLE
  • DROP TABLE
  • CREATE VIEW
  • DESCRIBE
  • DROP VIEW
  • EXPLAIN
  • INSERT INTO
  • INSERT OVERWRITE
  • SELECT
  • SHOW FUNCTIONS
  • USE CATALOG
  • SHOW TABLES
  • SHOW DATABASES
  • SOURCE
  • USE
  • SHOW CATALOGS

1. Hive operation

■ 1.1 creating tables and importing data

To make the experiment easy to reproduce, the author uses the SSB dbgen tool to generate test data; readers can also experiment with existing data in their own test environment.

For details on how to create the tables and load data in Hive, please refer to the author's earlier project: https://github.com/MLikeWater/ssb-kylin.
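For reference, a minimal sketch of the Hive DDL for one of these tables — the customer table, with columns matching the SSB schema used in this article. The delimiter is an assumption based on dbgen's pipe-separated output, so check the repository's scripts for the authoritative version:

```sql
-- Sketch: SSB customer table in Hive (see ssb-kylin for the real DDL)
CREATE TABLE IF NOT EXISTS customer (
  c_custkey     int,
  c_name        string,
  c_address     string,
  c_city        string,
  c_nation      string,
  c_region      string,
  c_phone       string,
  c_mktsegment  string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
STORED AS TEXTFILE;
```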

■ 1.2 Hive table

View the Hive tables created in the previous step:

0: jdbc:hive2://xx.xxx.xxx.xxx:10000> show tables;
+--------------+--+
|   tab_name   |
+--------------+--+
| customer     |
| dates        |
| lineorder    |
| p_lineorder  |
| part         |
| supplier     |
+--------------+--+

Readers can run queries in Hive and compare the results with those of Flink SQL.

2. Flink operation

■ 2.1 accessing Hive database through HiveCatalog

Log in to Flink SQL CLI and query catalogs:

$ bin/sql-client.sh embedded -d conf/sql-client-hive.yaml
Flink SQL> show catalogs;
default_catalog
staginghive

Flink SQL> use catalog staginghive;

show catalogs lists all configured catalogs. Because the author set staginghive as the default catalog in the sql-client-hive.yaml file, the CLI starts in it. To switch to another catalog, use USE CATALOG xxx.

■ 2.2 querying Hive metadata

Query the Hive database and table through Flink SQL:

# Query databases
Flink SQL> show databases;
...
ssb
tmp
...
Flink SQL> use ssb;

# Query table
Flink SQL> show tables;
customer
dates
lineorder
p_lineorder
part
supplier

# Query table structure
Flink SQL> DESCRIBE customer;
root
|-- c_custkey: INT
|-- c_name: STRING
|-- c_address: STRING
|-- c_city: STRING
|-- c_nation: STRING
|-- c_region: STRING
|-- c_phone: STRING
|-- c_mktsegment: STRING

Note that Hive's metadata appears in lower case in the Flink catalog.

■ 2.3 query

Next, run some SQL statements in the Flink SQL CLI. For the complete SQL, refer to the README of https://github.com/MLikeWater/ssb-kylin.

Currently, Flink SQL hits some bugs when parsing the metadata of a Hive view, for example when executing the Q1.1 SQL:

Flink SQL> select sum(v_revenue) as revenue
> from p_lineorder
> left join dates on lo_orderdate = d_datekey
> where d_year = 1993
> and lo_discount between 1 and 3
> and lo_quantity < 25;

[ERROR] Could not execute SQL statement. Reason:
org.apache.calcite.sql.validate.SqlValidatorException: Table 'lineorder' not found; did you mean 'LINEORDER'?

Flink SQL cannot find the underlying table of the view.

The p_lineorder table is a view in Hive, created with the following statement:

CREATE VIEW P_LINEORDER AS
SELECT LO_ORDERKEY,
LO_LINENUMBER,
LO_CUSTKEY,
LO_PARTKEY,
LO_SUPPKEY,
LO_ORDERDATE,
LO_ORDERPRIOTITY,
LO_SHIPPRIOTITY,
LO_QUANTITY,
LO_EXTENDEDPRICE,
LO_ORDTOTALPRICE,
LO_DISCOUNT,
LO_REVENUE,
LO_SUPPLYCOST,
LO_TAX,
LO_COMMITDATE,
LO_SHIPMODE,
LO_EXTENDEDPRICE*LO_DISCOUNT AS V_REVENUE
FROM ssb.LINEORDER;

But Flink SQL does not handle the metadata of Hive view definitions very well. To make the SQL run smoothly, the author drops and recreates this view in Hive, this time in lower case:

0: jdbc:hive2://xx.xxx.xxx.xxx:10000> create view p_lineorder as
select lo_orderkey,
lo_linenumber,
lo_custkey,
lo_partkey,
lo_suppkey,
lo_orderdate,
lo_orderpriotity,
lo_shippriotity,
lo_quantity,
lo_extendedprice,
lo_ordtotalprice,
lo_discount,
lo_revenue,
lo_supplycost,
lo_tax,
lo_commitdate,
lo_shipmode,
lo_extendedprice*lo_discount as v_revenue
from ssb.lineorder;

Then continue to query Q1.1 SQL in Flink SQL CLI:

Flink SQL> select sum(v_revenue) as revenue
> from p_lineorder
> left join dates on lo_orderdate = d_datekey
> where d_year = 1993
> and lo_discount between 1 and 3
> and lo_quantity < 25;

revenue
894280292647

Continue to query Q2.1 SQL:

Flink SQL> select sum(lo_revenue) as lo_revenue, d_year, p_brand
> from p_lineorder
> left join dates on lo_orderdate = d_datekey
> left join part on lo_partkey = p_partkey
> left join supplier on lo_suppkey = s_suppkey
> where p_category = 'MFGR#12' and s_region = 'AMERICA'
> group by d_year, p_brand
> order by d_year, p_brand;

lo_revenue  d_year p_brand
819634128   1998   MFGR#1206
877651232   1998   MFGR#1207
754489428   1998   MFGR#1208
816369488   1998   MFGR#1209
668482306   1998   MFGR#1210
660366608   1998   MFGR#1211
862902570   1998   MFGR#1212
...

Finally, query a Q4.3 SQL:

Flink SQL> select d_year, s_city, p_brand, sum(lo_revenue) - sum(lo_supplycost) as profit
> from p_lineorder
> left join dates on lo_orderdate = d_datekey
> left join customer on lo_custkey = c_custkey
> left join supplier on lo_suppkey = s_suppkey
> left join part on lo_partkey = p_partkey
> where c_region = 'AMERICA'and s_nation = 'UNITED STATES'
> and (d_year = 1997 or d_year = 1998)
> and p_category = 'MFGR#14'
> group by d_year, s_city, p_brand
> order by d_year, s_city, p_brand;

d_year  s_city       p_brand       profit
1998    UNITED ST9   MFGR#1440     6665681

If readers are interested, they can run the remaining SQL statements and compare the results with Spark SQL. In addition, Flink SQL supports EXPLAIN to inspect the execution plan of a query.
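For example, the plan for Q1.1 can be inspected by prefixing the query with EXPLAIN (the plan text itself depends on the planner and version, so none is reproduced here):

```sql
-- Show the execution plan instead of running the query
EXPLAIN
select sum(v_revenue) as revenue
from p_lineorder
left join dates on lo_orderdate = d_datekey
where d_year = 1993
and lo_discount between 1 and 3
and lo_quantity < 25;
```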

■ 2.4 creating views

Similarly, you can create and delete views in Flink SQL CLI as follows:

Flink SQL> create view p_lineorder2 as
> select lo_orderkey,
> lo_linenumber,
> lo_custkey,
> lo_partkey,
> lo_suppkey,
> lo_orderdate,
> lo_orderpriotity,
> lo_shippriotity,
> lo_quantity,
> lo_extendedprice,
> lo_ordtotalprice,
> lo_discount,
> lo_revenue,
> lo_supplycost,
> lo_tax,
> lo_commitdate,
> lo_shipmode,
> lo_extendedprice * lo_discount as v_revenue
> from ssb.lineorder;
[INFO] View has been created.

One thing to emphasize here: currently Flink cannot drop a view that lives in Hive:

Flink SQL> drop view p_lineorder;
[ERROR] Could not execute SQL statement. Reason:
The given view does not exist in the current CLI session. Only views created with a CREATE VIEW statement can be accessed.
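Until that is supported, a view created through Flink into the Hive catalog, such as p_lineorder2 above, has to be dropped from the Hive side instead, e.g. via beeline:

```sql
-- Run in Hive (beeline), not in the Flink SQL CLI
DROP VIEW IF EXISTS ssb.p_lineorder2;
```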

■ 2.5 partition operations

Create a partitioned table in the Hive database:

CREATE TABLE IF NOT EXISTS flink_partition_test (
  id  int,
  name string
) PARTITIONED BY (day string, type string)
stored as textfile;

Next, insert and query data through Flink SQL:

# Insert data for static partition
Flink SQL> INSERT INTO flink_partition_test PARTITION (type='Flink', `day`='2020-02-01') SELECT 100001, 'Flink001';

# query
Flink SQL> select * from flink_partition_test;

id       name       day            type
100001   Flink001   2020-02-01     Flink


# Insert dynamic partition
Flink SQL> INSERT INTO flink_partition_test SELECT 100002, 'Spark', '2020-02-02', 'SparkSQL';


# query
Flink SQL> select * from flink_partition_test;

id        name          day           type
100002    Spark         2020-02-02    SparkSQL
100001    Flink001      2020-02-01    Flink


# The combination of dynamic and static partitions is similar, no more demonstration
# Overwrite insert data
Flink SQL> INSERT OVERWRITE flink_partition_test PARTITION (type='Flink') SELECT 100002, 'Spark', '2020-02-08', 'SparkSQL-2.4';

id      name        day             type
100002  Spark       2020-02-02      SparkSQL
100001  Flink001    2020-02-01      Flink

The field day is a reserved keyword in Flink SQL and must be escaped with backticks.
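Concretely, any reference to the day column must be back-quoted, as in this query sketch:

```sql
-- `day` must be escaped; an unquoted day fails to parse in Flink SQL
SELECT id, name, `day`, type
FROM flink_partition_test
WHERE `day` = '2020-02-01';
```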

■ 2.6 other functions

  • 2.6.1 functions

Flink SQL supports built-in and user-defined functions. Built-in functions can be listed with SHOW FUNCTIONS; how to create user-defined functions will be introduced separately later.

  • 2.6.2 setting parameters

Flink SQL supports setting environment parameters. Use the SET command to view and change them:

Flink SQL> set;
deployment.gateway-address=
deployment.gateway-port=0
deployment.m=yarn-cluster
deployment.response-timeout=5000
deployment.yjm=1024
deployment.yn=2
deployment.ys=5
deployment.ytm=2048
execution.current-catalog=staginghive
execution.current-database=ssb
execution.max-idle-state-retention=0
execution.max-parallelism=128
execution.max-table-result-rows=1000000
execution.min-idle-state-retention=0
execution.parallelism=1
execution.periodic-watermarks-interval=200
execution.planner=blink
execution.restart-strategy.type=fallback
execution.result-mode=table
execution.time-characteristic=event-time
execution.type=batch

Flink SQL> set deployment.yjm = 2048;

Summary

In this article, the author has walked through operating a Hive database with Flink SQL in detail, along with some of the features Flink SQL provides.

Of course, Flink SQL still has some limitations when operating Hive databases:

  • Currently, only the TextFile storage format is supported, and no other storage format can be specified
    Only Hive tables stored as TextFile, with the row format serde org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, are supported. Although RCFile, ORC, Parquet, Sequence and other storage formats are implemented, the storage format of a Hive table cannot be recognized automatically; to use the other formats you would need to modify the source code and recompile. However, the community has been testing these formats, and they should be usable in Flink SQL in the near future.
  • OpenCSVSerde support is incomplete
    If a TextFile table uses org.apache.hadoop.hive.serde2.OpenCSVSerde as its row format serde, field types are not recognized correctly: all fields of the Hive table are mapped to the String type.
  • Bucket table is not supported temporarily
  • ACID table is not supported temporarily
  • Flink SQL still offers relatively few optimization features
  • Permission control
    Similar to Spark SQL in this respect: access is currently controlled via HDFS ACLs, and Sentry- or Ranger-based permission control is not implemented yet. Cloudera is, however, developing a Ranger-based strategy for shared access permissions between Spark SQL and Hive, to achieve row/column-level control and audit information.

The Flink community is developing rapidly; all of these problems are temporary and will be solved one by one as new versions are released.

Tags: Mobile SQL hive Hadoop Apache

Posted on Mon, 25 May 2020 03:57:37 -0400 by soccerstar_23