Big Data on the Cloud, Lesson 1: A Guide to Hidden Pitfalls in MaxCompute Authorization and External Table Operations

1, Sub-account creation and AK information binding
If this is your first time logging in to the DataPlus platform and using DataWorks with a sub-account, confirm the following information first:
• The business alias of the primary account to which the sub-account belongs.
• The user name and password of the sub-account.
• The AccessKey ID and AccessKey Secret of the sub-account.
• Confirm that the primary account has allowed the sub-account to log in to the console.
• Confirm that the primary account has allowed the sub-account to manage its AccessKeys independently.
1. Sub-account creation
The overall process is: (1) create a sub-account, (2) bind its AK information, (3) grant it DataWorks roles.
To create the sub-account:
(1) Log in to the RAM console with the Alibaba Cloud account (primary account).
(2) In the left navigation bar, under the People Management menu, click Users.
(3) Click Create User.
(4) Enter the login name and display name.
(5) In the Access Mode area, select Console Password Login.
(6) Click OK.

Note:
(1) Click Add User to create multiple RAM users at once.
(2) After the RAM users are created, be sure to save the user names and login passwords and pass them to the sub-account users.

2. Create an AccessKey for the RAM sub-account
The AccessKey is essential for the tasks developers create in DataWorks to run successfully. Unlike the user name and password used to log in, it is mainly used for mutual authentication between Alibaba Cloud products, so the primary account needs to create an AccessKey for the sub-account. After it is created, keep the AccessKey ID and AccessKey Secret as secure as possible and do not disclose them to anyone. If there is any risk of leakage, disable and rotate them promptly. An AK consists of two parts: the AccessKey ID and the AccessKey Secret. If the cloud account allows RAM users to manage AccessKeys independently, RAM users can also create their own AccessKeys.
To create an AccessKey for a sub-account:
(1) In the left navigation bar, under the People Management menu, click Users.
(2) In the user login name / display name list, click the target RAM user name.
(3) In the User AccessKey area, click Create AccessKey.
(4) Click OK.

Note:
A mobile phone verification code must be entered the first time you create one.
The AccessKey Secret is displayed only at creation time and cannot be queried afterwards, so keep it safe. If an AccessKey is leaked or lost, create a new one. Each user can have at most two AccessKeys.

3. Authorize the RAM sub-account
To allow a sub-account to create DataWorks workspaces, grant the AliyunDataWorksFullAccess policy to it:
(1) In the left navigation bar, under the People Management menu, click Users.
(2) In the user login name / display name list, locate the target RAM user.
(3) Click Add Permissions; the authorized principal is filled in automatically.
(4) In the policy name list on the left, click the policies you want to grant to the RAM user.
(5) Click OK.
(6) Click Finish.

Note: in the selection box on the right, you can click × next to a selected policy to remove it.

2, Granting the sub-account production-environment permissions: function creation, resource access, and OSS external table authorization
1. Function creation and resource access authorization in the production environment
After the sub-account logs in to the DataWorks console, open Workspace Management; under Member Management, assign the sub-account an appropriate role. The permissions of each role can be viewed in the permission list of the Workspace Management interface. The member roles added here are isolated from the production environment, so function creation and resource access in the production environment must be authorized separately, as follows.
(1) Create a new role and authorize the role.

create role worker;
-- Assign the role (RAM$Jianwei_MaxCompute:banzhan is the RAM sub-account):
grant worker TO RAM$Jianwei_MaxCompute:banzhan;
-- Authorize the role on the project:
grant CreateInstance, CreateResource, CreateFunction, CreateTable, List ON PROJECT wei_wwww TO ROLE worker;

(2) Create UDF functions.

CREATE FUNCTION banzha_udf as 'com.aliyun.udf.test.UDF_DEMO' using '1218.jar';

The prerequisite is that the 1218.jar package has already been uploaded; see the resource-upload walkthrough in the first lesson's video.
2. OSS access authorization
For MaxCompute to access OSS data directly, the account MaxCompute uses must first be granted the relevant OSS data permissions. Without this authorization, an error is reported when the external table is created.

At this point we need to authorize access to OSS.
There are two ways to authorize:
(1) When MaxCompute and OSS have the same Owner account, you can log in with the Alibaba Cloud account and complete one-click authorization. Alternatively, in RAM access control, attach the AliyunOSSFullAccess policy (manage Object Storage Service permissions) to the sub-account.
(2) Custom authorization
a. Add a RAM role, for example oss-admin.
b. Set the role's policy content.

--When the Owner of MaxCompute and OSS is the same account, the settings are as follows.
{
  "Statement": [
    {
      "Action": "sts:AssumeRole",
      "Effect": "Allow",
      "Principal": {
        "Service": [
          "odps.aliyuncs.com"
        ]
      }
    }
  ],
  "Version": "1"
}


--When the Owner of MaxCompute and OSS are different accounts, the settings are as follows.
{
  "Statement": [
    {
      "Action": "sts:AssumeRole",
      "Effect": "Allow",
      "Principal": {
        "Service": [
          "<MaxCompute Owner's cloud account ID>@odps.aliyuncs.com"
        ]
      }
    }
  ],
  "Version": "1"
}

c. Grant the role the permissions needed to access OSS (AliyunODPSRolePolicy):

{
  "Version": "1",
  "Statement": [
    {
      "Action": [
        "oss:ListBuckets",
        "oss:GetObject",
        "oss:ListObjects",
        "oss:PutObject",
        "oss:DeleteObject",
        "oss:AbortMultipartUpload",
        "oss:ListParts"
      ],
      "Resource": "*",
      "Effect": "Allow"
    }
  ]
}
--Other permissions can be customized as needed.

d. Attach the AliyunODPSRolePolicy permission to the role.
3, OSS external table creation guide
1. External table creation syntax
(1) Example of creating an external table:

CREATE EXTERNAL TABLE IF NOT EXISTS fc_rcfile
(   `id` int,  
    `name` string
)
PARTITIONED BY ( `time_ds` string)
ROW FORMAT SERDE  
'org.apache.hadoop.hive.serde2.columnar.LazyBinaryColumnarSerDe'  
STORED AS RCFILE
LOCATION 'oss://oss-cn-beijing-internal.aliyuncs.com/oss-huabei2/jianwei/fc_rcfile/';

(2) LOCATION description
LOCATION must specify an OSS directory; by default the system reads all files in that directory.
Use the intranet domain name provided by OSS; otherwise OSS traffic fees are incurred.
When accessing OSS external tables, using the public (external) endpoint is currently not supported.
At present a single STORED AS file cannot exceed 3 GB; split larger files.
Store the OSS data in the same region where MaxCompute is enabled. Since MaxCompute is deployed only in some regions, cross-region data connectivity is not guaranteed.

The OSS connection format is oss://oss-cn-shanghai-internal.aliyuncs.com/<bucket name>/<directory name>/. Do not append a file name after the directory. The following are incorrect usages:
http://oss-odps-test.oss-cn-shanghai-internal.aliyuncs.com/Demo/ -- http connections are not supported.
https://oss-odps-test.oss-cn-shanghai-internal.aliyuncs.com/Demo/ -- https connections are not supported.
oss://oss-odps-test.oss-cn-shanghai-internal.aliyuncs.com/Demo -- wrong connection address.
oss://oss-cn-shanghai-internal.aliyuncs.com/oss-odps-test/Demo/vehicle.csv -- do not specify a file name.
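These format rules can be checked mechanically before running the DDL. Below is a minimal Python sketch of a hypothetical validator (not part of MaxCompute; the function name and messages are illustrative) that flags the incorrect usages listed above:

```python
def check_oss_location(location: str) -> list:
    """Return a list of problems with a MaxCompute external-table LOCATION."""
    problems = []
    if location.startswith(("http://", "https://")):
        problems.append("http/https connections are not supported; use oss://")
    elif not location.startswith("oss://"):
        problems.append("LOCATION must start with oss://")
    if "-internal.aliyuncs.com" not in location:
        problems.append("use the intranet endpoint to avoid OSS traffic fees")
    if not location.endswith("/"):
        problems.append("LOCATION must point to a directory (end with '/'), not a file")
    return problems

# A correct LOCATION yields no problems:
print(check_oss_location(
    "oss://oss-cn-shanghai-internal.aliyuncs.com/oss-odps-test/Demo/"))  # []
# A path ending in a file name is rejected:
print(check_oss_location(
    "oss://oss-cn-shanghai-internal.aliyuncs.com/oss-odps-test/Demo/vehicle.csv"))
```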

(3) Notes on the creation syntax
The syntax is quite similar to Hive's, but note the following:

a. The STORED AS keyword: this syntax uses STORED AS rather than the STORED BY keyword used when creating ordinary unstructured external tables; STORED AS is currently specific to reading open-source-compatible data. It is followed by a file format name, such as ORC, PARQUET, RCFILE, SEQUENCEFILE, or TEXTFILE.
b. The external table's column schema must match the schema of the data actually stored on OSS.
c. ROW FORMAT SERDE: not required; it is needed only for some special formats, such as TEXTFILE.
d. WITH SERDEPROPERTIES: when OSS permissions are associated through STS-mode authorization, this parameter is required to specify the odps.properties.rolearn property, whose value is the ARN of the RAM role used. The file format is specified with STORED AS <file format>, together with the corresponding <serde class>.

--Take the ORC file format as an example:
CREATE EXTERNAL TABLE [IF NOT EXISTS] <external_table>
(<column schemas>)
[PARTITIONED BY (partition column schemas)]
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
WITH SERDEPROPERTIES ('odps.properties.rolearn'='${rolearn}')
STORED AS ORC
LOCATION 'oss://${endpoint}/${bucket}/${userfilePath}/';

e. The serde class corresponding to each file format is as follows:
• ALIORC: com.aliyun.apsara.serde.AliOrcSerDe
• SEQUENCEFILE: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
• TEXTFILE: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
• RCFILE: org.apache.hadoop.hive.serde2.columnar.LazyBinaryColumnarSerDe
• ORC: org.apache.hadoop.hive.ql.io.orc.OrcSerde
• ORCFILE: org.apache.hadoop.hive.ql.io.orc.OrcSerde
• PARQUET: org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
• AVRO: org.apache.hadoop.hive.serde2.avro.AvroSerDe

(4) Examples using ARN and AK authentication
a. Create an external table using the ARN of a RAM role
When OSS permissions are associated through STS-mode authorization, the odps.properties.rolearn property must be specified; its value is the ARN of the specific RAM role used.

WITH SERDEPROPERTIES ('odps.properties.rolearn'='${rolearn}')

An example is as follows:

CREATE EXTERNAL TABLE IF NOT EXISTS fc_csv
(
vehicleId string,
recordId string,
patientId string,
calls string,
locationLatitute string,
locationLongtitue string,
recordTime string,
direction string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES
    ('separatorChar'=',',
    'odps.properties.rolearn'='acs:ram::1928466352305391:role/oss-admin'
    )
STORED AS TEXTFILE
LOCATION 'oss://oss-cn-beijing-internal.aliyuncs.com/oss-odps-bucket/extra_test/fc_csv';

b. Create an external table with a plaintext AK (not recommended)
If you do not use STS-mode authorization, the odps.properties.rolearn property is not needed; instead, pass the plaintext AccessKey ID and AccessKey Secret directly in LOCATION.
When LOCATION carries a plaintext AK, the format is as follows.


LOCATION 'oss://${accessKeyId}:${accessKeySecret}@${endpoint}/${bucket}/${userPath}/'

An example is as follows:

CREATE EXTERNAL TABLE IF NOT EXISTS fc_csv1
(
vehicleId string,
recordId string,
patientId string,
calls string,
locationLatitute string,
locationLongtitue string,
recordTime string,
direction string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES
    ('separatorChar'=',',
    'odps.properties.rolearn'='acs:ram::1928466352305391:role/oss-admin'
    )
STORED AS TEXTFILE
-- LOCATION 'oss://oss-cn-beijing-internal.aliyuncs.com/oss-odps-bucket/extra_test/fc_csv';
LOCATION 'oss://LTAI4FfgVEQQwsNQ*******:J8FGZaoj2CMcunFrVn1FrL*****wM@oss-cn-beijing-internal.aliyuncs.com/oss-odps-bucket/extra_test/fc_csv';

2. Create an external table of type RCFile
(1) Query the Hive table schema

show create table fc_rcfile;

The results are as follows:

CREATE TABLE `fc_rcfile`(
`id` int,  
`name` string)
PARTITIONED BY (  
`time_ds` string)
ROW FORMAT SERDE  
'org.apache.hadoop.hive.serde2.columnar.LazyBinaryColumnarSerDe'  
STORED AS INPUTFORMAT  
'org.apache.hadoop.hive.ql.io.RCFileInputFormat'  
OUTPUTFORMAT  
'org.apache.hadoop.hive.ql.io.RCFileOutputFormat'
LOCATION
'hdfs://emr-header-1.cluster-138804:9000/user/hive/warehouse/extra_demo.db/fc_rcfile'

(2) Create an external table in MaxCompute

CREATE EXTERNAL TABLE IF NOT EXISTS fc_rcfile
(  `id` int,  
`name` string)
PARTITIONED BY ( `time_ds` string)
ROW FORMAT SERDE  
'org.apache.hadoop.hive.serde2.columnar.LazyBinaryColumnarSerDe'  
STORED AS RCFILE
LOCATION 'oss://oss-cn-beijing-internal.aliyuncs.com/oss-huabei2/jianwei/fc_rcfile/'; 

(3) Add partition

alter table fc_rcfile ADD PARTITION (time_ds = '20191209');
alter table fc_rcfile ADD PARTITION (time_ds = '20191210');
alter table fc_rcfile ADD PARTITION (time_ds = '20191211');

Partitions can also be created in batch with the MMA migration tool.
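When many daily partitions are needed, the ALTER statements above can be generated rather than typed one by one. Below is a minimal Python sketch (a hypothetical helper; the MMA tool mentioned above remains the supported route for bulk migration):

```python
from datetime import date, timedelta

def add_partition_ddl(table: str, start: date, days: int) -> list:
    """Generate one ALTER TABLE ... ADD PARTITION statement per day."""
    return [
        f"alter table {table} ADD PARTITION "
        f"(time_ds = '{start + timedelta(days=i):%Y%m%d}');"
        for i in range(days)
    ]

for stmt in add_partition_ddl("fc_rcfile", date(2019, 12, 9), 3):
    print(stmt)
```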

(4) Query data

select * from fc_rcfile where time_ds = '20191209' ;
select * from fc_rcfile where time_ds = '20191210' ;
select * from fc_rcfile where time_ds = '20191211' ;

3. Create an external table of type JSON
(1) Create the JSON external table

CREATE EXTERNAL TABLE `student`(
`student` map<string,string> COMMENT 'from deserializer',  
`class` map<string,string> COMMENT 'from deserializer',  
`teacher` map<string,string> COMMENT 'from deserializer')
COMMENT 'Student course information'
ROW FORMAT SERDE  
'org.apache.hive.hcatalog.data.JsonSerDe'  
STORED AS TEXTFILE
LOCATION
'oss://oss-cn-beijing-internal.aliyuncs.com/oss-huabei2/jianwei/student';

(2) Upload the JSON file data to the corresponding bucket in the OSS console.
(3) Query data from the external table.
An error is reported:

Solution: turn on the Hive-compatible flag.

set odps.sql.hive.compatible=true;

Query again, and the JSON data is returned correctly.

Querying the same data in Hive shows that the two results are consistent.
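Each line of the OSS file must be one JSON object whose keys match the table's columns, each mapping to a map<string,string>. Below is a small Python sketch with made-up sample values (the three field names come from the schema above; the values are purely illustrative):

```python
import json

# One line of the OSS text file: a JSON object whose three keys match the
# external table's columns (student, class, teacher), each a map<string,string>.
line = json.dumps({
    "student": {"name": "zhangsan", "id": "001"},
    "class":   {"name": "math",     "room": "A1"},
    "teacher": {"name": "lisi",     "id": "t01"},
})

record = json.loads(line)
print(sorted(record.keys()))       # ['class', 'student', 'teacher']
print(record["student"]["name"])   # zhangsan
```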

4. Create an external table in CSV format
(1) Create an external table in CSV format

An example creation statement is as follows:
CREATE EXTERNAL TABLE IF NOT EXISTS fc_csv
(
vehicleId string,
recordId string,
patientId string,
calls string,
locationLatitute string,
locationLongtitue string,
recordTime string,
direction string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES
    ('separatorChar'=','
    )
STORED AS TEXTFILE
LOCATION 'oss://oss-cn-beijing-internal.aliyuncs.com/oss-huabei2/jianwei/fc_csv';

(2) Query data

set odps.sql.hive.compatible=true;
select *  from fc_csv;

If you do not set the Hive-compatible flag, the following error is reported:

FAILED: ODPS-0123131:User defined function exception - internal error - Fatal Error Happended



5. Create an external table in compressed format
When creating the external table, the column separator must be specified with field.delim; using delimiter instead results in an error, or the data is not split as expected. Both ways of creating the table are shown below.
The property flags described below must be set.

(1) Create external table

a. The column separator is defined as delimiter

drop TABLE if exists oss_gzip;
CREATE EXTERNAL TABLE IF NOT EXISTS  `oss_gzip` (
`userid` string,
`job` string,
`education` string,
`region` string
) 
PARTITIONED BY (dt STRING COMMENT 'date')
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES ( 
'delimiter'='\t',
'odps.text.option.gzip.input.enabled'='true', 
'odps.text.option.gzip.output.enabled'='true'
)
STORED AS TEXTFILE
LOCATION 'oss://oss-cn-beijing-internal.aliyuncs.com/oss-huabei2/jianwei/gzipfile/';

When querying the data, you will find that it is not split by our separator:

b. The column separator is defined as field.delim

drop TABLE if exists oss_gzip2;
CREATE EXTERNAL TABLE IF NOT EXISTS  `oss_gzip2` (
`userid` string,
`job` string,
`education` string,
`region` string
) 
PARTITIONED BY (dt STRING COMMENT 'date')
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES ( 
'field.delim'='\t',
'odps.text.option.gzip.input.enabled'='true', 
'odps.text.option.gzip.output.enabled'='true'
)
STORED AS TEXTFILE
LOCATION 'oss://oss-cn-beijing-internal.aliyuncs.com/oss-huabei2/jianwei/gzipfile/';


Note: use field.delim to define the column separator; delimiter cannot be used.

6. Create an external table with new data types
When the external table's fields involve new data types, you must turn on the new-type flag.

set odps.sql.type.system.odps2=true;

Otherwise, the following error is reported:

set odps.sql.type.system.odps2=true;
drop TABLE if exists oss_gzip3;
CREATE EXTERNAL TABLE IF NOT EXISTS  `oss_gzip3` (
`userid` FLOAT ,
`job` string,
`education` string,
`region` VARCHAR(20)
) 
PARTITIONED BY (dt STRING COMMENT 'date')
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES ( 
'field.delim'='\t',
'odps.text.option.gzip.input.enabled'='true', 
'odps.text.option.gzip.output.enabled'='true'
)
STORED AS TEXTFILE
LOCATION 'oss://oss-cn-beijing-internal.aliyuncs.com/oss-huabei2/jianwei/flag_file/';

--Add the corresponding partition:
alter table oss_gzip3 add if NOT EXISTS partition(dt='20191224');

--Query data:
select * from oss_gzip3 where dt='20191224';

4, Using Information Schema metadata to view project and table operations and calculate costs
1. Install the package with the primary account
Before use, the Project Owner must install the Information Schema permission package to gain access to the project's metadata.
The following error indicates that the Information Schema permission package has not been installed and the sub-account lacks the relevant permissions:

There are two ways to install Information Schema permission package:
(1) Execute the following command in the MaxCompute command line tool (odpscmd).

odps@myproject1>install package information_schema.systables;

(2) Execute the following statement in Data Development > temporary query in DataWorks.

install package information_schema.systables;

2. Authorize sub account

grant read on package information_schema.systables to role worker;

3. Query metadata information

select * from information_schema.tasks_history limit 20;


The field columns of tasks_history are as follows:


4. Calculate SQL cost through tasks_history
Pay-as-you-go SQL tasks: for every SQL job you execute, MaxCompute charges based on the job's input data volume and the SQL complexity. The fee is generated after the SQL finishes executing and is settled in a single bill the next day.

The billing formula for the developer edition is:
Cost of one SQL job = input data volume * unit price (0.15 CNY/GB)

The billing formula for the standard (pay-as-you-go) edition is:
Cost of one SQL job = input data volume * SQL complexity * unit price (0.3 CNY/GB)

Input data volume: the amount of data actually scanned by the SQL statement. Most SQL statements benefit from partition filtering and column pruning, so this value is usually far smaller than the size of the source table.
In information_schema.tasks_history, the input_bytes field is the amount of data actually scanned (i.e. the input data volume) and the complexity field is the SQL complexity, so the cost of a job can be computed from these two fields and the unit price.
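Putting the formula and the two tasks_history fields together, the per-job cost can be estimated as below. This is a minimal Python sketch assuming the standard-edition unit price of 0.3 CNY/GB; input_bytes and complexity are the tasks_history column values:

```python
def sql_cost_cny(input_bytes: int, complexity: float,
                 unit_price: float = 0.3) -> float:
    """Estimate one SQL job's cost from the tasks_history fields
    input_bytes (bytes actually scanned) and complexity."""
    gb = input_bytes / 1024 ** 3   # bytes -> GB
    return gb * complexity * unit_price

# A job scanning 10 GB at complexity 1.0 costs 10 * 1.0 * 0.3 = 3.0 CNY:
print(sql_cost_cny(10 * 1024 ** 3, 1.0))  # 3.0
```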



This article is Alibaba cloud content and cannot be reproduced without permission.


Posted on Mon, 13 Jan 2020 02:57:38 -0500 by majik_sheff