Use of hide function (NVL / case when then else end / concat / concat_ws / expand)

System built-in function

1) View the functions provided by the system

desc functions;

2) Display the usage of the built-in function: it shows the meaning of nvl function

desc function nvl;

3) Displays the usage of the built-in function in detail
Show the meaning of nvl function and give examples

desc function extended nvl;

Common built-in functions

Empty field assignment

1) Function description
NVL: assign a value to the data whose value is NULL. Its format is NVL (value, default_value). Its function is to return default if value is NULL_ Value, otherwise return the value of value. If both parameters are NULL, return NULL.
2) Data preparation: use employee table
3) Query: if the comm of the employee is NULL, use - 1 instead

4) Data preparation:
Employee table:

7369	SMITH	CLERK	7902	1980-12-17 800.00	20
7499	ALLEN	SALESMAN	7698	1981-2-20	1600.00	300.00 30
7521	WARD	SALESMAN	7698	1981-2-22	1250.00	500.00 30
7566	JONES	MANAGER 7839	1981-4-2	2975.00	20
7654	MARTIN	SALESMAN	7698	1981-9-28	1250.00	1400.00 30
7698	BLAKE	MANAGER 7839	1981-5-1	2850.00	30
7782	CLARK	MANAGER 7839	1981-6-9	2450.00	10
7788	SCOTT	ANALYST 7566	1987-4-19	3000.00	20
7839	KING	PRESIDENT	1981-11-17 5000.00	10
7844	TURNER	SALESMAN	7698	1981-9-8	1500.00	0.00	30
7876	ADAMS	CLERK	7788	1987-5-23	1100.00	20
7900	JAMES	CLERK	7698	1981-12-3	950.00	30
7902	FORD	ANALYST 7566	1981-12-3	3000.00	20
7934	MILLER	CLERK	7782	1982-1-23	1300.00	10

Create emp table:

create table if not exists emp( empno int,
ename string, job string, mgr int,
hiredate string, sal double,
comm double, deptno int)
row format delimited fields terminated by '\t';

Query results:

hive (default)> select comm,nvl(comm,-1) from emp;
comm	_c1
NULL	-1.0
300.0	300.0
500.0	500.0
20.0	20.0
1400.0	1400.0
30.0	30.0
NULL	-1.0
NULL	-1.0
NULL	-1.0
0.0		0.0
20.0	20.0
30.0	30.0
NULL	-1.0
10.0	10.0

If comm is null, use - 1 instead.


(1) Data preparation

name of a fictitious monkey with supernatural powers	A	male
 sea	A	male
 Song song	B	male
 Miss Luo Yu feng	A	female
 Sister Ting	B	female
 Tingting	B	female

(2) Demand
Find out the number of men and women in different departments. The results are as follows:

dept_Id	        male	    female
A	            2	    1  
B	            1	    2

Count the number of men and women in department A and department B. at this time, use case when then else end

create table emp_sex(
    name string,
    dept_id string,
    sex string
row format delimited fields terminated by '\t';

Hive SQL statement is as follows:
case is followed by the field name and when by the value. If it is the value, take the value 1 after then. If it is not, take 0. Finally, add up all 1 to sum

    sum(case sex when 'male' then 1 else 0 end) sum_male,
    sum(case sex when 'female' then 1 else 0 end) sum_female
group by  dept_id;

The result is:

dept_id	sum_male	sum_female
A	2	1
B	1	2
Time taken: 92.571 seconds, Fetched: 2 row(s)

In addition to the above functions, you can also use the if function to calculate.
First, if(exp1,exp2,exp3). If exp1 is true, it returns exp2; otherwise, it returns exp3

    sum(if(sex='male',1,0)) maleCount,
    sum(if(sex='female',1,0)) femaleCount
group by  dept_id;

The same result can be obtained.

Row to column

Description of correlation function

CONCAT(string A/col, string B/col...): returns the result of connecting input strings. Any input string is supported;
CONCAT_WS(separator, str1, str2,...): it is a special form of CONCAT(). The separator between the first parameter and the remaining parameters. The delimiter can be the same string as the remaining parameters. If the delimiter is NULL, the return value will also be NULL. This function skips any NULL and empty strings after the delimiter parameter. The separator will be added between the connected strings;
Note: CONCAT_WS must be "string or array
COLLECT_SET(col): the function only accepts basic data types. Its main function is to de summarize the values of a field and generate an Array type field.


hive (default)> select concat('a','-','b','-','c');

When we all use - splicing, CONCAT_WS can be used to omit writing

hive (default)> select concat_ws('-','a','b','c');


Splices with multiple written characters are omitted:
You can also splice arrays:

hive (default)> select concat_ws('.','www',array('baidu','com'));

Next, we need to deepen the application of knowledge,
Edit data: vim person_info.txt:

Sun WuKong	Aries	A
 sea	sagittarius	A
 Song song	Aries	B
 Zhu Bajie	Aries	A
 Miss Luo Yu feng	sagittarius	A
 Aoi Sora	Aries	B

Implement data in the following format:

sagittarius,A   	sea|Miss Luo Yu feng
 Aries,A	    Sun WuKong|Zhu Bajie
 Aries,B  	Song song|Aoi Sora

Create table person_info:

create table person_info(
name string,
constellation string,
blood_type string
row format delimited fields terminated by '\t';

Import data:

load data local inpath '/opt/module/datas/person_info.txt'
into table default.person_info;

The process and idea of writing sql statements:
1. According to the requirements of the target format, first splice the constellation and blood group together

    concat(constellation,',',blood_type) con_blood,

2. Gather the names of people with the same constellation blood group

    collect_set(name) name_arr
group by con_blood;t2

3. Process the array of name


4. The above is a process for writing this sql. Finally, sort it out to form the following complete sql:

    collect_set(name) name_arr
    concat(constellation,',',blood_type) con_blood,
group by con_blood)t2;

Note: UDF: one in one out, UDTF: one in many out, UDAF: many in one out, where collect_set,sum,count, etc. are aggregate functions, which belong to UDAF

Column to row

1) Function Description:
Expand (Col): split the complex Array or Map structure in the hive column into multiple rows.
Usage: final view udtf (expression) tablealias as columnalias
Explanation: used with split, expand and other udtfs. It can split a column of data into multiple rows of data. On this basis, it can aggregate the split data.

Data preparation, local file creation: test.txt:

songsong,bingbing_lili,xiao song:18_xiaoxiao song:19,hui long guan_beijing
yangyang,caicai_susu,xiao yang:18_xiaoxiao yang:19,chao yang_beijing

Create a table based on the format of the data:

create table test3(
name string,
friends array<string>,
children map<string,int>,
address struct<street:string,city:string>
row format delimited fields terminated by ','
collection items terminated by '_'
map keys terminated by ':'
lines terminated by '\n';

be careful:
1. When writing and creating this table, I made some records: when writing the type of map, the string is followed by a comma instead of a colon
2. After specifying the separator of each format, do not use commas for segmentation, and just wrap the line directly,
3. Lines instead of lines are used when writing line segmentation

Using the explode burst function, one in and many out, the results are as follows:

hive (default)> select explode(friends) from test3;

Test with another set of data:

create table movie_info(
    movie string,
    category string
row format delimited fields terminated by '\t';

vi movie.txt:

<Suspect tracking	Suspense,action,science fiction,plot
<Lie to me>	Suspense,gangster ,action,psychology,plot
<War wolf 2	Warfare,action,disaster

Import data:

load data local inpath '/opt/module/datas/movie.txt'
into table movie_info;

Here, you need to separate each data in the category column. If you directly use the expand function,

select explode(category) from movie_info; 

The following error will be reported:

UDFArgumentException explode() takes an array or a map as a parameter

The parameter of the expand function must be array or map. When we create the table, the category column is specified as string, so we need to convert it into an array first, which can be realized through the split method.

hive (default)> select split(category,',') from movie_info;
["Suspense","action","science fiction","plot"]
["Suspense","gangster ","action","psychology","plot"]

As shown in the figure above, convert it into an array through the split method.
Then, through the explode function, the elements in each row can be cut:

hive (default)> select explode(split(category,',')) from movie_info;
 science fiction

On the basis of the above, if you add a field after select, an error will appear, such as

select movie,explode(split(category,',')) from movie_info;

Errors are reported as follows:

ILED: SemanticException [Error 10081]: UDTF's are not supported outside the SELECT clause, nor nested in expressions

We want to add the corresponding movie name in front of each type above, that is, it needs to be associated with the fields in the original table, but the number of rows is not symmetrical. Therefore, we need to use the lateral view function to add the profile table:

from movie_info
lateral view 
explode(split(category,",")) movie_info_tmp as category_name;

The query results are as follows:

movie	category_name
<Suspect tracking	Suspense
<Suspect tracking	action
<Suspect tracking	science fiction
<Suspect tracking	plot
<Lie to me>	Suspense
<Lie to me>	gangster 
<Lie to me>	action
<Lie to me>	psychology
<Lie to me>	plot
<War wolf 2	Warfare
<War wolf 2	action
<War wolf 2	disaster

Note: row to column and column to row need to be clarified here.
Row to column: it is to convert multiple rows of data into columns. Think of the concat function
Column to row: explode the data in a column into multiple rows

Tags: Big Data Hadoop hive

Posted on Sun, 24 Oct 2021 14:25:23 -0400 by mospeed