Hive processes Json data

Data processing in Json format

Json data format is a commonly used data format, such as embedded point data, business end data and front and rear end calls. Therefore, it is necessary for us to learn the processing method of this data format

Prepare data

cat json.data

{"movie":"1193","rate":"5","timeStamp":"978300760","uid":"1"}
{"movie":"661","rate":"3","timeStamp":"978302109","uid":"1"}
{"movie":"914","rate":"3","timeStamp":"978301968","uid":"1"}
{"movie":"3408","rate":"4","timeStamp":"978300275","uid":"1"}
{"movie":"2355","rate":"5","timeStamp":"978824291","uid":"1"}
{"movie":"1197","rate":"3","timeStamp":"978302268","uid":"1"}
{"movie":"1287","rate":"5","timeStamp":"978302039","uid":"1"}
{"movie":"2804","rate":"5","timeStamp":"978300719","uid":"1"}
{"movie":"594","rate":"4","timeStamp":"978302268","uid":"1"}

Create hive table and load data

create table ods.ods_json_data(text string);
load data local inpath "/Users/XXX/workspace/hive/json.data" overwrite into table ods.ods_json_data;

get_json_object and json_tuple function

json_tuple does not support JSON nesting, but it supports obtaining the corresponding values of multiple top-level key s at one time

get_json_object does not support getting multiple values at once, but it supports complex JSON processing

get_json_object()

Usage: get_json_object(string json_string, string path) previously, we described how to view the usage of the function desc function get_json_object

Return value: String

Description: parse json string_ String to return the content specified by path. If the input json string is invalid, NUll is returned. This function can only return one data item at a time.

Specific example: get_json_object(value,’$.id’)

select get_json_object(text,"$.movie") from ods.ods_json_data;

The disadvantage of this function is that it can only return one value, that is, we can't extract multiple values from json at one time. If we want to extract multiple values, we need to call this function multiple times, but we will introduce json below_ Tuple is OK, but it doesn't mean that this function is not strong or anything. Remembering the api of this function can save you a lot of time

json_tuple

Usage: JSON_ The pn sorted by tuple (jsonstr, P1, P2,..., pn) is the key we want to extract

Return value: tuple(v1,...vn) the return value V1... VN here corresponds to the key P1.... PN

select json_tuple(text,'movie','rate','timeStamp','uid') from ods.ods_json_data;

json_tuple is equivalent to get_ Json_ The advantage of object is that it can parse multiple Json fields at a time.

Example demonstration

1. Processing of nested json

We talked about JSON earlier_ Tuple does not support the processing of nested JSON

select get_json_object('{"movie":"594","rate":"4","timeStamp":"978302268","uid":"1","info":{"name":"god 's favored one"}}',"$.info.name")
select json_tuple('{"movie":"594","rate":"4","timeStamp":"978302268","uid":"1","info":{"name":"god 's favored one"}}',"info.name")

2. Json array parsing (implemented by get_json_object)

SELECT get_json_object('[{"website":"www.ikeguang.com","name":"My life memory"},{"website":"beian.ikeguang.com","name":"keep on record"}]', '$.[0].website'), get_json_object('[{"website":"www.ikeguang.com","name":"My life memory"},{"website":"beian.ikeguang.com","name":"keep on record"}]', '$.[1].website');

At this time, you find that all I extract are websites in JSON array. Is there any simple way? Theoretically, get_json_object can only have one return value, and you need to write more than one anyway. Have you ever thought about a question? If I have 100 elements in this array that are JSON, and I need the website of each JSON, do I need to write it 100 times? If you carefully read the api of this function at this time, you will find another symbol*

SELECT get_json_object('[{"website":"www.ikeguang.com","name":"My life memory"},{"website":"beian.ikeguang.com","name":"keep on record"}]', '$.[*].website')

Now you know, get_json_object can only return one element, not only one string. The above is a JSON array. What if we parse the array from JSON?

SELECT get_json_object('{"info":[{"website":"www.ikeguang.com","name":"My life memory"},{"website":"beian.ikeguang.com","name":"keep on record"}]}', '$.info');

It should be noted that after the following operations, you will get a json string. Now you can process it again according to the above method

select get_json_object (get_json_object('{"info":[{"website":"www.ikeguang.com","name":"My life memory"},{"website":"beian.ikeguang.com","name":"keep on record"}]}', '$.info' ),'$.[1].website');

But sometimes we want to get directly instead of through such a nested method. At this time, we actually get the nested get above_ json_ The path parameter of the object function is combined

SELECT get_json_object('{"info":[{"website":"www.ikeguang.com","name":"My life memory"},{"website":"beian.ikeguang.com","name":"keep on record"}]}', '$.info[1].website');

At this time, if we go to * to bless, it will be very simple

SELECT get_json_object('{"info":[{"website":"www.ikeguang.com","name":"My life memory"},{"website":"beian.ikeguang.com","name":"keep on record"}]}', '$.info[*].website');

In fact, here we learned to specify a subscript of an array to obtain one element and * to obtain all elements. For example, I want to obtain the first three, even or odd. Ha ha, if you look back at the api, you know that a Union operator is provided to specify any subscript you want to combine to obtain

SELECT get_json_object('{"info":[{"website":"www.ikeguang.com","name":"My life memory"},{"website":"beian.ikeguang.com","name":"keep on record"}]}', '$.info[0,1].website');

Let's try to get even, odd or even numbers within a certain range. In fact, they are the array slices provided above. You can refer to the api for use

SELECT get_json_object('{"info":[{"website":"www.ikeguang.com","name":"My life memory"},{"website":"beian.ikeguang.com","name":"keep on record"},{"website":"www.ikeguang2.com","name":"My life memory"}]}', '$.info[0:2:2].website');

But I tried it and found that there was a bug in this function, which could not achieve the effect of slicing. I returned all of them every time

SELECT get_json_object('{"info":[
	{"website":"www.ikeguang.com","name":"My life memory"},
	{"website":"beian.ikeguang.com","name":"keep on record"},
	{"website":"www.ikeguang2.com","name":"My life memory"}]}', '$.info[0:2:2].website');

Load JSON data

For the above json.data data, can we process it when the data is loaded to hive instead of when it is used after loading, especially for this JSON format with a less complex nested structure

create table ods.ods_json_parse_data(
movie string,
rate string,
`timeStamp` string,
uid string)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS TEXTFILE;
load data local inpath "/Users/liuwenqiang/workspace/hive/json.data" overwrite into table ods.ods_json_parse_data;

In this method, you should pay attention to the fact that your data type and field name must match, otherwise you will report an error or cannot obtain the value. In fact, it is also possible to add a nested field based on the above data

{"movie":"1193","rate":"5","timeStamp":"978300760","uid":"1","info":{"name":"god 's favored one"}}
create table ods.ods_json_parse_data2(
movie string,
rate string,
`timeStamp` string,
uid string,
info map<string,string>)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS TEXTFILE;
load data local inpath "/Users/liuwenqiang/workspace/hive/json.data" overwrite into table ods.ods_json_parse_data2;

summary

  1. get_json_object and JSON_ Usage scenario of tuple function and its advantages and disadvantages
  2. If the json format is relatively simple, you can dispose of the json when creating a table and loading data. If it is complex, you can also parse a part when loading, and then parse it through SQL
  3. You can also try to write some UDF functions to handle JSON

Pay attention to the official account: big data technology, reply to "information" and receive 1024G information.

Tags: Hadoop hive

Posted on Wed, 01 Dec 2021 09:41:51 -0500 by coderb