Hive related optimization

1. Hive related optimization

1.1 hive compression configuration

What's the use of compression?

'Benefit':
	Store more data in the same limited space
'Disadvantage':
	Compression and decompression consume extra CPU resources
  • Optimize MR through compression to improve efficiency
Position one: 'map phase output'
	'Benefit one': when reduce pulls the data, the total volume is smaller because it is already compressed, which reduces network bandwidth usage and speeds up the pull
	'Benefit two': in the special case where the MR job has a map stage but no reduce stage, the map output is the final result; compressing it reduces disk storage and improves disk utilization

Position two: 'reduce phase output'
	The reduce output is the final output; it lands on HDFS as the final files.
	Compressing these files reduces disk storage and improves disk utilization
  • Which compression scheme should you use?
What indicators should you look at when choosing a compression scheme?  Compression ratio and decompression performance

Explanation:
	'Best compression ratio: zlib and gz'
	'Best decompression performance: LZO and snappy'

	'Best cost performance: snappy'

Summary:
	For data that is 'written a lot and read rarely', prioritize compression ratio: zlib (ODS layer)
	For data that is 'read a lot', prioritize decompression performance: snappy (the other layers)

	If the company's servers in the production environment have plenty of disk space, it is recommended to use the snappy compression scheme everywhere
  • Compression schemes provided by hadoop

     Matters needing attention:
     	The default apache version of hadoop does not support the snappy compression scheme; to use it, hadoop must be recompiled
     	Some commercial hadoop distributions support the snappy compression scheme directly: for example the CDH edition
    
  • Configuration of compression scheme

map phase compression configuration:
1) Enable hive intermediate transfer data compression
set hive.exec.compress.intermediate=true;
2) Enable mapreduce map output compression
set mapreduce.map.output.compress=true;
3) Set the compression codec for mapreduce map output data
set mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;


reduce phase compression configuration:
1) Enable hive final output data compression
set hive.exec.compress.output=true;
2) Enable mapreduce final output data compression
set mapreduce.output.fileoutputformat.compress=true;
3) Set the compression codec for mapreduce final output data
set mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;

4) Set mapreduce final output to block compression
set mapreduce.output.fileoutputformat.compress.type=BLOCK;
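
For example, all of these settings can be issued together in one session before the query runs. A minimal sketch (the table names ods_log and dws_log are hypothetical):

-- a minimal sketch: enable snappy for both map output and final output,
-- then run an insert (the table names ods_log/dws_log are hypothetical)
set hive.exec.compress.intermediate=true;
set mapreduce.map.output.compress=true;
set mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;
set hive.exec.compress.output=true;
set mapreduce.output.fileoutputformat.compress=true;
set mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;
set mapreduce.output.fileoutputformat.compress.type=BLOCK;

insert overwrite table dws_log select * from ods_log;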

1.2 hive data storage format

The data storage formats supported by hive mainly include: textfile (row storage), sequencefile (row storage), ORC (columnar storage),
parquet (columnar storage)
  • What are row storage and column storage, and what are their advantages and disadvantages?

ORC:

	1) The data is divided into multiple stripe fragments
	2) Within each stripe, the data is stored by column
	3) Each stripe fragment contains index information

Experimental comparison shows:
	the ORC format performs excellently for both storage and query
  • Common table creation format (pay special attention):
create table log_orc_snappy(
    track_time string,
    url string,
    session_id string,
    referer string,
    ip string,
    end_user_id string,
    city_id string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
-- Focus on the following two lines: the storage format is ORC and the data compression scheme is SNAPPY
STORED AS orc tblproperties ("orc.compress"="SNAPPY"); -- commonly used for tables in the other layers
 or:
STORED AS orc tblproperties ("orc.compress"="ZLIB"); -- commonly used for ODS layer tables (or STORED AS textFile when load data is required)

Matters needing attention:
	only textFile supports loading data via the load data statement; all other storage formats
	must be populated via insert + select, as shown in the sketch below
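
A minimal sketch of that pattern (the staging table log_text and the local file path are hypothetical): stage the raw '\t'-delimited file into a textFile table via load data, then insert + select into the ORC table created above:

-- a minimal sketch: load data only works for the textFile staging table;
-- the ORC table is then populated via insert + select
-- (the table log_text and the file path are hypothetical)
create table log_text(
    track_time string,
    url string,
    session_id string,
    referer string,
    ip string,
    end_user_id string,
    city_id string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS textFile;

load data local inpath '/tmp/log.data' into table log_text;

insert into table log_orc_snappy select * from log_text;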

1.3 other hive optimizations

1.3.1 hive fetch local fetch

fetch local fetch: when running SQL, skip MR whenever possible and read the data directly from HDFS instead
  • In which cases can MR be skipped?
In hive, fetch local fetch is mainly controlled through one configuration:
	hive.fetch.task.conversion
		Optional values:
			more (default value)
			minimal
			none
more: when the value is more, hive will not go through MR in the following cases
	Case 1: querying all data: select * from table
	Case 2: querying certain fields: select field... from table
	Case 3: performing a simple filter operation: select field... from table where field='value'
	Case 4: executing limit: select * from table limit 10

minimal: when the value is minimal, hive will not go through MR in the following cases
	Case 1: querying all data: select * from table
	Case 2: filtering on partition columns
	Case 3: executing limit: select * from table limit 10

none: when the value is none, every SQL query in hive goes through MR


In the production environment, more is the usual choice, and more is exactly the default, so no setting is required
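
For a quick check, run a fetch-eligible query and observe that no MR job is launched. A minimal sketch against the log_orc_snappy table from section 1.2:

-- with the default more, this limit query is served by a fetch task
-- reading directly from HDFS; no MR job is launched
set hive.fetch.task.conversion=more;
select * from log_orc_snappy limit 10;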

1.3.2 hive local mode

hive local mode: when executing MR, try to run the job in local mode instead of submitting it to the yarn cluster
Relevant hive configuration:
	set hive.exec.mode.local.auto=true; -- enable hive local mode, default is false
 Conditions for running locally: local mode is used only when both of the following conditions are met
	data size: set hive.exec.mode.local.auto.inputbytes.max=134217728; -- the default is 128M
	number of files: set hive.exec.mode.local.auto.tasks.max=4; -- the default is 4

Production environment configuration:
	It is recommended to simply enable local auto mode, because hive automatically identifies whether a job can run locally

In actual use this setting is only really effective in the test environment; configuring it on production data makes little difference, because production data far exceeds these thresholds
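
Still, for a test environment the settings can be enabled together in one session. A minimal sketch (reusing the log_orc_snappy table from section 1.2):

-- a minimal sketch for a test environment: hive decides per query whether
-- the job is small enough to run locally instead of on the yarn cluster
set hive.exec.mode.local.auto=true;
set hive.exec.mode.local.auto.inputbytes.max=134217728;  -- 128M, the default
set hive.exec.mode.local.auto.tasks.max=4;               -- the default
select count(1) from log_orc_snappy;  -- small input, so it runs locally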

1.3.3 join query optimization

By default, how does the MR job translated from the SQL implement a join? As a reduce-side join: both tables are shuffled by the join key, and the actual joining happens in the reduce stage.

Existing problems:
		1) Data skew may occur
		2) High pressure on the reduce side

How can these problems be solved?

  • Solution: map join
Suitable for the case where a small table joins a big table

map join related parameter configuration:
	set hive.auto.convert.join = true; -- enable map join, default is true
	set hive.mapjoin.smalltable.filesize = 25000000; -- size threshold (in bytes) below which a table counts as small
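
When the small table is below the threshold, hive converts the join automatically and broadcasts the small table to every map task. A minimal sketch (the table names fact_log and dim_city are hypothetical):

-- a minimal sketch: dim_city is below the smalltable threshold, so hive
-- loads it into memory on every map task and the join needs no reduce
-- (both table names are hypothetical)
set hive.auto.convert.join = true;
set hive.mapjoin.smalltable.filesize = 25000000;

select f.url, c.city_name
from fact_log f
join dim_city c on f.city_id = c.city_id;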
  • Medium table join large table:

    • Scheme 1: if you can filter in advance, it is recommended to filter before the join to reduce the amount of data involved in the join
    • Scheme 2: if the join key fields contain a large number of nulls, it is recommended to replace the null values with random numbers so the rows scatter across reduces
    • Scheme 3: implement map join on bucketed tables (bucket map join)
  • Large table join large table:

    • Scheme 1: if you can filter in advance, it is recommended to filter before the join to reduce the amount of data involved in the join
    • Scheme 2: if the join key fields contain a large number of nulls, it is recommended to replace the null values with random numbers so the rows scatter across reduces
    • Scheme 3: implement map join on bucketed tables (SMB map join); see the sketch after this list
  •  Please note: in current versions, when a large table joins a small table the order does not matter;
     		but in older versions it was recommended to put the small table first and the large table after it
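
A minimal sketch of scheme 3 for two large tables, assuming both are bucketed and sorted on the join key with the same bucket count (the table names, columns, and bucket count of 4 are hypothetical):

-- a minimal sketch of an SMB map join: both tables are bucketed and
-- sorted by the join key with the same number of buckets
-- (table names, columns, and the bucket count are hypothetical)
create table big_a(id string, val string)
clustered by (id) sorted by (id) into 4 buckets
stored as orc;

create table big_b(id string, val string)
clustered by (id) sorted by (id) into 4 buckets
stored as orc;

set hive.optimize.bucketmapjoin = true;
set hive.optimize.bucketmapjoin.sortedmerge = true;
set hive.auto.convert.sortmerge.join = true;

select a.id, a.val, b.val
from big_a a
join big_b b on a.id = b.id;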
    

1.3.4 optimization of group by

group by's data skew solution:

  • Solution 1: solve it through a combiner (map-side pre-aggregation)
Suppose we have data as follows:
s01  Zhang San  big data class 1
s02  Li Si      big data class 1
s03  Wang Wu    big data class 2
s04  Zhao Liu   big data class 1
s05  Tian Qi    big data class 2
s06  Zhou Ba    big data class 1
s07  Li Jiu     big data class 1

Requirement: how many people are there in each class?
	select class, count(1) as num from stu group by class;


map stage: suppose there are two map tasks

map1Task:
	k2        v2
  big data class 1   {s01, Zhang San, big data class 1}
  big data class 1   {s02, Li Si, big data class 1}
  big data class 2   {s03, Wang Wu, big data class 2}
  big data class 1   {s04, Zhao Liu, big data class 1}
map2Task:
    k2        v2
  big data class 2   {s05, Tian Qi, big data class 2}
  big data class 1   {s06, Zhou Ba, big data class 1}
  big data class 1   {s07, Li Jiu, big data class 1}


reduce stage: suppose there are two reduce tasks

reduce1Task: receives big data class 1
 Data received:
       k2           v2
	big data class 1   {s01, Zhang San, big data class 1}
	big data class 1   {s02, Li Si, big data class 1}
	big data class 1   {s04, Zhao Liu, big data class 1}
	big data class 1   {s06, Zhou Ba, big data class 1}
	big data class 1   {s07, Li Jiu, big data class 1}
Grouping operation:
	big data class 1  [{s01, Zhang San, big data class 1}, {s02, Li Si, big data class 1}, {s04, Zhao Liu, big data class 1}, {s06, Zhou Ba, big data class 1}, {s07, Li Jiu, big data class 1}]

reduce output:
       k3       v3
     big data class 1    5

reduce2Task: receives big data class 2
 Data received:
	   k2           v2
	  big data class 2   {s03, Wang Wu, big data class 2}
	  big data class 2   {s05, Tian Qi, big data class 2}

Grouping operation:
	big data class 2  [{s03, Wang Wu, big data class 2}, {s05, Tian Qi, big data class 2}]

reduce output:
       k3       v3
     big data class 2    2

We can see there is data skew between the two reduce tasks: reduce1 handles 5 records while reduce2 handles only 2

Solution 1: combiner

Suppose we have data as follows:
s01  Zhang San  big data class 1
s02  Li Si      big data class 1
s03  Wang Wu    big data class 2
s04  Zhao Liu   big data class 1
s05  Tian Qi    big data class 2
s06  Zhou Ba    big data class 1
s07  Li Jiu     big data class 1

Requirement: how many people are there in each class?
	select class, count(1) as num from stu group by class;


map stage: suppose there are two map tasks

map1Task:
	k2        v2
  big data class 1   {s01, Zhang San, big data class 1}
  big data class 1   {s02, Li Si, big data class 1}
  big data class 2   {s03, Wang Wu, big data class 2}
  big data class 1   {s04, Zhao Liu, big data class 1}

combiner (pre-aggregation): the reduce logic is pulled forward and executed first inside each map task
Output result:
   k2         v2
  big data class 1    3
  big data class 2    1

map2Task:
    k2        v2
  big data class 2   {s05, Tian Qi, big data class 2}
  big data class 1   {s06, Zhou Ba, big data class 1}
  big data class 1   {s07, Li Jiu, big data class 1}
combiner (pre-aggregation): the reduce logic is pulled forward and executed first inside each map task
Output result:
    k2        v2
  big data class 1    2
  big data class 2    1

reduce stage: two reduce tasks

reduce1Task: receives big data class 1
 Data received:
    big data class 1    3
    big data class 1    2
 Grouping operation:
     big data class 1   [3, 2]
reduce output:
     k3         v3
    big data class 1    5

reduce2Task: receives big data class 2
 Data received:
    big data class 2    1
    big data class 2    1
 Grouping operation:
	big data class 2   [1, 1]

reduce output:
      k3        v3
     big data class 2    2

 Is the skew resolved: yes, each reduce now handles a comparable amount of data
  • Solution 2: a larger-scale combiner (officially: load balancing)
Scheme: use two MR jobs. The first MR distributes all the data evenly across the different reduces; each reduce computes a partial result. This step is the load balancing process (a pre-aggregation operation).
The second MR sends identical keys to the same reduce and performs the final aggregation.

Case walkthrough:
Suppose we have data as follows:
s01  Zhang San  big data class 1
s02  Li Si      big data class 1
s03  Wang Wu    big data class 2
s04  Zhao Liu   big data class 1
s05  Tian Qi    big data class 2
s06  Zhou Ba    big data class 1
s07  Li Jiu     big data class 1

Requirement: how many people are there in each class?
	select class, count(1) as num from stu group by class;


first MR:
map1Task:
	k2        v2
  big data class 1   {s01, Zhang San, big data class 1}
  big data class 1   {s02, Li Si, big data class 1}
  big data class 2   {s03, Wang Wu, big data class 2}
  big data class 1   {s04, Zhao Liu, big data class 1}
map2Task:
    k2        v2
  big data class 2   {s05, Tian Qi, big data class 2}
  big data class 1   {s06, Zhou Ba, big data class 1}
  big data class 1   {s07, Li Jiu, big data class 1}

 Key point: the data is distributed randomly, so every reduce receives roughly the same amount of data (load balancing)

reduce receives data:

reduce1Task:
Received:
   big data class 1   {s01, Zhang San, big data class 1}
   big data class 2   {s03, Wang Wu, big data class 2}
   big data class 1   {s06, Zhou Ba, big data class 1}
   big data class 1   {s07, Li Jiu, big data class 1}
Output result:
    big data class 1     3
    big data class 2     1
reduce2Task:
Received:
   big data class 1   {s02, Li Si, big data class 1}
   big data class 1   {s04, Zhao Liu, big data class 1}
   big data class 2   {s05, Tian Qi, big data class 2}

Output result:
	big data class 1    2
	big data class 2    1

second MR: identical keys go to the same reduce

map stage:
	big data class 1     3
	big data class 2     1
	big data class 1     2
	big data class 2     1

reduce:

reduce1Task: big data class 1
 receives:
   big data class 1     3
   big data class 1     2
 result:
	big data class 1    5

reduce2Task: big data class 2
 receives:
	big data class 2     1
	big data class 2     1
 result:
	big data class 2    2

Conclusion: the second scheme solves the skew problem better than the first
Related configurations:

Scheme 1 configuration:
	set hive.map.aggr = true;  -- enable map-side pre-aggregation (combiner)
	set hive.groupby.mapaggr.checkinterval = 100000;  -- maximum number of entries each map aggregates at a time

Scheme 2 configuration:
	set hive.groupby.skewindata = true;
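
Both schemes can be enabled together before the class-count query from the walkthrough. A minimal sketch:

-- a minimal sketch: enable map-side pre-aggregation (scheme 1) and
-- skew load balancing (scheme 2), then run the walkthrough query
set hive.map.aggr = true;
set hive.groupby.mapaggr.checkinterval = 100000;
set hive.groupby.skewindata = true;

select class, count(1) as num from stu group by class;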

1.3.5 MR parallelism setting

Refers to: how to adjust the number of mapTasks and the number of reduceTasks
  • mapTask quantity adjustment:
What determines the number of mapTasks? The splitting of the input files; the split size is consistent with the block size by default
	This means the number of maps cannot be set directly; 'to adjust the number of maps, start from the input file splits'

'To make the number of mapTasks larger: make the split size smaller, or increase the number of input files'
'To make the number of mapTasks smaller: make the split size larger, or decrease the number of input files'

In hive, the focus is mainly on adjusting the number of input files
  • reduceTask quantity adjustment:
In hive the default number of reduces is -1, meaning hive adjusts it automatically based on the data volume

Adjustment scheme:
	hive.exec.reducers.bytes.per.reducer=256123456; -- default amount of data processed by each reduce
	hive.exec.reducers.max=1009; -- maximum number of reduces one MR job can run
	mapreduce.job.reduces=-1; -- manually set the number of reduces; the default -1 means automatic inference

When setting the number of reduces, keep these two principles in mind:
	'use an appropriate number of reduces for large data volumes; make the amount of data each single reduce task processes appropriate'
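
For example, the reduce count can be pinned for a single heavy query (the value 8 is only an illustration):

-- a minimal sketch: pin the reduce count for one query
set mapreduce.job.reduces=8;
select class, count(1) as num from stu group by class;
set mapreduce.job.reduces=-1;  -- restore automatic inference afterwards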

1.3.6 explain: view the execution plan

The explain keyword: shows the execution plan of the current SQL; based on the plan, the relevant optimizations above can be adjusted

Syntax format:
		explain SQL statement;
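
For example, applied to the class-count query used above:

-- prints the stage plan (map/reduce operators) without running the query
explain select class, count(1) as num from stu group by class;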

1.3.7 parallel execution mechanism

	When executing SQL, the execution may consist of multiple stages, and there may be no dependencies between some of the stages;
	in that case these stages can be scheduled to execute in parallel, which improves efficiency

Related configuration:

set hive.exec.parallel=true; -- enable parallel execution
set hive.exec.parallel.thread.number=16; -- maximum degree of parallelism allowed

Premise: parallel execution requires free resources; without resources, stages that could run in parallel still cannot run concurrently

1.3.8 strict mode

In hive, to prevent the execution of SQL with very poor performance, a strict mode is provided to impose restrictions:
1) For partitioned tables, execution is not allowed unless the where clause contains a partition-field filter that limits the scan scope
2) A statement that uses order by must also use a limit clause
3) Queries that produce a Cartesian product are restricted
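
The classic switch for strict mode is shown below; a minimal sketch (newer hive releases replace this single switch with separate hive.strict.checks.* flags):

-- a minimal sketch: enable strict mode (classic switch; newer hive
-- versions split it into separate hive.strict.checks.* flags)
set hive.mapred.mode=strict;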

1.3.9 speculative execution: generally kept disabled and not used
