HBase (medium)
12. Integration of HBase and MapReduce
The data in HBase is ultimately stored on HDFS, and HBase has native support for MapReduce (MR). We can process the data in HBase directly with MR, and MR can write the processed results straight back into HBase.
Requirement: read the data of one HBase table and write it into another HBase table. Note: TableMapper and TableReducer are used to read data from and write data to HBase.
http://hbase.apache.org/2.0/book.html#mapreduce
Requirement 1: read the data in the myuser table and write it to another table in HBase
Here, we write the name and age fields of the f1 column family in the myuser table into the f1 column family in the myuser2 table
Step 1: create the myuser2 table
Note: the name of the column family should be the same as that of the column family in the myuser table
hbase(main):010:0> create 'myuser2','f1'
Step 2: create maven project and import jar package
Note: on the basis of importing jar packages from previous projects, add the following jar packages
```xml
<!-- https://mvnrepository.com/artifact/org.apache.hbase/hbase-mapreduce -->
<dependency>
    <groupId>org.apache.hbase</groupId>
    <artifactId>hbase-mapreduce</artifactId>
    <version>2.0.0</version>
</dependency>
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-client</artifactId>
    <version>2.7.5</version>
</dependency>
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-hdfs</artifactId>
    <version>2.7.5</version>
</dependency>
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-common</artifactId>
    <version>2.7.5</version>
</dependency>
```
Step 3: develop MR program
Define mapper class
```java
public class HBaseMapper extends TableMapper<Text, Put> {
    /**
     * @param key     rowkey
     * @param value   Result that encapsulates one row of data
     * @param context
     * @throws IOException
     * @throws InterruptedException
     */
    @Override
    protected void map(ImmutableBytesWritable key, Result value, Context context) throws IOException, InterruptedException {
        // f1: name, age    f2: xxx
        // Get the rowkey
        byte[] bytes = key.get();
        Put put = new Put(bytes);
        // Get all the cells in the Result
        List<Cell> cells = value.listCells();
        for (Cell cell : cells) {
            // Determine which column family the cell belongs to
            byte[] family = CellUtil.cloneFamily(cell);
            // Determine which column the cell belongs to
            byte[] qualifier = CellUtil.cloneQualifier(cell);
            if (Bytes.toString(family).equals("f1")) {
                if (Bytes.toString(qualifier).equals("name") || Bytes.toString(qualifier).equals("age")) {
                    put.add(cell);
                }
            }
        }
        if (!put.isEmpty()) {
            context.write(new Text(Bytes.toString(bytes)), put);
        }
    }
}
```
Define reducer class
```java
/**
 * Text                   key2 type
 * Put                    value2 type
 * ImmutableBytesWritable k3 type
 * Mutation (Put)         v3 type
 *
 * put 'myuser2','rowkey','f1:name','zhangsan'
 * The Java API writes through the Put object
 */
public class HBaseReducer extends TableReducer<Text, Put, ImmutableBytesWritable> {
    /**
     * @param key     our key2
     * @param values  our v2
     * @param context writes our data out
     * @throws IOException
     * @throws InterruptedException
     */
    @Override
    protected void reduce(Text key, Iterable<Put> values, Context context) throws IOException, InterruptedException {
        for (Put put : values) {
            context.write(new ImmutableBytesWritable(key.toString().getBytes()), put);
        }
    }
}
```
Define the main method for program running
```java
public class HBaseMrMain extends Configured implements Tool {
    @Override
    public int run(String[] args) throws Exception {
        Job job = Job.getInstance(super.getConf(), "hbaseMR");
        Scan scan = new Scan();
        // Use the utility class to initialize our mapper
        /**
         * String table, Scan scan, Class<? extends TableMapper> mapper,
         * Class<?> outputKeyClass, Class<?> outputValueClass, Job job
         */
        TableMapReduceUtil.initTableMapperJob("myuser", scan, HBaseMapper.class, Text.class, Put.class, job);
        /**
         * String table, Class<? extends TableReducer> reducer, Job job
         */
        TableMapReduceUtil.initTableReducerJob("myuser2", HBaseReducer.class, job);
        boolean b = job.waitForCompletion(true);
        return b ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        Configuration configuration = HBaseConfiguration.create();
        configuration.set("hbase.zookeeper.quorum", "node01:2181,node02:2181");
        int run = ToolRunner.run(configuration, new HBaseMrMain(), args);
        System.exit(run);
    }
}
```
Step 4: run
The first mode of operation: local operation
Directly select the class where the main method is located and run it
The second mode of operation: package cluster operation
Note that we need to use the packaging plug-in to put the dependent jar packages of HBase into the project jar package
Step 1: add a packaging plug-in to pom.xml
```xml
<plugin>
    <groupId>org.apache.maven.plugins</groupId>
    <artifactId>maven-shade-plugin</artifactId>
    <version>2.4.3</version>
    <executions>
        <execution>
            <phase>package</phase>
            <goals>
                <goal>shade</goal>
            </goals>
            <configuration>
                <minimizeJar>true</minimizeJar>
            </configuration>
        </execution>
    </executions>
</plugin>
```
Step 2: add job.setJarByClass to the driver
In the run method, add the following line so that the cluster can locate the classes inside our jar (the driver class defined above is HBaseMrMain):

```java
job.setJarByClass(HBaseMrMain.class);
```
Step 3: package with maven
Then package, upload the jar package to the linux server, and execute
yarn jar hbaseStudy-1.0-SNAPSHOT.jar cn.itcast.hbasemr.HBaseMR
Alternatively, we can set the environment variables ourselves and run the smaller original jar package (the one without bundled dependencies):
export HADOOP_HOME=/export/servers/hadoop-2.7.5/
export HBASE_HOME=/export/servers/hbase-2.0.0/
export HADOOP_CLASSPATH=$(${HBASE_HOME}/bin/hbase mapredcp)
yarn jar original-hbaseStudy-1.0-SNAPSHOT.jar cn.itcast.hbasemr.HbaseMR
Requirement 2: read the HDFS file and write it into the HBase table
Read the HDFS path /hbase/input/user.txt, and then write the data to the myuser2 table
Step 1: prepare data files
Prepare data files and upload them to HDFS
```
hdfs dfs -mkdir -p /hbase/input
cd /export/servers/
vim user.txt
```

Contents of user.txt (fields separated by tabs):

```
0007	zhangsan	18
0008	lisi	25
0009	wangwu	20
```

Upload the file to HDFS:

```
hdfs dfs -put user.txt /hbase/input
```
Step 2: develop MR program
Define mapper class
```java
public class HdfsMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
    /**
     * No processing is done in the map phase; the data is written out directly
     */
    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        context.write(value, NullWritable.get());
    }
}
```
Define reducer class
```java
public class HBaseWriteReducer extends TableReducer<Text, NullWritable, ImmutableBytesWritable> {
    /**
     * @param key
     * @param values
     * @param context
     * @throws IOException
     * @throws InterruptedException
     */
    @Override
    protected void reduce(Text key, Iterable<NullWritable> values, Context context) throws IOException, InterruptedException {
        String[] split = key.toString().split("\t");
        Put put = new Put(split[0].getBytes());
        put.addColumn("f1".getBytes(), "name".getBytes(), split[1].getBytes());
        put.addColumn("f1".getBytes(), "age".getBytes(), split[2].getBytes());
        // ImmutableBytesWritable can wrap our rowkey (or a value)
        context.write(new ImmutableBytesWritable(split[0].getBytes()), put);
    }
}
```
Define the main method for program running
```java
public class Hdfs2HBaseMain extends Configured implements Tool {
    @Override
    public int run(String[] args) throws Exception {
        Job job = Job.getInstance(super.getConf(), "hdfsToHbase");
        // Step 1: read the file and parse it into key/value pairs
        job.setInputFormatClass(TextInputFormat.class);
        TextInputFormat.addInputPath(job, new Path("hdfs://node01:8020/hbase/input"));
        // Step 2: the custom mapper receives k1/v1 and outputs a new k2/v2
        job.setMapperClass(HdfsMapper.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(NullWritable.class);
        // Steps 3-6: partition, sort, combine, group (the defaults are used here)
        // Step 7: reduce logic, receives k2/v2 and outputs a new k3/v3
        TableMapReduceUtil.initTableReducerJob("myuser2", HBaseWriteReducer.class, job);
        // Step 8: output the data
        boolean b = job.waitForCompletion(true);
        return b ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        Configuration configuration = HBaseConfiguration.create();
        configuration.set("hbase.zookeeper.quorum", "node01:2181,node02:2181");
        int run = ToolRunner.run(configuration, new Hdfs2HBaseMain(), args);
        System.exit(run);
    }
}
```

Alternatively, the mapper, reducer and driver can be combined into a single class:

```java
public class Hdfs2Hbase extends Configured implements Tool {
    @Override
    public int run(String[] args) throws Exception {
        Job job = Job.getInstance(super.getConf(), "hdfs2Hbase");
        job.setJarByClass(Hdfs2Hbase.class);
        job.setInputFormatClass(TextInputFormat.class);
        TextInputFormat.addInputPath(job, new Path("hdfs://node01:8020/hbase/input"));
        job.setMapperClass(HdfsMapper.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(NullWritable.class);
        TableMapReduceUtil.initTableReducerJob("myuser2", HBaseReducer.class, job);
        job.setNumReduceTasks(1);
        boolean b = job.waitForCompletion(true);
        return b ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        Configuration configuration = HBaseConfiguration.create();
        int run = ToolRunner.run(configuration, new Hdfs2Hbase(), args);
        System.exit(run);
    }

    public static class HdfsMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
        @Override
        protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            context.write(value, NullWritable.get());
        }
    }

    public static class HBaseReducer extends TableReducer<Text, NullWritable, ImmutableBytesWritable> {
        @Override
        protected void reduce(Text key, Iterable<NullWritable> values, Context context) throws IOException, InterruptedException {
            String[] split = key.toString().split("\t");
            Put put = new Put(Bytes.toBytes(split[0]));
            put.addColumn("f1".getBytes(), "name".getBytes(), split[1].getBytes());
            put.addColumn("f1".getBytes(), "age".getBytes(), Bytes.toBytes(Integer.parseInt(split[2])));
            context.write(new ImmutableBytesWritable(Bytes.toBytes(split[0])), put);
        }
    }
}
```
Requirement 3: bulk load data into HBase
There are many ways to load data into HBase. We can use the HBase Java API or Sqoop to write or import data, but these methods are either slow or occupy Region resources during the import, which makes them inefficient. We can also use an MR program to convert our data directly into HFile, HBase's final storage format, and then load those files straight into HBase.
Each table in HBase is stored under the HBase root directory on HDFS (/hbase) in a folder named after the table. Under the table folder, each Region is stored in its own folder; each column family under a Region is again a folder, and that folder holds the HFile files. HFile is the format in which HBase data is stored on HDFS, so the final representation of HBase storage files on HDFS is HFile. If we can convert our data directly to the HFile format, HBase can load those HFile files directly.
advantage:
1. The import process does not occupy Region resources
2. It can quickly import massive data
3. Save memory
Normal reading and writing process of HBase data
With bulk load, we generate the data directly in HFile format and then load the files straight into an HBase table.
Requirement: convert the data file /hbase/input/user.txt on HDFS into HFile format, and then load it into the table myuser2.
Step 1: define our mapper class
```java
/**
 * LongWritable           k1 type
 * Text                   v1 type
 * ImmutableBytesWritable rowkey
 * Put                    the object to insert
 */
public class BulkLoadMapper extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {
    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String[] split = value.toString().split("\t");
        Put put = new Put(split[0].getBytes());
        put.addColumn("f1".getBytes(), "name".getBytes(), split[1].getBytes());
        put.addColumn("f1".getBytes(), "age".getBytes(), split[2].getBytes());
        context.write(new ImmutableBytesWritable(split[0].getBytes()), put);
    }
}
```

Step 2: develop our main program entry class

```java
public class BulkLoadMain extends Configured implements Tool {
    @Override
    public int run(String[] args) throws Exception {
        Configuration conf = super.getConf();
        Connection connection = ConnectionFactory.createConnection(conf);
        Table table = connection.getTable(TableName.valueOf("myuser2"));
        Job job = Job.getInstance(conf, "bulkLoad");
        // Read the file and parse it into key/value pairs
        job.setInputFormatClass(TextInputFormat.class);
        TextInputFormat.addInputPath(job, new Path("hdfs://node01:8020/hbase/input"));
        // Define our mapper class
        job.setMapperClass(BulkLoadMapper.class);
        job.setMapOutputKeyClass(ImmutableBytesWritable.class);
        job.setMapOutputValueClass(Put.class);
        // The reduce process is omitted
        /**
         * Job job, Table table, RegionLocator regionLocator
         * configureIncrementalLoad configures which table and column families our HFiles are generated for
         */
        HFileOutputFormat2.configureIncrementalLoad(job, table, connection.getRegionLocator(TableName.valueOf("myuser2")));
        // Set the output format so the data is written out in HFile format
        job.setOutputFormatClass(HFileOutputFormat2.class);
        // Set the output path
        HFileOutputFormat2.setOutputPath(job, new Path("hdfs://node01:8020/hbase/hfile_out"));
        boolean b = job.waitForCompletion(true);
        return b ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        Configuration configuration = HBaseConfiguration.create();
        configuration.set("hbase.zookeeper.quorum", "node01:2181,node02:2181");
        int run = ToolRunner.run(configuration, new BulkLoadMain(), args);
        System.exit(run);
    }
}
```
Step 3: type the code into a jar package and run it
yarn jar original-hbaseStudy-1.0-SNAPSHOT.jar cn.itcast.hbasemr.HBaseLoad
Step 4: develop code and load data
Load the HFile file under our output path into our hbase table
```java
public class LoadData {
    public static void main(String[] args) throws Exception {
        Configuration configuration = HBaseConfiguration.create();
        configuration.set("hbase.zookeeper.property.clientPort", "2181");
        configuration.set("hbase.zookeeper.quorum", "node01,node02,node03");

        Connection connection = ConnectionFactory.createConnection(configuration);
        Admin admin = connection.getAdmin();
        Table table = connection.getTable(TableName.valueOf("myuser2"));
        LoadIncrementalHFiles load = new LoadIncrementalHFiles(configuration);
        // The path must be the HFile output path of the bulk load job
        load.doBulkLoad(new Path("hdfs://node01:8020/hbase/hfile_out"), admin, table,
                connection.getRegionLocator(TableName.valueOf("myuser2")));
    }
}
```
Or we can load data through the command line
First, add the jar package of hbase to the classpath of hadoop
export HBASE_HOME=/export/servers/hbase-2.0.0/
export HADOOP_HOME=/export/servers/hadoop-2.7.5/
export HADOOP_CLASSPATH=$(${HBASE_HOME}/bin/hbase mapredcp)
Then execute the following command to import the HFile of hbase directly into the table myuser2
yarn jar /export/servers/hbase-2.0.0/lib/hbase-mapreduce-2.0.0.jar completebulkload /hbase/hfile_out myuser2
13. Comparison of HBase and hive
Hive
Data warehouse tools
In essence, Hive builds a mapping (stored as metadata in MySQL) between the files already stored on HDFS and database tables, making it convenient to manage and query them with HQL.
For data analysis and cleaning
Hive is suitable for offline data analysis and cleaning with high delay
Based on HDFS and MapReduce
The data stored in Hive is still on the DataNode, and the HQL statement written will eventually be converted into MapReduce code for execution.
HBase
NoSQL database
A non-relational database oriented to column-family storage.
Used to store structured and unstructured data
It is applicable to the storage of single table non relational data. It is not suitable for association query, such as JOIN and so on.
Based on HDFS
Data is persisted as HFile files, which are stored on the DataNodes and managed by the RegionServers in the form of regions.
Low delay, access to online services
Facing large amounts of enterprise data, HBase can store massive data in a single table while still providing efficient data access.
Summary: Hive and HBase
Summary: Hive and HBase are two different Hadoop-based technologies. Hive is a SQL-like engine that runs MapReduce jobs, while HBase is a NoSQL key/value database on top of Hadoop. The two can be used together, just as Google is used for search and Facebook for social networking: Hive can be used for statistical queries, HBase for real-time queries, and data can be written from Hive into HBase or read back from HBase into Hive.
14. Integration of hive and HBase
Hive and HBase each have their own strengths and serve different purposes, but in the end both store their data on HDFS. Normally, to save disk space, we avoid storing the same data in multiple places, which wastes disk space. We can store the data only in HBase, integrate HBase with Hive, and then analyze the data in HBase directly with SQL statements, which is very convenient.
Requirement 1: save the data of hive analysis results to HBase
Step 1: copy the five dependent jar packages of hbase to the lib directory of hive
Copy the five jar packages of HBase to the lib directory of hive
The jar packages of HBase are in / export/servers/hbase-2.0.0/lib
We need to copy five jar packages with the following names
hbase-client-2.0.0.jar
hbase-hadoop2-compat-2.0.0.jar
hbase-hadoop-compat-2.0.0.jar
hbase-it-2.0.0.jar
hbase-server-2.0.0.jar
We can simply execute the following commands on node03 to create soft links (symbolic links) to these jar packages
```
ln -s /export/servers/hbase-2.0.0/lib/hbase-client-2.0.0.jar /export/servers/apache-hive-2.1.0-bin/lib/hbase-client-2.0.0.jar
ln -s /export/servers/hbase-2.0.0/lib/hbase-hadoop2-compat-2.0.0.jar /export/servers/apache-hive-2.1.0-bin/lib/hbase-hadoop2-compat-2.0.0.jar
ln -s /export/servers/hbase-2.0.0/lib/hbase-hadoop-compat-2.0.0.jar /export/servers/apache-hive-2.1.0-bin/lib/hbase-hadoop-compat-2.0.0.jar
ln -s /export/servers/hbase-2.0.0/lib/hbase-it-2.0.0.jar /export/servers/apache-hive-2.1.0-bin/lib/hbase-it-2.0.0.jar
ln -s /export/servers/hbase-2.0.0/lib/hbase-server-2.0.0.jar /export/servers/apache-hive-2.1.0-bin/lib/hbase-server-2.0.0.jar
```
Step 2: modify hive's configuration file
Edit the hive configuration file hive-site.xml on the node03 server and add the following two lines of configuration
cd /export/servers/apache-hive-2.1.0-bin/conf
vim hive-site.xml
```xml
<property>
    <name>hive.zookeeper.quorum</name>
    <value>node01,node02,node03</value>
</property>
<property>
    <name>hbase.zookeeper.quorum</name>
    <value>node01,node02,node03</value>
</property>
```
Step 3: modify the hive-env.sh configuration file and add the following configuration
cd /export/servers/apache-hive-2.1.0-bin/conf
vim hive-env.sh
```
export HADOOP_HOME=/export/servers/hadoop-2.7.5
export HBASE_HOME=/export/servers/hbase-2.0.0
export HIVE_CONF_DIR=/export/servers/apache-hive-2.1.0-bin/conf
```
Step 4: create a table in hive and load the following data
hive intermediate table
Enter hive client
cd /export/servers/apache-hive-2.1.0-bin/
bin/hive
Create the hive database and the database table corresponding to hive
```sql
create database course;
use course;
create external table if not exists course.score(id int, cname string, score int)
row format delimited fields terminated by '\t' stored as textfile;
```
The prepared data are as follows:
node03 executes the following command to prepare the data file
cd /export/
vim hive-hbase.txt
The fields are separated by tabs:

```
1	zhangsan	80
2	lisi	60
3	wangwu	30
4	zhaoliu	70
```
Load data
Enter hive client to load data
hive (course)> load data local inpath '/export/hive-hbase.txt' into table score;
hive (course)> select * from score;
Step 5: create hive management table to map with HBase
We can create a hive management table to map with the table in hbase. The data in the hive management table will be stored on hbase
Create internal tables in hive
```sql
create table course.hbase_score(id int, cname string, score int)
stored by 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
with serdeproperties("hbase.columns.mapping" = "cf:name,cf:score")
tblproperties("hbase.table.name" = "hbase_score");

-- Insert data through insert overwrite ... select
insert overwrite table course.hbase_score select id, cname, score from course.score;
```
Step 6: view the hbase_score table in HBase
Enter the HBase client to view the hbase_score table and its data
```
hbase(main):023:0> list
TABLE
hbase_score
myuser
myuser2
student
user
5 row(s) in 0.0210 seconds
=> ["hbase_score", "myuser", "myuser2", "student", "user"]

hbase(main):024:0> scan 'hbase_score'
ROW    COLUMN+CELL
 1     column=cf:name, timestamp=1550628395266, value=zhangsan
 1     column=cf:score, timestamp=1550628395266, value=80
 2     column=cf:name, timestamp=1550628395266, value=lisi
 2     column=cf:score, timestamp=1550628395266, value=60
 3     column=cf:name, timestamp=1550628395266, value=wangwu
 3     column=cf:score, timestamp=1550628395266, value=30
 4     column=cf:name, timestamp=1550628395266, value=zhaoliu
 4     column=cf:score, timestamp=1550628395266, value=70
4 row(s) in 0.0360 seconds
```
Requirement 2: create a Hive external table that maps an existing table in HBase
Step 1: create a table in HBase and manually insert and load some data
Enter the shell client of HBase, manually create a table, insert and load some data
```
create 'hbase_hive_score',{ NAME =>'cf'}

put 'hbase_hive_score','1','cf:name','zhangsan'
put 'hbase_hive_score','1','cf:score','95'
put 'hbase_hive_score','2','cf:name','lisi'
put 'hbase_hive_score','2','cf:score','96'
put 'hbase_hive_score','3','cf:name','wangwu'
put 'hbase_hive_score','3','cf:score','97'
```
The operation was successful and the results are as follows:
```
hbase(main):049:0> create 'hbase_hive_score',{ NAME =>'cf'}
0 row(s) in 1.2970 seconds
=> Hbase::Table - hbase_hive_score

hbase(main):050:0> put 'hbase_hive_score','1','cf:name','zhangsan'
0 row(s) in 0.0600 seconds

hbase(main):051:0> put 'hbase_hive_score','1','cf:score', '95'
0 row(s) in 0.0310 seconds

hbase(main):052:0> put 'hbase_hive_score','2','cf:name','lisi'
0 row(s) in 0.0230 seconds

hbase(main):053:0> put 'hbase_hive_score','2','cf:score', '96'
0 row(s) in 0.0220 seconds

hbase(main):054:0> put 'hbase_hive_score','3','cf:name','wangwu'
0 row(s) in 0.0200 seconds

hbase(main):055:0> put 'hbase_hive_score','3','cf:score', '97'
0 row(s) in 0.0250 seconds
```
Step 2: create the external table of hive and map the tables and fields in HBase
Create an external table in hive
Enter the hive client, and then execute the following command to create the hive external table to map the table data in HBase
```sql
CREATE external TABLE course.hbase2hive(id int, name string, score int)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf:name,cf:score")
TBLPROPERTIES("hbase.table.name" = "hbase_hive_score");
```
15. Pre partition of HBase
1. Why pre partition?
- Increase data reading and writing efficiency
- Load balancing to prevent data skew
- Convenient cluster disaster recovery scheduling region
- Optimize the number of maps
2. How to pre partition?
Each region maintains a startRow and an endRowKey. If a newly added rowkey falls into the range maintained by a region, the data is handed to that region for maintenance.
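As a rough illustration of that range matching (not HBase's actual implementation, which resolves regions through hbase:meta), the sketch below matches a rowkey against a sorted list of split keys; the split values mirror the 'staff' example that follows:

```java
import org.apache.hadoop.hbase.util.Bytes;

public class RegionLookupSketch {
    // Returns the index of the region a rowkey falls into, given sorted split keys.
    // Region 0 covers (-infinity, splits[0]); the last region covers [splits[n-1], +infinity).
    static int regionIndex(byte[] rowKey, byte[][] splits) {
        int idx = 0;
        for (byte[] split : splits) {
            if (Bytes.compareTo(rowKey, split) < 0) {
                return idx;          // rowkey is smaller than this region's end key
            }
            idx++;
        }
        return idx;                  // rowkey is >= the last split key
    }

    public static void main(String[] args) {
        byte[][] splits = {
                Bytes.toBytes("1000"), Bytes.toBytes("2000"),
                Bytes.toBytes("3000"), Bytes.toBytes("4000")};
        System.out.println(regionIndex(Bytes.toBytes("0500"), splits)); // 0
        System.out.println(regionIndex(Bytes.toBytes("2500"), splits)); // 2
        System.out.println(regionIndex(Bytes.toBytes("9999"), splits)); // 4
    }
}
```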
3. How to set the pre partition?
1. Manually specify pre partition
hbase(main):001:0> create 'staff','info','partition1',SPLITS => ['1000','2000','3000','4000']
After completion, as shown in the figure:
2. Generate pre partition using hexadecimal algorithm
hbase(main):003:0> create 'staff2','info','partition2',{NUMREGIONS => 15, SPLITALGO => 'HexStringSplit'}
After completion, as shown in the figure:
3. Creating pre partitions using java APIs
The Java code is as follows:
```java
/**
 * Create an HBase table with pre-split regions through the Java API
 */
@Test
public void hbaseSplit() throws IOException {
    // Get a connection
    Configuration configuration = HBaseConfiguration.create();
    configuration.set("hbase.zookeeper.quorum", "node01:2181,node02:2181,node03:2181");
    Connection connection = ConnectionFactory.createConnection(configuration);
    Admin admin = connection.getAdmin();
    // Custom split keys (e.g. produced by a hash algorithm), stored in a two-dimensional byte array
    byte[][] splitKeys = {{1, 2, 3, 4, 5}, {'a', 'b', 'c', 'd', 'e'}};
    // HTableDescriptor sets the parameters of our table, including table name, column families and so on
    HTableDescriptor hTableDescriptor = new HTableDescriptor(TableName.valueOf("staff3"));
    // Add column family f1
    hTableDescriptor.addFamily(new HColumnDescriptor("f1"));
    // Add column family f2
    hTableDescriptor.addFamily(new HColumnDescriptor("f2"));
    admin.createTable(hTableDescriptor, splitKeys);
    admin.close();
}
```
16. rowKey design skills of HBase
HBase stores data sorted along three dimensions. Data in HBase can be quickly located through these three dimensions: rowkey (row key), column key (column family and qualifier) and timestamp.
The rowkey uniquely identifies a row of records in HBase. There are several ways to query HBase (a short code sketch of all three follows the list below):
- Through the get method, specify the rowkey to obtain a unique record
- Set the startRow and stopRow parameters in scan mode for range matching
- Full table scan, that is, directly scan all row records in the whole table
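A minimal sketch of these three access patterns with the standard client API, assuming a table named 'myuser' with column family 'f1' (names taken from the earlier examples):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class QueryPatterns {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        conf.set("hbase.zookeeper.quorum", "node01:2181,node02:2181");
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("myuser"))) {

            // 1. Get: fetch a single row by its rowkey
            Result one = table.get(new Get(Bytes.toBytes("0001")));

            // 2. Scan with startRow/stopRow: range match on the rowkey
            Scan range = new Scan().withStartRow(Bytes.toBytes("0001"))
                                   .withStopRow(Bytes.toBytes("0010"));
            try (ResultScanner rs = table.getScanner(range)) {
                for (Result r : rs) { /* process r */ }
            }

            // 3. Full table scan: no start/stop row
            try (ResultScanner rs = table.getScanner(new Scan())) {
                for (Result r : rs) { /* process r */ }
            }
        }
    }
}
```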
1. Rowkey length principle
The rowkey is a binary byte stream and can be any string, with a maximum length of 64 KB. In practice it is usually 10-100 bytes, stored as a byte[], and it is generally designed with a fixed length (see the sketch after the list below).
It is recommended to be as short as possible, not more than 16 bytes, for the following reasons:
- The data persistence file HFile is stored as KeyValues. If the rowkey is too long, for example 100 bytes, then for 10 million rows of data the rowkeys alone occupy 100 bytes × 10,000,000 = 1,000,000,000 bytes, nearly 1 GB, which greatly reduces the storage efficiency of HFile;
- MemStore caches some data into memory. If the rowkey field is too long, the effective utilization of memory will be reduced. The system cannot cache more data, which will reduce the retrieval efficiency.
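A small illustration of the fixed-length recommendation, assuming a numeric user id that we left-pad to 10 characters so that all rowkeys have the same length and sort consistently (the width of 10 is an illustrative assumption):

```java
import org.apache.hadoop.hbase.util.Bytes;

public class FixedLengthRowKey {
    // Left-pad the id with zeros to a fixed width of 10 characters
    static byte[] rowKey(long userId) {
        return Bytes.toBytes(String.format("%010d", userId));
    }

    public static void main(String[] args) {
        System.out.println(Bytes.toString(rowKey(7L)));      // 0000000007
        System.out.println(Bytes.toString(rowKey(123456L))); // 0000123456
    }
}
```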
2 rowkey hash principle
If the rowkey increases with a timestamp, do not put the time at the front of the binary key. It is recommended to use the high-order bytes of the rowkey as a hash field, generated by the program, and the low-order bytes as the time field. This improves the probability that data is evenly distributed across the RegionServers and achieves load balancing. If there is no hash field and the first field is the time itself, all new data concentrates on one RegionServer; during retrieval the load also concentrates on individual RegionServers, causing hotspot issues and reducing query efficiency.
3 rowkey uniqueness principle
Uniqueness must be guaranteed by the design. Rowkeys are stored in dictionary (lexicographic) order, so when designing rowkeys we should make full use of this sorting characteristic: store data that is frequently read together in one block, and keep data that is likely to be accessed together in one block.
4 what are hot spots
Rows in HBase are sorted by the dictionary order of their rowkeys. This design optimizes scans, because related rows, and rows that will be read together, are stored close to each other. However, poor rowkey design is the source of hotspots.
Hotspots occur when a large number of clients directly access one or a few nodes of the cluster (the access may be reads, writes or other operations). The heavy traffic can push the single machine hosting the hot region beyond its capacity, causing performance degradation or even unavailability of that region, and it also affects the other regions on the same RegionServer because the host can no longer serve their requests.
A good data access pattern makes full and balanced use of the cluster. To avoid write hotspots, rowkeys should be designed so that rows that truly belong together land in the same region, while overall the data is written to multiple regions of the cluster rather than to a single one. Below are some common ways to avoid hotspots, together with their advantages and disadvantages:
1 add salt
Salting here is not the salting used in cryptography: it means adding a random number in front of the rowkey, i.e. assigning a random prefix so the key starts differently from the previous rowkeys. The number of distinct prefixes should match the number of regions you want the data scattered over. After salting, the rowkeys are scattered across the regions according to their randomly generated prefix, avoiding hotspots.
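A minimal salting sketch, assuming we want to spread writes over 4 buckets; because the prefix is random, reading a single row back later requires checking all possible prefixes (the bucket count and key format are illustrative assumptions):

```java
import java.util.concurrent.ThreadLocalRandom;
import org.apache.hadoop.hbase.util.Bytes;

public class SaltedRowKey {
    private static final int BUCKETS = 4; // should match how widely we want the data spread

    // Prepend a random bucket prefix such as "2_" to the original rowkey
    static byte[] saltedKey(String originalKey) {
        int bucket = ThreadLocalRandom.current().nextInt(BUCKETS);
        return Bytes.toBytes(bucket + "_" + originalKey);
    }

    public static void main(String[] args) {
        System.out.println(Bytes.toString(saltedKey("20190220_user0001")));
    }
}
```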
2 hash
Hashing always gives the same row the same prefix, so it can also spread the load across the whole cluster while keeping reads predictable. With a deterministic hash, the client can reconstruct the complete rowkey and use a get operation to fetch exactly one row.
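A minimal sketch of a deterministic hash prefix, here using the first byte of an MD5 digest of the original key; the two-character prefix length and the key format are illustrative assumptions:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import org.apache.hadoop.hbase.util.Bytes;

public class HashedRowKey {
    // Prefix the rowkey with two hex characters derived from its MD5 digest.
    // The same original key always produces the same prefix, so gets remain possible.
    static byte[] hashedKey(String originalKey) throws Exception {
        byte[] digest = MessageDigest.getInstance("MD5")
                .digest(originalKey.getBytes(StandardCharsets.UTF_8));
        String prefix = String.format("%02x", digest[0] & 0xff);
        return Bytes.toBytes(prefix + "_" + originalKey);
    }

    public static void main(String[] args) throws Exception {
        System.out.println(Bytes.toString(hashedKey("20190220_user0001")));
    }
}
```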
3 reverse
The third way to prevent hotspots is to reverse a rowkey that has a fixed length or a numeric format. This places the frequently changing part of the rowkey (the least meaningful part) first. It effectively randomizes rowkeys, but at the expense of their ordering.
A typical example is using a mobile phone number as the rowkey: the reversed phone number string is used as the rowkey instead, which avoids the hotspot caused by all keys starting with the same fixed phone number prefix.
4 timestamp reversal
A common data processing problem is to quickly obtain the latest version of a piece of data, and using a reversed timestamp as part of the rowkey is very useful here. Append Long.MAX_VALUE - timestamp to the end of the key, e.g. [key][reverse_timestamp]. The latest value of [key] can then be obtained by scanning for [key] and taking the first record, because rowkeys in HBase are sorted and the first record found is the most recently written one.
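A minimal sketch of the reversed-timestamp pattern described above; the [key][reverse_timestamp] layout follows the text, while the underscore separator and zero-padding width are illustrative assumptions:

```java
import org.apache.hadoop.hbase.util.Bytes;

public class ReverseTimestampKey {
    // Build [key]_[Long.MAX_VALUE - timestamp]; newer events sort first for the same key
    static byte[] rowKey(String key, long timestampMillis) {
        long reverseTs = Long.MAX_VALUE - timestampMillis;
        return Bytes.toBytes(key + "_" + String.format("%019d", reverseTs));
    }

    public static void main(String[] args) {
        byte[] older = rowKey("user0001", 1_550_000_000_000L);
        byte[] newer = rowKey("user0001", 1_560_000_000_000L);
        // Negative result: the newer record sorts before the older one
        System.out.println(Bytes.compareTo(newer, older));
    }
}
```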
Other suggestions:
Minimize the size of row keys and column families. In HBase, value is always transmitted with its key. When a specific value is transmitted between systems, its rowkey, column name and timestamp will also be transmitted together. If your rowkey and column name are large, they will take up a lot of storage space at this time.
The column family should be as short as possible, preferably one character.
Long attribute names are readable, but shorter attribute names are better stored in HBase.
17. HBase coprocessor
http://hbase.apache.org/book.html#cp
1. Origin
As a column-family database, HBase is most often criticized for the difficulty of building a secondary index and of performing operations such as sums, counts and sorting. For example, in older versions (< 0.92), the total number of rows in a table could only be obtained by running a Counter-style MapReduce job. Although HBase integrates MapReduce at the data storage layer and can be used effectively for distributed computation over tables, in many cases, for simple additions or aggregations, placing the computation on the server side reduces communication overhead and yields a good performance improvement. For this reason, HBase introduced coprocessors after 0.92, enabling some exciting new features: easy secondary indexes, complex filters (predicate push-down) and access control.
2. There are two coprocessors: observer and endpoint
(1) Observer coprocessors are similar to triggers in a traditional database: they are invoked by the server side when certain events occur. Observer coprocessors are hooks scattered through the HBase server-side code that are called when fixed events happen. For example, there is a prePut hook function that is called by the RegionServer before a put operation is executed, and a postPut hook function that runs after the put operation.
Taking HBase 2.0.0 as an example, it provides three observer interfaces:
● RegionObserver: provide data manipulation event hooks of the client: Get, Put, Delete, Scan, etc.
● WALObserver: provide WAL related operation hooks.
● MasterObserver: provide DDL type operation hook. Such as creating, deleting and modifying data tables.
Since version 0.96, a RegionServerObserver has also been added.
The following figure illustrates the principle of this coprocessor with RegionObserver as an example:
(2) Endpoint coprocessors are similar to stored procedures in a traditional database. A client can call an endpoint coprocessor to execute a piece of code on the server side and have the result returned to the client for further processing. The most common use is aggregation. Without coprocessors, when a user needs to find the maximum value in a table (a max aggregation), a full table scan is required and the client code must traverse the scan results to find the maximum. This cannot exploit the concurrency of the underlying cluster, because all the computation is centralized on the client, and it is bound to be inefficient. With coprocessors, the user deploys the max-value code to the HBase servers, and HBase uses multiple nodes of the underlying cluster to compute the maximum concurrently: the code runs inside each Region, the per-Region maximum is computed on the RegionServer side, and only that maximum is returned to the client. The client then processes the per-Region maximums to find the overall maximum.
In this way, the overall implementation efficiency will be greatly improved
The following figure shows how EndPoint works:
(3) Summary
Observer allows the cluster to behave differently during normal client operations
Endpoint allows you to extend the capabilities of the cluster and open new computing commands to client applications
observer is similar to the trigger in RDBMS. It mainly works on the server
endpoint is similar to the stored procedure in RDBMS and mainly works on the client side
observer can realize such functions as permission management, priority setting, monitoring, ddl control, secondary index, etc
endpoint can realize min, max, avg, sum, distinct, group by and other functions
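As a hedged client-side sketch of an aggregation endpoint, the snippet below uses the AggregationClient that ships with HBase. It assumes the AggregateImplementation coprocessor (loaded in the next section) is enabled on the table and that the f1:age values were written with Bytes.toBytes(long), so the LongColumnInterpreter can read them:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.coprocessor.AggregationClient;
import org.apache.hadoop.hbase.client.coprocessor.LongColumnInterpreter;
import org.apache.hadoop.hbase.util.Bytes;

public class EndpointMaxExample {
    public static void main(String[] args) throws Throwable {
        Configuration conf = HBaseConfiguration.create();
        conf.set("hbase.zookeeper.quorum", "node01:2181,node02:2181");

        // AggregationClient talks to the AggregateImplementation endpoint on each region
        AggregationClient aggregationClient = new AggregationClient(conf);
        Scan scan = new Scan();
        scan.addColumn(Bytes.toBytes("f1"), Bytes.toBytes("age"));

        // Each region computes its own max; only the per-region results travel to the client
        Long max = aggregationClient.max(TableName.valueOf("myuser"),
                new LongColumnInterpreter(), scan);
        System.out.println("max age = " + max);
        aggregationClient.close();
    }
}
```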
3. Coprocessor loading mode
There are two loading methods for coprocessors, which we call Static Load and Dynamic Load. Statically loaded coprocessors are called system coprocessors, and dynamically loaded coprocessors are called table coprocessors
1. Static loading
By modifying the hbase-site.xml file, you can start the global aggregation and manipulate the data on all tables. Just add the following code:
```xml
<property>
    <name>hbase.coprocessor.user.region.classes</name>
    <value>org.apache.hadoop.hbase.coprocessor.AggregateImplementation</value>
</property>
```
This loads the coprocessor class for all tables; multiple classes can be loaded by separating them with commas.
2. Dynamic loading
Enable table aggregation to take effect only for specific tables. It is implemented through HBase Shell.
Disable the target table first: hbase> disable 'mytable'
Add aggregation
```
hbase> alter 'mytable', METHOD => 'table_att','coprocessor'=>'|org.apache.hadoop.hbase.coprocessor.AggregateImplementation||'

# Re-enable the table
hbase> enable 'mytable'
```
Coprocessor uninstall: disable the table, remove the coprocessor attribute with alter ... METHOD => 'table_att_unset', NAME => 'coprocessor$1', then enable the table again (a concrete example appears at the end of the next section).
4. Coprocessor Observer application practice
When data is inserted into one HBase table, an Observer coprocessor copies it into another table, keeping only some of the columns of the first table in the second table.
Step 1: create the first table proc1 in HBase
Create a table named proc1 in HBase with only one column family, info
```
cd /export/servers/hbase-2.0.0/
bin/hbase shell
hbase(main):053:0> create 'proc1','info'
```
Step 2: create the second table proc2 in Hbase
Create the second table 'proc2' as the target table; the coprocessor will copy some of the columns inserted into the first table into 'proc2'.
hbase(main):054:0> create 'proc2','info'
Step 3: develop HBase coprocessor
Develop the coprocessor for HBase
```java
public class MyProcessor implements RegionObserver, RegionCoprocessor {

    static Connection connection = null;
    static Table table = null;

    static {
        Configuration conf = HBaseConfiguration.create();
        conf.set("hbase.zookeeper.quorum", "node01:2181");
        try {
            connection = ConnectionFactory.createConnection(conf);
            table = connection.getTable(TableName.valueOf("proc2"));
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    private RegionCoprocessorEnvironment env = null;
    private static final String FAMAILLY_NAME = "info";
    private static final String QUALIFIER_NAME = "name";

    // HBase 2.0 requires this method to be added, otherwise the coprocessor will not take effect
    @Override
    public Optional<RegionObserver> getRegionObserver() {
        // Extremely important: this is what makes the coprocessor be invoked as a RegionObserver
        return Optional.of(this);
    }

    @Override
    public void start(CoprocessorEnvironment e) throws IOException {
        env = (RegionCoprocessorEnvironment) e;
    }

    @Override
    public void stop(CoprocessorEnvironment e) throws IOException {
        // nothing to do here
    }

    /**
     * Override the prePut method to intercept the data before it is inserted
     *
     * @param e
     * @param put        the Put object that encapsulates the data being inserted into the target table
     * @param edit
     * @param durability
     * @throws IOException
     */
    @Override
    public void prePut(final ObserverContext<RegionCoprocessorEnvironment> e, final Put put,
                       final WALEdit edit, final Durability durability) throws IOException {
        try {
            // Get the rowkey of the inserted data from the Put object
            byte[] rowBytes = put.getRow();
            String rowkey = Bytes.toString(rowBytes);
            // Get the cells of the name column
            List<Cell> list = put.get(Bytes.toBytes(FAMAILLY_NAME), Bytes.toBytes(QUALIFIER_NAME));
            if (list == null || list.size() == 0) {
                return;
            }
            // Get the cell of the info column family, name column
            Cell cell2 = list.get(0);
            // Get the data value
            String nameValue = Bytes.toString(CellUtil.cloneValue(cell2));
            // Create a Put object and insert the data into the proc2 table
            Put put2 = new Put(rowkey.getBytes());
            put2.addColumn(Bytes.toBytes(FAMAILLY_NAME), Bytes.toBytes(QUALIFIER_NAME), nameValue.getBytes());
            table.put(put2);
            table.close();
        } catch (Exception e1) {
            return;
        }
    }
}
```
Step 4: type the project into a jar package and upload it to HDFS
Print our coprocessor into a jar package. There is no need to use any packaging plug-ins here, and then upload it to hdfs
Upload the jar package to the / export/servers path of linux
```
cd /export/servers
mv original-hbase-1.0-SNAPSHOT.jar processor.jar
hdfs dfs -mkdir -p /processor
hdfs dfs -put processor.jar /processor
```
Step 5: mount the jar package into the proc1 table
```
hbase(main):056:0> describe 'proc1'
hbase(main):055:0> alter 'proc1',METHOD => 'table_att','Coprocessor'=>'hdfs://node01:8020/processor/processor.jar|cn.itcast.hbasemr.demo4.MyProcessor|1001|'
```

Then describe the 'proc1' table again to check:

```
hbase(main):043:0> describe 'proc1'
```
You can see that our coprocessor has been loaded
Step 6: add data to proc1 table
Enter the HBase shell client, and then directly execute the following command to add data to the proc1 table
```
put 'proc1','0001','info:name','zhangsan'
put 'proc1','0001','info:age','28'
put 'proc1','0002','info:name','lisi'
put 'proc1','0002','info:age','25'
```
Add data to the proc1 table, and then
scan 'proc2'
We will find that the data has also been inserted into the proc2 table, and it contains only the info column family with the name column
Note: if you need to uninstall our coprocessor, enter the shell command line of hbase and execute the following commands
```
disable 'proc1'
alter 'proc1',METHOD=>'table_att_unset',NAME=>'coprocessor$1'
enable 'proc1'
```
18. Basic introduction of secondary index in HBase
Because HBase's query capability is relatively weak, complex statistical requirements such as select name,salary,count(1),max(salary) from user group by name,salary order by salary are basically impossible or very difficult to implement directly, so when using HBase we generally adopt a secondary index scheme.
The primary index of HBase is the rowkey, and we can only retrieve data by rowkey. If we want to run combined queries on columns of the column families in HBase, we need a secondary index scheme for multi-condition queries. Common schemes include:
- MapReduce scheme
- ITHBASE (Indexed Transactional HBase) scheme
- IHBASE (Index HBase) scheme
- HBase coprocessor scheme
- Solr + HBase scheme
- CCIndex (comprehensive clustering index) scheme
In practice, secondary indexes are also commonly implemented with other tools such as Phoenix, Solr or Elasticsearch (ES).
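As a hedged sketch of the basic idea behind an index table (not any of the frameworks above), the snippet below writes each record to the data table and, in the same client code, writes a reverse mapping (indexed column value → data rowkey) into a separate index table; the table and column names ('myuser', 'myuser_name_index', 'f1') are illustrative assumptions:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class ManualSecondaryIndex {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        conf.set("hbase.zookeeper.quorum", "node01:2181,node02:2181");
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table data = connection.getTable(TableName.valueOf("myuser"));
             Table index = connection.getTable(TableName.valueOf("myuser_name_index"))) {

            String rowkey = "0001";
            String name = "zhangsan";

            // 1. Write the record into the data table
            Put dataPut = new Put(Bytes.toBytes(rowkey));
            dataPut.addColumn(Bytes.toBytes("f1"), Bytes.toBytes("name"), Bytes.toBytes(name));
            data.put(dataPut);

            // 2. Write the index row: the rowkey of the index table is the indexed value,
            //    and the cell stores the rowkey of the data table
            Put indexPut = new Put(Bytes.toBytes(name));
            indexPut.addColumn(Bytes.toBytes("f1"), Bytes.toBytes("rowkey"), Bytes.toBytes(rowkey));
            index.put(indexPut);

            // To query by name: read the index row first, then fetch the data row by its rowkey
            Result idx = index.get(new Get(Bytes.toBytes(name)));
            byte[] dataKey = idx.getValue(Bytes.toBytes("f1"), Bytes.toBytes("rowkey"));
            Result record = data.get(new Get(dataKey));
            System.out.println(Bytes.toString(
                    record.getValue(Bytes.toBytes("f1"), Bytes.toBytes("name"))));
        }
    }
}
```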
19. HBase integration hue
1. Introduction to Hue
HUE=Hadoop User Experience
When there is no HUE, if we want to view the status of each component of the Hadoop ecosystem, we can use their webconsole addresses:
- HDFS NameNode web UI: http://ip:50070
- SecondaryNameNode web UI: http://ip:50090
- Yarn: http://ip:8088
- HBase: http://ip:16010
- Hive: http://ip:9999/hwi/
- Spark: http://ip:8080
It is certainly possible to check them one by one, but that is time-consuming and a little troublesome. HUE integrates all of these: in one place you can view the status of all the components above and perform some operations.
Hue is an open-source Apache Hadoop UI system that evolved from Cloudera Desktop. Cloudera eventually contributed it to the Apache Foundation's Hadoop community. It is implemented on the Python web framework Django.
By using Hue, we can interact with Hadoop cluster on the browser side Web console to analyze and process data, such as operating data on HDFS, running MapReduce Job, executing Hive SQL statement, browsing HBase database, etc.
HUE link
- Site: http://gethue.com/
- Github: https://github.com/cloudera/hue
- Reviews: https://review.cloudera.org
Hue's architecture
Core functions
- SQL editor, support Hive, Impala, MySQL, Oracle, PostgreSQL, SparkSQL, Solr SQL, Phoenix
- Various charts of search engine Solr
- Friendly interface support for Spark and Hadoop
- Support the scheduling system Apache Oozie, which can edit and view workflow
These functions provided by HUE are friendlier than the interfaces provided by the individual components of the Hadoop ecosystem, but some scenarios that require debugging may still need the native systems to dig deeper into the cause of an error.
When viewing Oozie workflow in HUE, you can also easily see the DAG diagram of the whole workflow. However, the DAG diagram has been removed in the latest version. You can only see the action list in workflow and the jump relationship between them. Those who want to see the DAG diagram can still use oozie's native interface system.
1. Access HDFS and file browsing
2. Through web debugging and development, hive and data result display
3. Query solr and result display, report generation
4. Debug and develop impala interactive SQL Query through web
5. spark debugging and development
6. oozie task development, monitoring, and workflow coordination and scheduling
7. HBase data query and modification, data display
8. Hive metadata query
9. MapReduce task progress viewing and log tracking
10. Create and submit MapReduce, Streaming and Java job tasks
11. Development and debugging of Sqoop2
12. Browsing and editing of Zookeeper
13. Query and display of databases (MySQL, PostGres, SQlite, Oracle)
One-sentence summary: Hue is a friendly interface-integration framework that can integrate the various frameworks we have learned and will learn, so that all of them can be viewed and operated from a single interface.
2. Environmental preparation and installation of Hue
Hue can be installed in many ways, including rpm package, tar.gz package and cloudera manager. We use tar.gz package to install here
Step 1: Download dependent packages
The node03 server executes the following command to download dependent packages online
yum install ant asciidoc cyrus-sasl-devel cyrus-sasl-gssapi cyrus-sasl-plain gcc gcc-c++ krb5-devel libffi-devel libxml2-devel libxslt-devel make mysql mysql-devel openldap-devel python-devel sqlite-devel gmp-devel openssl-devel -y
Step 2: install and configure maven
To compile hue, you need to use maven to download some other jar packages. Here we can install maven on the node03 server
node03 execute the following command to install maven
wget http://repos.fedorapeople.org/repos/dchen/apache-maven/epel-apache-maven.repo -O /etc/yum.repos.d/epel-apache-maven.repo
sed -i 's/\$releasever/6/g' /etc/yum.repos.d/epel-apache-maven.repo
yum install -y apache-maven
mvn --version
Configure the download address of maven
vim /etc/maven/settings.xml
```xml
<mirror>
    <id>alimaven</id>
    <mirrorOf>central</mirrorOf>
    <name>aliyun maven</name>
    <url>http://maven.aliyun.com/nexus/content/groups/public/</url>
</mirror>
<mirror>
    <id>ui</id>
    <mirrorOf>central</mirrorOf>
    <name>Human Readable Name for this Mirror.</name>
    <url>http://uk.maven.org/maven2/</url>
</mirror>
<mirror>
    <id>jboss-public-repository-group</id>
    <mirrorOf>central</mirrorOf>
    <name>JBoss Public Repository Group</name>
    <url>http://repository.jboss.org/nexus/content/groups/public</url>
</mirror>
```
Step 3: add ordinary users to the linux operating system
For the installation of hue, you must add a normal user hue. Otherwise, an error is reported at startup. You can directly add a normal user to the node03 server
useradd hue
passwd hue
Step 4: download the hue compressed package and upload and unzip it
The node03 server executes the following command to download and unzip the installation package
```
cd /export/softwares
wget http://gethue.com/downloads/releases/4.0.1/hue-4.0.1.tgz
```
tar -zxf hue-4.0.1.tgz -C /export/servers/
Step 5: modify the configuration file
Modify hue's configuration file hue.ini
node03 executes the following command to modify hue's configuration file
cd /export/servers/hue-4.0.0/desktop/conf/
vim hue.ini
#General configuration
```ini
[desktop]
secret_key=jFE93j;2[290-eiw.KEiwN2s3['d;/.q[eIW^y#e=+Iei*@Mn<qW5o
http_host=node03.hadoop.com
time_zone=Asia/Shanghai
server_user=root
server_group=root
default_user=root
default_hdfs_superuser=root

# Configure mysql as hue's storage database, around line 561 of hue.ini
[[database]]
engine=mysql
host=node03.hadoop.com
port=3306
user=root
password=123456
name=hue
```
Step 6: create a mysql database
Enter the mysql client and create the mysql database
```
mysql -uroot -p
```

Create the hue database:

```sql
create database hue default character set utf8 default collate utf8_general_ci;
```
Step 7: compile hue
node03 executes the following commands to compile:

```
cd /export/servers/hue-3.9.0-cdh5.14.0
make apps
```

Note: if the compilation fails, it needs to be run again; compile a few more times, and with a fast enough network the compilation will pass:

```
make clean
make apps
```
Step 8: start the hue service and access the page
node03 executes the following command to start the service
```
cd /export/servers/hue-3.9.0-cdh5.14.0
build/env/bin/supervisor
```
Page access
For the first access, you need to set the administrator user and password
The administrator username and password set here should, as far as possible, be consistent with the username and password used for the Hadoop installation. Hadoop was installed with user root and password 123456, so log in the first time as root with password 123456.
After logging in, we find that the hue page reports an error. The error is mainly related to Hive: hue has not yet been integrated with Hive, so we need to configure the integration of hue with Hive and Hadoop. The following sections show how to do this.
3. Integration of hue with other frameworks
3.1. HDFS and yarn integration of hue and hadoop
Step 1: change the core-site.xml configuration of all hadoop nodes
Remember to restart the hdfs and yarn clusters after changing the core-site.xml
Three machines change core-site.xml
```xml
<property>
    <name>hadoop.proxyuser.root.hosts</name>
    <value>*</value>
</property>
<property>
    <name>hadoop.proxyuser.root.groups</name>
    <value>*</value>
</property>
```
Step 2: change the hdfs-site.xml of all hadoop nodes
```xml
<property>
    <name>dfs.webhdfs.enabled</name>
    <value>true</value>
</property>
```
Step 3: restart hadoop cluster
Execute the following command on the node01 machine
```
cd /export/servers/hadoop-2.7.5
sbin/stop-dfs.sh
sbin/start-dfs.sh
sbin/stop-yarn.sh
sbin/start-yarn.sh
```
Step 4: stop the service of hue and continue to configure hue.ini
cd /export/servers/hue-3.9.0-cdh5.14.0/desktop/conf
vim hue.ini
Configure our hue to integrate with hdfs
```ini
[[hdfs_clusters]]
  [[[default]]]
    fs_defaultfs=hdfs://node01.hadoop.com:8020
    webhdfs_url=http://node01.hadoop.com:50070/webhdfs/v1
    hadoop_hdfs_home=/export/servers/hadoop-2.7.5
    hadoop_bin=/export/servers/hadoop-2.7.5/bin
    hadoop_conf_dir=/export/servers/hadoop-2.7.5/etc/hadoop
```

Configure hue to integrate with yarn:

```ini
[[yarn_clusters]]
  [[[default]]]
    resourcemanager_host=node01
    resourcemanager_port=8032
    submit_to=True
    resourcemanager_api_url=http://node01:8088
    history_server_api_url=http://node01:19888
```
3.2. Configure hue and hive integration
If we need to configure the integration of hue and hive, we need to start hiveserver2 service of hive
Change hue's configuration hue.ini
Modify hue.ini
```ini
[beeswax]
  hive_server_host=node03.hadoop.com
  hive_server_port=10000
  hive_conf_dir=/export/servers/apache-hive-2.1.0-bin/conf
  server_conn_timeout=120
  auth_username=root
  auth_password=123456

[metastore]
  # Allow creating database tables and other operations through hive
  enable_new_create_table=true
```
Start hive's hiveserver2 service
Go to node03 and start hiveserver2 service of hive
cd /export/servers/apache-hive-2.1.0-bin
nohup bin/hive --service hiveserver2 &
Restart hue, and then you can operate hive through the browser page
3.5 integration of hue and HBase
Step 1: modify hue.ini
cd /export/servers/hue-3.9.0-cdh5.14.0/desktop/conf
vim hue.ini
```ini
[hbase]
  hbase_clusters=(Cluster|node01:9090)
  hbase_conf_dir=/export/servers/hbase-2.0.0/conf
```
Step 2: start the thrift server service of hbase
The first machine starts the thriftserver of hbase by executing the following command
cd /export/servers/hbase-2.0.0
bin/hbase-daemon.sh start thrift
Step 3: start hue
The third machine executes the following command to start hue
cd /export/servers/hue-3.9.0-cdh5.14.0/
build/env/bin/supervisor
Step 4: page access
20. HBase tuning
1. General optimization
1. The metadata backup of NameNode uses SSD
2. The metadata on the NameNode is backed up regularly, hourly or daily. If the data is extremely important, it can be backed up every 5 ~ 10 minutes. The backup can copy the metadata directory through the scheduled task.
3. Specify multiple metadata directories for NameNode, using dfs.name.dir or dfs.namenode.name.dir. One specifies the local disk and one specifies the network disk. This can provide redundancy and robustness of metadata to avoid failure.
4. Set dfs.namenode.name.dir.restore to true to allow you to attempt to restore the dfs.namenode.name.dir directory that failed before. Do this when creating a checkpoint. If multiple disks are set, it is recommended to allow it.
5. The NameNode node must be configured as a RAID1 (mirror disk) structure.
6. Supplement: what are Raid0, Raid0+1, Raid1 and Raid5
Standalone
The most common single disk storage method.
Cluster
Cluster storage is a storage method of distributing data to each node in the cluster, providing a single user interface and interface, so that users can easily use and manage all data in a unified way.
Hot swap
Users can remove and replace the hard disk without shutting down the system or cutting off the power supply, so as to improve the recovery ability, expansibility and flexibility of the system.
Raid0
Raid0 is the most powerful storage array of all raid types. Its working principle is to access continuous data on multiple disks. In this way, when data needs to be accessed, multiple disks can execute side by side, and each disk executes its own part of data requests, which significantly improves the overall access performance of the disk. However, it does not have fault tolerance and is suitable for desktop systems with low cost and low reliability.
Raid1
Also known as mirror disk, it mirrors the data of one disk to another disk, adopts mirror fault tolerance to improve reliability, and has the highest data redundancy capability in raid. When storing data, the data will be written into the mirror disk at the same time, and the read data will only be read from the working disk. In case of failure, the system will read data from the mirror disk, and then restore the correct data of the working disk. This array is highly reliable, but its capacity will be reduced by half. It is widely used in applications with strict data requirements, such as commercial finance, file management and other fields. Only one hard disk is allowed to fail.
Raid0+1
Combine Raid0 and Raid1 technologies, taking into account their advantages. While the data is guaranteed, it can also provide strong storage performance. However, at least 4 or more hard disks are required, but only one disk error is allowed. It is a three high technology.
Raid5
RAID5 can be seen as a low-cost solution for Raid0+1. The array mode of cyclic even check independent access is adopted. The data and the corresponding parity information are distributed and stored on each disk constituting RAID5. When one of the disk data is damaged, use the remaining disk and corresponding parity information to recover / generate the lost data without affecting the data availability. At least 3 or more hard disks are required. It is suitable for operation with large amount of data. Array mode with slightly higher cost, strong storage and strong reliability.
There are other ways of RAID, please check it yourself.
7. Keep enough space in the NameNode log directory to help you find problems.
8. Because Hadoop is an IO intensive framework, try to improve the storage speed and throughput (similar to bit width).
2. Linux optimization
1. Enabling the read ahead cache of the file system can improve the reading speed
$ sudo blockdev --setra 32768 /dev/sda
(Tip: ra is short for readahead)
2. Turn off process sleep pool
$ sudo sysctl -w vm.swappiness=0
3. Adjust the ulimit upper limit. The default value is a relatively small number
$ ulimit -n   view the maximum number of open files allowed
$ ulimit -u   view the maximum number of user processes allowed
Modification:
$sudo vi /etc/security/limits.conf modify the limit on the number of open files
Add at the end:
*    soft nofile 1024000
*    hard nofile 1024000
hive - nofile 1024000
hive - nproc  1024000
$sudo vi /etc/security/limits.d/20-nproc.conf modify the limit on the number of processes opened by the user
Amend to read:
#*    soft nproc 4096
#root soft nproc unlimited
*     soft nproc 40960
root  soft nproc unlimited
4. Turn on the time synchronization NTP of the cluster. Please refer to the previous document
5. Update the system patch (prompt: before updating the patch, please test the compatibility of the new version patch to the cluster nodes)
3. HDFS optimization (HDFS site. XML)
1. Ensure that RPC calls will have a large number of threads
Attribute: dfs.namenode.handler.count
Explanation: this attribute is the default number of threads of NameNode service. The default value is 10. It can be adjusted to 50 ~ 100 according to the available memory of the machine
Attribute: dfs.datanode.handler.count
Explanation: the default value of this attribute is 10; it is the number of processing threads of the DataNode. If the HDFS client program issues many read/write requests, it can be increased to 15~20. The larger the value, the more memory is consumed, so do not set it too high; for ordinary workloads 5~10 is enough.
2. Adjustment of the number of copies
Attribute: dfs.replication
Explanation: if the amount of data is huge and not very important, the replication factor can be lowered to 2~3; if the data is very important, it can be set to 3~5.
3. Adjustment of file block size
Attribute: dfs.blocksize
Explanation: for block size definition, this attribute should be set according to the size of a large number of single files stored. If a large number of single files are less than 100M, it is recommended to set it to 64M block size. For cases greater than 100M or reaching GB, it is recommended to set it to 256M. Generally, the setting range fluctuates between 64M and 256M.
4. MapReduce optimization (mapred site. XML)
1. Adjust the number of Job task service threads
Attribute: mapreduce.jobtracker.handler.count
Explanation: this attribute is the number of JobTracker handler threads. The default value is 10; it can be raised to 50~100 depending on the machine's available memory.
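A sketch for mapred-site.xml, assuming an MR1-style deployment with a JobTracker (which is where this property applies); 50 is an illustrative value:
<!-- mapred-site.xml -->
<property>
    <name>mapreduce.jobtracker.handler.count</name>
    <value>50</value> <!-- default 10; 50~100 depending on available memory -->
</property>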
2. Number of Http server worker threads
Attribute: mapreduce.tasktracker.http.threads
Explanation: define the number of HTTP server working threads. The default value is 40. For large clusters, it can be adjusted to 80 ~ 100
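For a large cluster, the HTTP worker threads might be raised like this (80 is only an example):
<!-- mapred-site.xml -->
<property>
    <name>mapreduce.tasktracker.http.threads</name>
    <value>80</value> <!-- default 40; 80~100 for large clusters -->
</property>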
3. File sorting merge optimization
Attribute: mapreduce.task.io.sort.factor
Explanation: the number of data streams merged simultaneously during file sorting, which also defines the number of files opened at the same time. The default value is 10. If you increase this parameter, you can significantly reduce disk IO, that is, reduce the number of file reads.
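A possible setting, assuming you are willing to hold more files open in exchange for fewer merge passes (50 is illustrative):
<!-- mapred-site.xml -->
<property>
    <name>mapreduce.task.io.sort.factor</name>
    <value>50</value> <!-- default 10; more streams merged at once means less disk IO -->
</property>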
4. Speculative execution of tasks
Attribute: mapreduce.map.speculative (and its counterpart mapreduce.reduce.speculative)
Explanation: this attribute controls speculative execution, i.e. whether a backup attempt of a slow task may run in parallel, with the first attempt to finish being kept. If a job consists of many small tasks, setting it to true can noticeably speed up execution. For tasks with very high inherent latency it is recommended to set it to false. The idea is similar to a multi-source download accelerator such as Thunder (Xunlei).
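A sketch enabling speculative execution for both map and reduce tasks; set the values to false instead for very high-latency tasks:
<!-- mapred-site.xml -->
<property>
    <name>mapreduce.map.speculative</name>
    <value>true</value> <!-- launch backup attempts for slow map tasks; first to finish wins -->
</property>
<property>
    <name>mapreduce.reduce.speculative</name>
    <value>true</value> <!-- same idea for reduce tasks -->
</property>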
5. Compression of MR output data
Properties: mapreduce.map.output.compress, mapreduce.output.fileoutputformat.compress
Explanation: for large clusters, it is recommended to set the output of map reduce to compressed data, but not for small clusters.
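For the final job output, a minimal sketch might look like this (map-output compression is configured separately, as shown in the HBase section below):
<!-- mapred-site.xml -->
<property>
    <name>mapreduce.output.fileoutputformat.compress</name>
    <value>true</value> <!-- compress the job's final output; a codec can be chosen via mapreduce.output.fileoutputformat.compress.codec -->
</property>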
6. Optimize the number of Mappers and Reducers
Properties:
mapreduce.tasktracker.map.tasks.maximum
mapreduce.tasktracker.reduce.tasks.maximum
Explanation: the above two attributes set the maximum numbers of map and reduce tasks that a single TaskTracker can run simultaneously.
When setting these two parameters, consider the number of CPU cores, the disks, and the memory capacity. Suppose a node has an 8-core CPU: if the job is very CPU-intensive, the number of map slots might be set to 4; if it is not particularly CPU-intensive, the map slots could be set to 40 and the reduce slots to 20. After changing these values, watch whether tasks wait for a long time; if so, reduce the numbers to speed up execution. If the values are set too high, they cause a lot of context switching and heavy data exchange between memory and disk. There is no standard value here; the choice has to be made from the business, the hardware configuration, and experience.
At the same time, do not run too many MapReduce jobs at once, or they will consume too much memory and tasks will run very slowly. Set a maximum MR task concurrency based on the number of CPU cores and the memory capacity, so that a fixed amount of task data can be held fully in memory, avoiding frequent memory-disk exchange, reducing disk IO, and improving performance.
Approximate estimation formula:
map = 2 + (2/3) × cpu_core
reduce = 2 + (1/3) × cpu_core
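As a worked example of the formula, and assuming an MR1-style TaskTracker deployment (these are TaskTracker properties), an 8-core node gives map = 2 + (2/3)×8 ≈ 7 and reduce = 2 + (1/3)×8 ≈ 5, which might be configured like this:
<!-- mapred-site.xml, per-TaskTracker task slots for an assumed 8-core node -->
<property>
    <name>mapreduce.tasktracker.map.tasks.maximum</name>
    <value>7</value>
</property>
<property>
    <name>mapreduce.tasktracker.reduce.tasks.maximum</name>
    <value>5</value>
</property>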
5. HBase optimization
1. Append content to HDFS files
Isn't appending to HDFS files supposed to be disallowed? It is in fact supported; here is the background:
Attribute: dfs.support.append
Files: hdfs-site.xml, hbase-site.xml
Explanation: enabling HDFS append support lets HDFS cooperate with HBase's data synchronization and persistence. The default value is true.
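Since the default is already true, this normally just needs to be stated explicitly; a minimal sketch for both files:
<!-- hdfs-site.xml and hbase-site.xml -->
<property>
    <name>dfs.support.append</name>
    <value>true</value> <!-- allow appends so HBase can sync and persist data to HDFS -->
</property>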
2. Optimize the maximum number of files a DataNode is allowed to serve at once
Attribute: dfs.datanode.max.transfer.threads
File: hdfs-site.xml
Explanation: HBase generally operates on a large number of files at once. Set this to 4096 or higher depending on the cluster size and the volume of data operations. Default: 4096
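An illustrative bump for a busy cluster (8192 is an assumed value, not a recommendation):
<!-- hdfs-site.xml -->
<property>
    <name>dfs.datanode.max.transfer.threads</name>
    <value>8192</value> <!-- default 4096; HBase touches many files at once -->
</property>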
3. Optimize the timeout for high-latency data operations
Attribute: dfs.image.transfer.timeout
File: hdfs-site.xml
Explanation: if a data operation has very high latency and the socket needs to wait longer, it is recommended to raise this value (the default is 60000 milliseconds) so that the socket does not time out.
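A hedged example doubling the timeout (the value is in milliseconds):
<!-- hdfs-site.xml -->
<property>
    <name>dfs.image.transfer.timeout</name>
    <value>120000</value> <!-- default 60000 ms; larger values avoid socket timeouts on slow operations -->
</property>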
4. Optimize data write efficiency
Properties:
mapreduce.map.output.compress
mapreduce.map.output.compress.codec
File: mapred-site.xml
Explanation: enabling these two settings can greatly improve file write efficiency and reduce write time. Set the first attribute to true and the second to org.apache.hadoop.io.compress.GzipCodec
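A sketch matching the values given above:
<!-- mapred-site.xml -->
<property>
    <name>mapreduce.map.output.compress</name>
    <value>true</value>
</property>
<property>
    <name>mapreduce.map.output.compress.codec</name>
    <value>org.apache.hadoop.io.compress.GzipCodec</value>
</property>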
5. Optimize DataNode storage
Attribute: dfs.datanode.failed.volumes.tolerated
File: hdfs-site.xml
Explanation: the default value is 0, which means that when a disk in a DataNode fails, it will be considered that the DataNode has been shut down. If it is changed to 1, when a disk fails, the data will be copied to other normal datanodes, and the current DataNode will continue to work.
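For example, to keep a DataNode in service after a single disk failure:
<!-- hdfs-site.xml -->
<property>
    <name>dfs.datanode.failed.volumes.tolerated</name>
    <value>1</value> <!-- default 0: any disk failure marks the whole DataNode as down -->
</property>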
6. Set the number of RPC listeners
Attribute: hbase.regionserver.handler.count
File: hbase-site.xml
Explanation: the default value is 30. It is used to specify the number of RPC listeners. It can be adjusted according to the number of requests from the client. This value is increased when there are many read-write requests.
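An illustrative increase for a read/write-heavy cluster (100 is just an example value):
<!-- hbase-site.xml -->
<property>
    <name>hbase.regionserver.handler.count</name>
    <value>100</value> <!-- default 30; more RPC listeners for many concurrent client requests -->
</property>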
7. Optimize HStore file size
Attribute: hbase.hregion.max.filesize
File: hbase-site.xml
Explanation: the default value is 10737418240 (10GB). If you need to run HBase MR tasks, you can reduce this value, because one region corresponds to one map task; if a single region is too large, its map task takes too long. This value means that once an HFile in the region grows to this size, the region is split into two regions.
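For MR-heavy workloads the region size might be lowered, for example to 5 GB (value in bytes, illustrative only):
<!-- hbase-site.xml -->
<property>
    <name>hbase.hregion.max.filesize</name>
    <value>5368709120</value> <!-- default 10737418240 (10 GB); smaller regions mean shorter map tasks -->
</property>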
8. Optimize the HBase client write buffer
Attribute: hbase.client.write.buffer
File: hbase-site.xml
Explanation: it is used to specify the HBase client cache. Increasing this value can reduce the number of RPC calls, but it will consume more memory. Otherwise, it will be the opposite. Generally, we need to set a certain cache size to reduce the number of RPCs.
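A sketch doubling the client write buffer to trade memory for fewer RPC calls (4 MB here is illustrative; the default is 2097152, i.e. 2 MB):
<!-- hbase-site.xml (the buffer can also be set per connection on the client side) -->
<property>
    <name>hbase.client.write.buffer</name>
    <value>4194304</value> <!-- 4 MB; a larger buffer means fewer RPC calls but more memory use -->
</property>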
9. Specify the number of rows fetched by each scan.next call
Attribute: hbase.client.scanner.caching
File: hbase-site.xml
Explanation: used to specify the default number of rows obtained by the scan.next method. The larger the value, the greater the memory consumption.
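An example setting, assuming rows are small enough that fetching 100 per scan.next call is an acceptable memory trade-off:
<!-- hbase-site.xml (can also be set per scan in code via Scan.setCaching) -->
<property>
    <name>hbase.client.scanner.caching</name>
    <value>100</value> <!-- rows returned per scanner.next(); larger values use more memory -->
</property>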
6. Memory optimization
HBase needs a lot of memory, since tables can be cached in memory. Generally, 70% of the machine's available memory is given to HBase's Java heap. However, a very large heap is not recommended, because if a GC pause lasts too long, the RegionServer will be unavailable for that whole time. 16~48G of heap memory is usually enough. Also note that if the framework takes so much memory that the operating system's own services are starved, the framework will be dragged down along with the system.
7. JVM optimization
File involved: hbase-env.sh
1. Parallel GC
Parameter: -XX:+UseParallelGC
Explanation: turn on parallel GC
2. Number of threads simultaneously processing garbage collection
Parameter: -XX:ParallelGCThreads=N (N is typically cpu_core - 1)
Explanation: this property sets the number of threads that process garbage collection at the same time.
3. Disable manual GC
Parameter: -XX:+DisableExplicitGC
Explanation: prevents developers from manually invoking GC
8. Zookeeper optimization
1. Optimize Zookeeper session timeout
Parameter: zookeeper.session.timeout
File: hbase-site.xml
Explanation: in hbase-site.xml, set zookeeper.session.timeout to 30 seconds or less to bound failure detection (20~30 seconds is a good starting point). This value directly determines how long it takes the HMaster to notice that a server is down (the default in recent HBase releases is 90 seconds). If the value is too small, then when HBase is writing a large amount of data and a long GC pause occurs, the RegionServer cannot send heartbeats to ZooKeeper in time and the node ends up being declared dead and shut down. A cluster of about 20 nodes is usually served with 5 ZooKeeper nodes.
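A sketch for hbase-site.xml using the 30-second starting point mentioned above (the value is in milliseconds):
<!-- hbase-site.xml -->
<property>
    <name>zookeeper.session.timeout</name>
    <value>30000</value> <!-- how long ZooKeeper waits before declaring a RegionServer dead -->
</property>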