Basic principles and applications of HBase (Part 2)

HBase (medium)

12. Integration of HBase and MapReduce

The data in HBase is ultimately stored on HDFS, and HBase natively supports MapReduce (MR). We can process the data in HBase directly with MR, and MR can write its results directly back into HBase.

Requirement: read the data of one table in HBase, then write that data to another table in HBase. Note: we can use TableMapper and TableReducer to read data from and write data to HBase.

http://hbase.apache.org/2.0/book.html#mapreduce

Requirement 1: read the data in the myuser table and write it to another table in HBase

Here, we write the name and age fields of the f1 column family in the myuser table into the f1 column family in the myuser2 table

Step 1: create the myuser2 table

Note: the name of the column family should be the same as that of the column family in the myuser table

hbase(main):010:0> create 'myuser2','f1'

Step 2: create maven project and import jar package

Note: on the basis of importing jar packages from previous projects, add the following jar packages

<!-- https://mvnrepository.com/artifact/org.apache.hbase/hbase-mapreduce -->
<dependency>
<groupId>org.apache.hbase</groupId>
<artifactId>hbase-mapreduce</artifactId>
<version>2.0.0</version>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-client</artifactId>
<version>2.7.5</version>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-hdfs</artifactId>
<version>2.7.5</version>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-common</artifactId>
<version>2.7.5</version>
</dependency>

Step 3: develop MR program

Define mapper class

public class HBaseMapper extends TableMapper<Text,Put> {

/**
 * @param key rowkey
 * @param value encapsulates one row of our data
 * @param context
 * @throws IOException
 * @throws InterruptedException
 */

@Override

protected void map(ImmutableBytesWritable key, Result value, Context context) throws IOException, InterruptedException {

// f1 name age f2 xxx

//Get our rowkey

byte[] bytes = key.get();

Put put = new Put(bytes);

//Get all the columns in the Result

List<Cell> cells = value.listCells();

for (Cell cell : cells) {

//Determine which column family you belong to

byte[] family = CellUtil.cloneFamily(cell);

//Get which column the cell belongs to

byte[] qualifier = CellUtil.cloneQualifier(cell);

if(Bytes.toString(family).equals("f1")){

if(Bytes.toString(qualifier).equals("name") || Bytes.toString(qualifier).equals("age")){

put.add(cell);

}

}

}

if(!put.isEmpty()){

context.write(new Text(Bytes.toString(bytes)),put);

}

}

}

Define reducer class

/**
 * Text: the type of key2
 * Put: the type of value2
 * ImmutableBytesWritable: the type of k3
 * Mutation (Put): the type of v3
 * put 'myuser2','rowkey','f1:name','zhangsan'
 * the Java API writes through the Put object
 */

public class HBaseReducer extends TableReducer<Text,Put,ImmutableBytesWritable> {

/**
 * @param key our key2
 * @param values our v2
 * @param context used to write our data out
 * @throws IOException
 * @throws InterruptedException
 */

@Override

protected void reduce(Text key, Iterable<Put> values, Context context) throws IOException, InterruptedException {

for (Put put : values) {

context.write(new ImmutableBytesWritable(key.toString().getBytes()),put);

}

}

}

Define the main method for program running

public class HBaseMrMain extends Configured implements Tool {

@Override

public int run(String[] args) throws Exception {

Job job = Job.getInstance(super.getConf(), "hbaseMR");

Scan scan = new Scan();

//Use the tool to initialize our mapper class

/**
 * String table, Scan scan,
 * Class<? extends TableMapper> mapper,
 * Class<?> outputKeyClass,
 * Class<?> outputValueClass, Job job
 */

TableMapReduceUtil.initTableMapperJob("myuser",scan,HBaseMapper.class, Text.class, Put.class,job);

/**
 * String table,
 * Class<? extends TableReducer> reducer, Job job
 */

TableMapReduceUtil.initTableReducerJob("myuser2",HBaseReducer.class,job);

boolean b = job.waitForCompletion(true);

return b?0:1;

}

public static void main(String[] args) throws Exception {

Configuration configuration = HBaseConfiguration.create();

configuration.set("hbase.zookeeper.quorum", "node01:2181,node02:2181");

int run = ToolRunner.run(configuration, new HBaseMrMain(), args);

System.exit(run);

}

}

Step 4: run

The first mode of operation: local operation

Directly select the class where the main method is located and run it

The second mode of operation: package cluster operation

Note that we need to use the packaging plug-in to put the dependent jar packages of HBase into the project jar package

Step 1: add a packaging plug-in to pom.xml

<plugin>

<groupId>org.apache.maven.plugins</groupId>

<artifactId>maven-shade-plugin</artifactId>

<version>2.4.3</version>

<executions>

<execution>

<phase>package</phase>

<goals>

<goal>shade</goal>

</goals>

<configuration>

<minimizeJar>true</minimizeJar>

</configuration>

</execution>

</executions>

</plugin>

Step 2: add the following line in the run method

job.setJarByClass(HBaseMrMain.class);

Step 3: package with maven

Then package the project, upload the jar package to the linux server, and execute

yarn jar hbaseStudy-1.0-SNAPSHOT.jar cn.itcast.hbasemr.HBaseMrMain

Or we can set the environment variables ourselves and run the smaller original jar package (the one without bundled dependencies)

export HADOOP_HOME=/export/servers/hadoop-2.7.5/

export HBASE_HOME=/export/servers/hbase-2.0.0/

export HADOOP_CLASSPATH=$(${HBASE_HOME}/bin/hbase mapredcp)

yarn jar original-hbaseStudy-1.0-SNAPSHOT.jar cn.itcast.hbasemr.HBaseMrMain

Requirement 2: read the HDFS file and write it into the HBase table

Read the hdfs path / hbase/input/user.txt, and then write the data to the myuser2 table

Step 1: prepare data files

Prepare data files and upload them to HDFS

hdfs dfs -mkdir -p /hbase/input

cd /export/servers/

vim user.txt

0007 zhangsan 18

0008 lisi 25

0009 wangwu 20

Upload it to HDFS

hdfs dfs -put user.txt /hbase/input

Step 2: develop MR program

Define mapper class

public class HdfsMapper extends Mapper<LongWritable,Text,Text,NullWritable> {

/**
 * We do no processing in the map phase and directly write our data out
 */

@Override

protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {

context.write(value,NullWritable.get());

}

}


Define reducer class

public class HBaseWriteReducer extends TableReducer<Text,NullWritable,ImmutableBytesWritable> {

/**
 * @param key
 * @param values
 * @param context
 * @throws IOException
 * @throws InterruptedException
 */

@Override

protected void reduce(Text key, Iterable<NullWritable> values, Context context) throws IOException, InterruptedException {

String[] split = key.toString().split("\t");

Put put = new Put(split[0].getBytes());

put.addColumn("f1".getBytes(),"name".getBytes(),split[1].getBytes());

put.addColumn("f1".getBytes(),"age".getBytes(),split[2].getBytes());

//ImmutableBytesWritable can wrap our rowkey, our value, etc.

context.write(new ImmutableBytesWritable(split[0].getBytes()),put);

}

}

Define the main method for program running

public class Hdfs2HBaseMain extends Configured implements Tool {

@Override

public int run(String[] args) throws Exception {

Job job = Job.getInstance(super.getConf(), "hdfsToHbase");

//Step 1: read the file and parse it into key and value pairs

job.setInputFormatClass(TextInputFormat.class);

TextInputFormat.addInputPath(job,new Path("hdfs://node01:8020/hbase/input"));

//Step 2: customize the mapper to receive k1 v1 and convert it into a new k2 v2 output

job.setMapperClass(HdfsMapper.class);

job.setMapOutputKeyClass(Text.class);

job.setMapOutputValueClass(NullWritable.class);

//Step 3: partition

//Step 4: sort

//Step 5: combine (combiner)

//Step 6: grouping

//Step 7: reduce logic, receive K2 v2 and convert it into a new k3 v3 output

TableMapReduceUtil.initTableReducerJob("myuser2",HBaseWriteReducer.class,job);

//Step 8: output data

boolean b = job.waitForCompletion(true);

return b?0:1;

}

public static void main(String[] args) throws Exception {

Configuration configuration = HBaseConfiguration.create();

configuration.set("hbase.zookeeper.quorum", "node01:2181,node02:2181");

int run = ToolRunner.run(configuration, new Hdfs2HBaseMain(), args);

System.exit(run);

}

}

Alternatively, the mapper, reducer, and driver can be combined into a single class:

public class Hdfs2Hbase extends Configured implements Tool{

@Override

public int run(String[] args) throws Exception {

Job job = Job.getInstance(super.getConf(), "hdfs2Hbase");

job.setJarByClass(Hdfs2Hbase.class);

job.setInputFormatClass(TextInputFormat.class);

TextInputFormat.addInputPath(job,new Path("hdfs://node01:8020/hbase/input"));

job.setMapperClass(HdfsMapper.class);

job.setMapOutputKeyClass(Text.class);

job.setMapOutputValueClass(NullWritable.class);

TableMapReduceUtil.initTableReducerJob("myuser2",HBaseReducer.class,job);

job.setNumReduceTasks(1);

boolean b = job.waitForCompletion(true);

return b?0:1;

}

public static void main(String[] args) throws Exception {

Configuration configuration = HBaseConfiguration.create();

int run = ToolRunner.run(configuration, new Hdfs2Hbase(), args);

System.exit(run);

}

public static class HdfsMapper extends Mapper<LongWritable,Text,Text,NullWritable>{

@Override

protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {

context.write(value,NullWritable.get());

}

}

public static class HBaseReducer extends TableReducer<Text,NullWritable,ImmutableBytesWritable>{

@Override

protected void reduce(Text key, Iterable<NullWritable> values, Context context) throws IOException, InterruptedException {

String[] split = key.toString().split("\t");

Put put = new Put(Bytes.toBytes(split[0]));

put.addColumn("f1".getBytes(),"name".getBytes(),split[1].getBytes());

put.addColumn("f1".getBytes(),"age".getBytes(),Bytes.toBytes(Integer.parseInt(split[2])));

context.write(new ImmutableBytesWritable(Bytes.toBytes(split[0])),put);

}

}

}

Requirement 3: bulk load data into HBase

There are many ways to load data into HBase. We can use the HBase Java API or Sqoop to write or import data into HBase, but these methods are either slow or occupy Region resources during the import, resulting in low efficiency. Alternatively, we can use an MR program to convert our data directly into HFile, HBase's final storage format, and then load those HFiles directly into HBase.

Each table in HBase corresponds to a folder under the HBase root directory on HDFS (/hbase), named after the table. Under the table folder, each Region again has its own folder, and under each Region folder every column family also has its own folder, which holds a number of HFile files. HFile is the storage format of HBase data on HDFS, so the final representation of HBase storage files on HDFS is the HFile. If we can convert the data directly into HFile format, HBase can load those HFile files directly.

advantage:

1. The import process does not occupy Region resources

2. It can quickly import massive data

3. Save memory

Normal reading and writing process of HBase data

Use bulkload to directly generate our data into HFile format, and then directly load it into the table of HBase

Requirement: convert the data file /hbase/input/user.txt on HDFS into HFile format, and then load it into the table myuser2

Step 1: define our mapper class

/**
 * LongWritable: k1 type
 * Text: v1 type
 * ImmutableBytesWritable: rowkey
 * Put: the object to insert
 */

public class BulkLoadMapper extends Mapper<LongWritable,Text,ImmutableBytesWritable,Put> {

@Override

protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {

String[] split = value.toString().split("\t");

Put put = new Put(split[0].getBytes());

put.addColumn("f1".getBytes(),"name".getBytes(),split[1].getBytes());

put.addColumn("f1".getBytes(),"age".getBytes(),split[2].getBytes());

context.write(new ImmutableBytesWritable(split[0].getBytes()),put);

}

}

Step 2: develop our main program entry class

public class BulkLoadMain extends Configured implements Tool {

@Override

public int run(String[] args) throws Exception {

Configuration conf = super.getConf();

Connection connection = ConnectionFactory.createConnection(conf);

Table table = connection.getTable(TableName.valueOf("myuser2"));

Job job = Job.getInstance(conf, "bulkLoad");

//Read the file and parse it into key and value pairs

job.setInputFormatClass(TextInputFormat.class);

TextInputFormat.addInputPath(job,new Path("hdfs://node01:8020/hbase/input"));

//Define our mapper class

job.setMapperClass(BulkLoadMapper.class);

job.setMapOutputKeyClass(ImmutableBytesWritable.class);

job.setMapOutputValueClass(Put.class);

//The reduce process is also omitted

/**
 * Job job, Table table, RegionLocator regionLocator
 * configureIncrementalLoad configures which table and column family our HFiles will be loaded into
 */

HFileOutputFormat2.configureIncrementalLoad(job,table,connection.getRegionLocator(TableName.valueOf("myuser2")));

//Set our output type and output our data in HFile format

job.setOutputFormatClass(HFileOutputFormat2.class);

//Set our output path

HFileOutputFormat2.setOutputPath(job,new Path("hdfs://node01:8020/hbase/hfile_out"));

boolean b = job.waitForCompletion(true);

return b?0:1;

}

public static void main(String[] args) throws Exception {

Configuration configuration = HBaseConfiguration.create();

configuration.set("hbase.zookeeper.quorum", "node01:2181,node02:2181");

int run = ToolRunner.run(configuration, new BulkLoadMain(), args);

System.exit(run);

}

}

Step 3: type the code into a jar package and run it

yarn jar original-hbaseStudy-1.0-SNAPSHOT.jar cn.itcast.hbasemr.BulkLoadMain

Step 4: develop code and load data

Load the HFile file under our output path into our hbase table

public class LoadData {

public static void main(String[] args) throws Exception {

Configuration configuration = HBaseConfiguration.create();

configuration.set("hbase.zookeeper.property.clientPort", "2181");

configuration.set("hbase.zookeeper.quorum", "node01,node02,node03");

Connection connection = ConnectionFactory.createConnection(configuration);

Admin admin = connection.getAdmin();

Table table = connection.getTable(TableName.valueOf("myuser2"));

LoadIncrementalHFiles load = new LoadIncrementalHFiles(configuration);

load.doBulkLoad(new Path("hdfs://node01:8020/hbase/hfile_out"), admin,table,connection.getRegionLocator(TableName.valueOf("myuser2")));

}

}

Or we can load data through the command line

First, add the jar package of hbase to the classpath of hadoop

export HBASE_HOME=/export/servers/hbase-2.0.0/

export HADOOP_HOME=/export/servers/hadoop-2.7.5/

export HADOOP_CLASSPATH=$(${HBASE_HOME}/bin/hbase mapredcp)

Then execute the following command to import the HFile of hbase directly into the table myuser2

yarn jar /export/servers/hbase-2.0.0/lib/hbase-mapreduce-2.0.0.jar completebulkload /hbase/hfile_out myuser2

13. Comparison of HBase and hive

Hive

Data warehouse tools

The essence of Hive is a mapping, kept in its metastore (e.g. MySQL), between the files already stored on HDFS and tables, which makes it convenient to manage and query them with HQL.

For data analysis and cleaning

Hive is suitable for offline data analysis and cleaning with high delay

Based on HDFS and MapReduce

The data stored in Hive is still on the DataNode, and the HQL statement written will eventually be converted into MapReduce code for execution.

HBase

nosql database

It is a non-relational database with column-oriented (column family) storage.

Used to store structured and unstructured data

It is suitable for storing single-table, non-relational data. It is not suitable for associative queries such as JOIN.

Based on HDFS

Data is persisted as HFile files, stored on DataNodes, and managed by RegionServers in the form of regions.

Low delay, access to online services

In the face of massive enterprise data, HBase can store huge amounts of data in a single table while providing efficient data access speed.

Summary: Hive and HBase

Hive and HBase are two different technologies based on Hadoop. Hive is a SQL engine that runs MapReduce tasks, while HBase is a NoSQL key/value database on top of Hadoop. The two tools can be used together: just as one uses Google for search and Facebook for social networking, Hive can be used for statistical queries, HBase for real-time queries, and data can also be written from Hive to HBase, or from HBase back to Hive.

14. Integration of hive and HBase

Hive and HBase each have their own strengths and serve different purposes, but in the end the data of both is stored on HDFS. Normally, to save disk space, we do not store the same data in multiple places, which would waste disk space. We can store the data in HBase, then integrate HBase with Hive and analyze the data in HBase directly with SQL statements, which is very convenient.

Requirement 1: save the data of hive analysis results to HBase

Step 1: copy the five dependent jar packages of hbase to the lib directory of hive

Copy the five jar packages of HBase to the lib directory of hive

The jar packages of HBase are in / export/servers/hbase-2.0.0/lib

We need to copy five jar packages with the following names

hbase-client-2.0.0.jar

hbase-hadoop2-compat-2.0.0.jar

hbase-hadoop-compat-2.0.0.jar

hbase-it-2.0.0.jar

hbase-server-2.0.0.jar

We can simply execute the following commands on node03 to make the jar packages available by creating soft links

ln -s /export/servers/hbase-2.0.0/lib/hbase-client-2.0.0.jar /export/servers/apache-hive-2.1.0-bin/lib/hbase-client-2.0.0.jar

ln -s /export/servers/hbase-2.0.0/lib/hbase-hadoop2-compat-2.0.0.jar /export/servers/apache-hive-2.1.0-bin/lib/hbase-hadoop2-compat-2.0.0.jar

ln -s /export/servers/hbase-2.0.0/lib/hbase-hadoop-compat-2.0.0.jar /export/servers/apache-hive-2.1.0-bin/lib/hbase-hadoop-compat-2.0.0.jar

ln -s /export/servers/hbase-2.0.0/lib/hbase-it-2.0.0.jar /export/servers/apache-hive-2.1.0-bin/lib/hbase-it-2.0.0.jar

ln -s /export/servers/hbase-2.0.0/lib/hbase-server-2.0.0.jar /export/servers/apache-hive-2.1.0-bin/lib/hbase-server-2.0.0.jar

Step 2: modify hive's configuration file

Edit the hive configuration file hive-site.xml on the node03 server and add the following two lines of configuration

cd /export/servers/apache-hive-2.1.0-bin/conf

vim hive-site.xml

<property>

<name>hive.zookeeper.quorum</name>

<value>node01,node02,node03</value>

</property>

<property>

<name>hbase.zookeeper.quorum</name>

<value>node01,node02,node03</value>

</property>

Step 3: modify the hive-env.sh configuration file and add the following configuration

cd /export/servers/apache-hive-2.1.0-bin/conf

vim hive-env.sh

export HADOOP_HOME=/export/servers/hadoop-2.7.5

export HBASE_HOME=/export/servers/hbase-2.0.0

export HIVE_CONF_DIR=/export/servers/apache-hive-2.1.0-bin/conf

Step 4: create a table in hive and load the following data

hive intermediate table

Enter hive client

cd /export/servers/apache-hive-2.1.0-bin/

bin/hive

Create the hive database and the database table corresponding to hive

create database course;

use course;

create external table if not exists course.score(id int,cname string,score int) row format delimited fields terminated by '\t' stored as textfile ;

The prepared data are as follows:

node03 executes the following command to prepare the data file

cd /export/

vim hive-hbase.txt

1 zhangsan 80

2 lisi 60

3 wangwu 30

4 zhaoliu 70

Load data

Enter hive client to load data

hive (course)> load data local inpath '/export/hive-hbase.txt' into table score;

hive (course)> select * from score;

Step 5: create hive management table to map with HBase

We can create a hive management table to map with the table in hbase. The data in the hive management table will be stored on hbase

Create internal tables in hive

create table course.hbase_score(id int,cname string,score int)

stored by 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'

with serdeproperties("hbase.columns.mapping" = ":key,cf:name,cf:score")

tblproperties("hbase.table.name" = "hbase_score");

#Insert data through insert overwrite select

insert overwrite table course.hbase_score select id,cname,score from course.score;

Step 6: view the hbase_score table in HBase

Enter the hbase client to view the hbase_score table and its data

hbase(main):023:0> list

TABLE

hbase_score

myuser

myuser2

student

user

5 row(s) in 0.0210 seconds

=> ["hbase_score", "myuser", "myuser2", "student", "user"]

hbase(main):024:0> scan 'hbase_score'

ROW COLUMN+CELL

1 column=cf:name, timestamp=1550628395266, value=zhangsan

1 column=cf:score, timestamp=1550628395266, value=80

2 column=cf:name, timestamp=1550628395266, value=lisi

2 column=cf:score, timestamp=1550628395266, value=60

3 column=cf:name, timestamp=1550628395266, value=wangwu

3 column=cf:score, timestamp=1550628395266, value=30

4 column=cf:name, timestamp=1550628395266, value=zhaoliu

4 column=cf:score, timestamp=1550628395266, value=70

4 row(s) in 0.0360 seconds

Requirement 2: create a Hive external table and map it to an existing table in HBase

Step 1: create a table in HBase and manually insert and load some data

Enter the shell client of HBase, manually create a table, insert and load some data

create 'hbase_hive_score',{ NAME =>'cf'}

put 'hbase_hive_score','1','cf:name','zhangsan'

put 'hbase_hive_score','1','cf:score', '95'

put 'hbase_hive_score','2','cf:name','lisi'

put 'hbase_hive_score','2','cf:score', '96'

put 'hbase_hive_score','3','cf:name','wangwu'

put 'hbase_hive_score','3','cf:score', '97'

The operation was successful and the results are as follows:

hbase(main):049:0> create 'hbase_hive_score',{ NAME =>'cf'}

0 row(s) in 1.2970 seconds

=> Hbase::Table - hbase_hive_score

hbase(main):050:0> put 'hbase_hive_score','1','cf:name','zhangsan'

0 row(s) in 0.0600 seconds

hbase(main):051:0> put 'hbase_hive_score','1','cf:score', '95'

0 row(s) in 0.0310 seconds

hbase(main):052:0> put 'hbase_hive_score','2','cf:name','lisi'

0 row(s) in 0.0230 seconds

hbase(main):053:0> put 'hbase_hive_score','2','cf:score', '96'

0 row(s) in 0.0220 seconds

hbase(main):054:0> put 'hbase_hive_score','3','cf:name','wangwu'

0 row(s) in 0.0200 seconds

hbase(main):055:0> put 'hbase_hive_score','3','cf:score', '97'

0 row(s) in 0.0250 seconds

Step 2: create the external table of hive and map the tables and fields in HBase

Create an external table in hive,

Enter the hive client, and then execute the following command to create the hive external table to map the table data in HBase

CREATE external TABLE course.hbase2hive(id int, name string, score int) STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf:name,cf:score") TBLPROPERTIES("hbase.table.name" ="hbase_hive_score");

15. Pre partition of HBase

1. Why pre partition?

  • Increase data reading and writing efficiency
  • Load balancing to prevent data skew
  • Convenient cluster disaster recovery scheduling region
  • Optimize the number of maps

2. How to pre partition?

Each region maintains a startRowKey and an endRowKey. If newly added data falls within the rowkey range maintained by a region, that data is handed to that region for maintenance, as the sketch below shows.
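
As a quick way to see this routing, the client-side RegionLocator can be asked which region a given rowkey falls into. This is only a minimal sketch against the HBase 2.0 Java API; the table name 'staff' (created in the step below) and the probe rowkey '2500' are placeholders.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HRegionLocation;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.RegionLocator;
import org.apache.hadoop.hbase.util.Bytes;

public class RegionLookup {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        conf.set("hbase.zookeeper.quorum", "node01:2181,node02:2181,node03:2181");
        try (Connection connection = ConnectionFactory.createConnection(conf);
             RegionLocator locator = connection.getRegionLocator(TableName.valueOf("staff"))) {
            // Ask which region the rowkey '2500' falls into: its [startKey, endKey) range and its server
            HRegionLocation location = locator.getRegionLocation(Bytes.toBytes("2500"));
            System.out.println("start key : " + Bytes.toStringBinary(location.getRegion().getStartKey()));
            System.out.println("end key   : " + Bytes.toStringBinary(location.getRegion().getEndKey()));
            System.out.println("served by : " + location.getServerName());
        }
    }
}
```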

3. How to set the pre partition?

1. Manually specify pre partition

hbase(main):001:0> create 'staff','info','partition1',SPLITS => ['1000','2000','3000','4000']

After completion, as shown in the figure:

2. Generate pre partition using hexadecimal algorithm

hbase(main):003:0> create 'staff2','info','partition2',{NUMREGIONS => 15, SPLITALGO => 'HexStringSplit'}

After completion, as shown in the figure:

3. Creating pre partitions using java APIs

The Java code is as follows:

```java
/**
 * Create an HBase table with pre-split regions through the Java API
 */

@Test

public void hbaseSplit() throws IOException {

//Get connection

Configuration configuration = HBaseConfiguration.create();

configuration.set("hbase.zookeeper.quorum", "node01:2181,node02:2181,node03:2181");

Connection connection = ConnectionFactory.createConnection(configuration);

Admin admin = connection.getAdmin();

//Customize the algorithm to generate a series of Hash hash values, which are stored in a two-dimensional array

byte[][] splitKeys = {{1,2,3,4,5},{'a','b','c','d','e'}};

//HTableDescriptor is used to set the parameters of our table, including table name, column family and so on

HTableDescriptor hTableDescriptor = new HTableDescriptor(TableName.valueOf("staff3"));

//Add column family

hTableDescriptor.addFamily(new HColumnDescriptor("f1"));

//Add column family

hTableDescriptor.addFamily(new HColumnDescriptor("f2"));

admin.createTable(hTableDescriptor,splitKeys);

admin.close();

}
```

16. rowKey design skills of HBase

HBase is stored in three-dimensional order. The data in HBase can be quickly located through three dimensions: rowkey (row key), column key (column family and qualifier) and TimeStamp (TimeStamp).

A rowkey in HBase uniquely identifies a row of records. There are several ways to query in HBase (a short sketch follows the list):

  1. Through the get method, specify the rowkey to obtain a unique record
  2. Set the startRow and stopRow parameters in scan mode for range matching
  3. Full table scan, that is, directly scan all row records in the whole table
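
A minimal sketch of the three access patterns above, written against the HBase 2.0 client API; the table name myuser and the sample rowkeys are placeholders, not values prescribed here.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class RowkeyAccessPatterns {
    public static void main(String[] args) throws IOException {
        Configuration conf = HBaseConfiguration.create();
        conf.set("hbase.zookeeper.quorum", "node01:2181,node02:2181,node03:2181");
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("myuser"))) {
            // 1. get: a single row addressed directly by its rowkey
            Result one = table.get(new Get(Bytes.toBytes("0001")));
            System.out.println("get -> " + Bytes.toString(one.getRow()));

            // 2. scan with startRow/stopRow: a rowkey range (the stop row is exclusive)
            Scan range = new Scan().withStartRow(Bytes.toBytes("0001")).withStopRow(Bytes.toBytes("0005"));
            try (ResultScanner scanner = table.getScanner(range)) {
                for (Result r : scanner) {
                    System.out.println("range scan -> " + Bytes.toString(r.getRow()));
                }
            }

            // 3. full table scan: no start/stop row, every region is touched
            try (ResultScanner full = table.getScanner(new Scan())) {
                for (Result r : full) {
                    // process every row in the table
                }
            }
        }
    }
}
```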

1. Rowkey length principle

rowkey is a binary byte stream and can be any string, with a maximum length of 64 KB. In practical applications it is generally 10~100 bytes, saved as a byte[], and usually designed to have a fixed length.

It is recommended to be as short as possible, not more than 16 bytes, for the following reasons:

  • The data persistence file HFile is stored as KeyValues. If the rowkey is too long, for example 100 bytes, then for 10 million rows the rowkeys alone occupy 100 bytes * 10 million = 1 billion bytes, nearly 1 GB of data, which greatly reduces the storage efficiency of HFile;
  • MemStore caches some data into memory. If the rowkey field is too long, the effective utilization of memory will be reduced. The system cannot cache more data, which will reduce the retrieval efficiency.

2 rowkey hash principle

If the rowkey is incremented by timestamp, do not put the time in front of the binary code. It is recommended to use the high bit of the rowkey as a hash field, which is randomly generated by the program, and the low bit as a time field. This will improve the probability of data balanced distribution in each RegionServer to achieve load balancing. If there is no hash field, the first field is the time information directly, and all data will be concentrated on one RegionServer. In this way, the load will be concentrated on individual regionservers during data retrieval, causing hot issues and reducing the query efficiency.

3 rowkey uniqueness principle

Its uniqueness must be guaranteed by design. Rowkeys are stored in dictionary (lexicographic) order. Therefore, when designing rowkeys, we should make full use of this sorting characteristic: store frequently read data together, and put recently accessed data together.

4 what are hot spots

The rows in HBase are sorted according to the dictionary order of rowkey. This design optimizes the scan operation and can access the relevant rows and the rows that will be read together in a nearby position for scan. However, poor rowkey design is the source of hot spots.

Hot spots occur when a large number of clients directly access one or a few nodes of the cluster (the access may be reads, writes, or other operations). The large volume of access causes the single machine hosting the hot region to exceed its capacity, degrading performance or even making the region unavailable. It also affects the other regions on the same RegionServer, because that host can no longer serve their requests.

A good data access pattern makes full and balanced use of the cluster. To avoid write hotspots, rowkeys should be designed so that the data is spread out: when there is a lot of data, it should be written to multiple regions of the cluster rather than to a single one. Below are some common ways to avoid hotspots, with their advantages and disadvantages:

1 add salt

The salt here is not the salt of cryptography; it means adding a random number in front of the rowkey, specifically assigning a random prefix to the rowkey so that it no longer sorts next to the rowkeys written before it. The number of distinct prefixes should match the number of regions you want the data scattered across. After salting, rowkeys are spread to the different regions according to their randomly generated prefixes, avoiding hotspots.
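
A minimal sketch of salting, assuming four target regions; the bucket count, the separator and the helper name are illustrative choices only. Because the prefix is random, a later read that only knows the original key has to check every bucket.

```java
import java.util.Random;

public class SaltedRowkey {
    private static final int BUCKETS = 4;           // should match the number of pre-split regions
    private static final Random RANDOM = new Random();

    // Prepend a random bucket prefix so that consecutive writes spread across regions
    public static String salt(String originalRowkey) {
        int bucket = RANDOM.nextInt(BUCKETS);
        return bucket + "_" + originalRowkey;
    }

    public static void main(String[] args) {
        System.out.println(salt("20190220_user0001"));  // e.g. 2_20190220_user0001
    }
}
```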

2 hash

Hashing gives the same row the same prefix every time. Hashing can also spread the load across the cluster, but reads remain predictable: using a deterministic hash lets the client reconstruct the complete rowkey and use a get operation to fetch a single row precisely.
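
A minimal sketch of the hash-prefix variant; the MD5 digest, bucket count and helper name are illustrative assumptions. Since the prefix is derived from the key itself, a client that knows the original key can recompute the full rowkey and still use get() for point lookups.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

public class HashedRowkey {
    private static final int BUCKETS = 4;

    // The same original key always yields the same prefix, unlike random salting
    public static String withHashPrefix(String originalRowkey) throws Exception {
        byte[] digest = MessageDigest.getInstance("MD5")
                .digest(originalRowkey.getBytes(StandardCharsets.UTF_8));
        int bucket = (digest[0] & 0xFF) % BUCKETS;
        return bucket + "_" + originalRowkey;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(withHashPrefix("20190220_user0001"));
    }
}
```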

3 reverse

The third way to prevent hotspots is to reverse rowkeys that have a fixed length or a numeric format. This puts the frequently changing part of the rowkey (the least meaningful part) first. It effectively randomizes the rowkeys, but at the expense of their ordering.

A typical example is using a mobile phone number as the rowkey: the reversed phone number is used as the rowkey instead, avoiding the hotspots caused by rowkeys that all start with the same fixed phone-number prefix.
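
A minimal sketch of the phone-number reversal just described; the sample number is made up.

```java
public class ReversedRowkey {
    // Reverse the phone number so rowkeys no longer share the same fixed prefix
    public static String reverse(String phone) {
        return new StringBuilder(phone).reverse().toString();
    }

    public static void main(String[] args) {
        System.out.println(reverse("13812345678"));  // prints 87654321831
    }
}
```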

4 timestamp reversal

A common data processing problem is quickly obtaining the latest version of the data, and using a reversed timestamp as part of the rowkey is very useful for this. You can append Long.MAX_VALUE - timestamp to the end of the key, e.g. [key][reverse_timestamp]. The latest value of [key] is then the first record returned by scan [key], because rowkeys in HBase are ordered and the first record is the most recently written one.
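
A minimal sketch of the [key][reverse_timestamp] layout; the zero-padding width and the separator are illustrative choices that keep the lexicographic order of the rowkeys consistent with the numeric order of the reversed timestamps.

```java
public class ReverseTimestampRowkey {
    // Newer versions sort first, because rowkeys are ordered lexicographically
    public static String rowkeyFor(String key, long timestamp) {
        long reverseTs = Long.MAX_VALUE - timestamp;
        // Long.MAX_VALUE has 19 digits, so pad to 19 characters
        return key + "_" + String.format("%019d", reverseTs);
    }

    public static void main(String[] args) {
        System.out.println(rowkeyFor("user0001", System.currentTimeMillis()));
    }
}
```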

Other suggestions:

Minimize the size of row keys and column families. In HBase, value is always transmitted with its key. When a specific value is transmitted between systems, its rowkey, column name and timestamp will also be transmitted together. If your rowkey and column name are large, they will take up a lot of storage space at this time.

The column family should be as short as possible, preferably one character.

Long attribute names are readable, but shorter attribute names are better stored in HBase.

17. HBase coprocessor

http://hbase.apache.org/book.html#cp

1. Origin

HBase as a column-family database is most often criticized for the following: it is hard to easily build a "secondary index", and operations such as summing, counting and sorting are difficult. For example, in older versions (< 0.92) of HBase, the total row count of a table could only be obtained by using the Counter method and running a MapReduce job. Although HBase integrates MapReduce at the data storage layer and can be used effectively for distributed computation over its tables, in many cases, for simple additions or aggregations, pushing the computation to the server side can reduce communication overhead and yield a good performance improvement. Therefore, HBase introduced coprocessors after 0.92, which make some exciting new features possible: easily building secondary indexes, complex filtering (predicate pushdown), and access control.

2. There are two coprocessors: observer and endpoint

(1) Observer is similar to a trigger in a traditional database: when certain events occur, this kind of coprocessor is called by the server side. Observer coprocessors are hooks scattered through the HBase server-side code that are invoked when specific events occur. For example, the prePut hook function is called by the RegionServer before a put operation is executed, and the postPut hook function is called after the put operation (a minimal skeleton is sketched after the interface list below).

Taking HBase 2.0.0 as an example, it provides three observer interfaces:
● RegionObserver: provide data manipulation event hooks of the client: Get, Put, Delete, Scan, etc.
● WALObserver: provide WAL related operation hooks.
● MasterObserver: provide DDL type operation hook. Such as creating, deleting and modifying data tables.
A RegionServerObserver was also added as of version 0.96.
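
A minimal skeleton of a RegionObserver exposing the prePut/postPut hooks mentioned above, written against the HBase 2.0 coprocessor API; the class name is arbitrary, and the full working example in the practice section below follows the same shape.

```java
import java.io.IOException;
import java.util.Optional;
import org.apache.hadoop.hbase.client.Durability;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.coprocessor.ObserverContext;
import org.apache.hadoop.hbase.coprocessor.RegionCoprocessor;
import org.apache.hadoop.hbase.coprocessor.RegionCoprocessorEnvironment;
import org.apache.hadoop.hbase.coprocessor.RegionObserver;
import org.apache.hadoop.hbase.wal.WALEdit;

public class PutHookObserver implements RegionObserver, RegionCoprocessor {

    // In HBase 2.0 the coprocessor must hand itself out as a RegionObserver here
    @Override
    public Optional<RegionObserver> getRegionObserver() {
        return Optional.of(this);
    }

    @Override
    public void prePut(ObserverContext<RegionCoprocessorEnvironment> c, Put put,
                       WALEdit edit, Durability durability) throws IOException {
        // called by the RegionServer before the Put is applied to the region
    }

    @Override
    public void postPut(ObserverContext<RegionCoprocessorEnvironment> c, Put put,
                        WALEdit edit, Durability durability) throws IOException {
        // called by the RegionServer after the Put has been applied
    }
}
```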

The following figure illustrates the principle of this coprocessor with RegionObserver as an example:

(2) Endpoint coprocessors are similar to stored procedures in traditional databases. Clients can call these endpoint coprocessors to execute a piece of server-side code and have its result returned to the client for further processing. The most common use is aggregation. Without a coprocessor, when the user needs to find the largest value in a table, that is, a max aggregation, a full table scan must be performed, the scan results traversed in client code, and the maximum computed there. Such an approach cannot exploit the concurrency of the underlying cluster; concentrating all computation on the client is bound to be inefficient. With a coprocessor, the user can deploy the max-value code to the HBase server side, and HBase will use multiple nodes of the underlying cluster to compute the maximum concurrently: the code runs inside each Region, each RegionServer computes the maxima of its Regions and returns only those max values to the client, and the client then processes the per-Region maxima to find the overall maximum.
In this way, the overall efficiency is greatly improved.
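
For reference, a hedged sketch of what the client-side call can look like when the built-in AggregateImplementation endpoint (from the hbase-endpoint module, see the loading section below) is already loaded on the table; the table and column names are placeholders, and the column values are assumed to be stored as 8-byte longs so that LongColumnInterpreter can read them.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.coprocessor.AggregationClient;
import org.apache.hadoop.hbase.client.coprocessor.LongColumnInterpreter;
import org.apache.hadoop.hbase.util.Bytes;

public class MaxSalaryClient {
    public static void main(String[] args) throws Throwable {
        Configuration conf = HBaseConfiguration.create();
        conf.set("hbase.zookeeper.quorum", "node01:2181,node02:2181,node03:2181");
        Scan scan = new Scan();
        // only the column whose long values we want to aggregate
        scan.addColumn(Bytes.toBytes("f1"), Bytes.toBytes("salary"));
        AggregationClient aggregationClient = new AggregationClient(conf);
        // the max is computed inside each Region; only the per-Region maxima travel back
        Long max = aggregationClient.max(TableName.valueOf("myuser"), new LongColumnInterpreter(), scan);
        System.out.println("max salary = " + max);
        aggregationClient.close();
    }
}
```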
The following figure shows how EndPoint works:

(3) Summary

Observer allows the cluster to behave differently during normal client operations
Endpoint allows you to extend the capabilities of the cluster and open new computing commands to client applications
observer is similar to the trigger in RDBMS. It mainly works on the server
endpoint is similar to the stored procedure in an RDBMS; its code runs on the server side (RegionServer) and is invoked from the client
observer can realize such functions as permission management, priority setting, monitoring, ddl control, secondary index, etc
endpoint can realize min, max, avg, sum, distinct, group by and other functions

3. Coprocessor loading mode

There are two loading methods for coprocessors, which we call Static Load and Dynamic Load. Statically loaded coprocessors are called system coprocessors, and dynamically loaded coprocessors are called table coprocessors
1. Static loading

By modifying the hbase-site.xml file, you can start the global aggregation and manipulate the data on all tables. Just add the following code:

<property>

<name>hbase.coprocessor.user.region.classes</name>

<value>org.apache.hadoop.hbase.coprocessor.AggregateImplementation</value>

</property>

This loads one coprocessor class for all tables; multiple classes can be loaded, separated by commas

2. Dynamic loading

Enable table aggregation so that it takes effect only for a specific table. It is implemented through the HBase shell.
First disable the target table: hbase> disable 'mytable'
Add the aggregation coprocessor:

hbase> alter 'mytable', METHOD => 'table_att','coprocessor'=>
'|org.apache.hadoop.hbase.coprocessor.AggregateImplementation||'
#Re-enable the table
hbase> enable 'mytable'

Coprocessor uninstall

4. Coprocessor Observer application practice

Through an Observer coprocessor, whenever data is inserted into one HBase table, copy that data into another table, keeping only some of the columns of the first table and saving them into the second table

Step 1: create the first table proc1 in HBase

Create a table in HBase named proc1 with only one column family, info

cd /export/servers/hbase-2.0.0/

bin/hbase shell

hbase(main):053:0> create 'proc1','info'

Step 2: create the second table proc2 in Hbase

Create the second table 'proc2' as the target table; the coprocessor will copy some of the columns inserted into the first table into the 'proc2' table

hbase(main):054:0> create 'proc2','info'

Step 3: develop HBase coprocessor

Develop the coprocessor for HBase

public class MyProcessor implements RegionObserver, RegionCoprocessor {

    static Connection connection = null;
    static Table table = null;

    static {
        Configuration conf = HBaseConfiguration.create();
        conf.set("hbase.zookeeper.quorum", "node01:2181");
        try {
            connection = ConnectionFactory.createConnection(conf);
            table = connection.getTable(TableName.valueOf("proc2"));
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    private RegionCoprocessorEnvironment env = null;
    private static final String FAMAILLY_NAME = "info";
    private static final String QUALIFIER_NAME = "name";

    //In HBase 2.0 this method must be added, otherwise the coprocessor will not take effect
    @Override
    public Optional<RegionObserver> getRegionObserver() {
        //Extremely important: make sure the coprocessor is invoked as a RegionObserver
        return Optional.of(this);
    }

    @Override
    public void start(CoprocessorEnvironment e) throws IOException {
        env = (RegionCoprocessorEnvironment) e;
    }

    @Override
    public void stop(CoprocessorEnvironment e) throws IOException {
        // nothing to do here
    }

    /**
     * Override the prePut method to intercept the data before it is inserted
     * @param e
     * @param put the Put object encapsulates the data we are inserting into the target table
     * @param edit
     * @param durability
     * @throws IOException
     */
    @Override
    public void prePut(final ObserverContext<RegionCoprocessorEnvironment> e,
                       final Put put, final WALEdit edit, final Durability durability) throws IOException {
        try {
            //Get the rowkey of the inserted data through the Put object
            byte[] rowBytes = put.getRow();
            String rowkey = Bytes.toString(rowBytes);
            //Get the value of the name field
            List<Cell> list = put.get(Bytes.toBytes(FAMAILLY_NAME), Bytes.toBytes(QUALIFIER_NAME));
            if (list == null || list.size() == 0) {
                return;
            }
            //Get the cell corresponding to the info column family and the name column
            Cell cell2 = list.get(0);
            //Get the data value
            String nameValue = Bytes.toString(CellUtil.cloneValue(cell2));
            //Create a Put object and insert the data into the proc2 table
            Put put2 = new Put(rowkey.getBytes());
            put2.addColumn(Bytes.toBytes(FAMAILLY_NAME), Bytes.toBytes(QUALIFIER_NAME), nameValue.getBytes());
            table.put(put2);
            table.close();
        } catch (Exception e1) {
            return;
        }
    }
}

Step 4: type the project into a jar package and upload it to HDFS

Package our coprocessor into a jar. There is no need to use any packaging plug-in here. Then upload the jar to HDFS.

Upload the jar package to the / export/servers path of linux

cd /export/servers

mv original-hbase-1.0-SNAPSHOT.jar processor.jar

hdfs dfs -mkdir -p /processor

hdfs dfs -put processor.jar /processor

Step 5: mount the jar package into the proc1 table

hbase(main):056:0> describe 'proc1'

hbase(main):055:0> alter 'proc1',METHOD => 'table_att','Coprocessor'=>'hdfs://node01:8020/processor/processor.jar|cn.itcast.hbasemr.demo4.MyProcessor|1001|'

Describe the 'proc1' table again:

hbase(main):043:0> describe 'proc1'

You can see that our coprocessor has been loaded

Step 6: add data to proc1 table

Enter the HBase shell client, and then directly execute the following command to add data to the proc1 table

put 'proc1','0001','info:name','zhangsan'

put 'proc1','0001','info:age','28'

put 'proc1','0002','info:name','lisi'

put 'proc1','0002','info:age','25'

Add data to the proc1 table, and then

scan 'proc2'

We will find that data is also inserted in the proc2 table, and there are only info column family and name column

Note: if you need to uninstall our coprocessor, enter the shell command line of hbase and execute the following commands

disable 'proc1'

alter 'proc1',METHOD=>'table_att_unset',NAME=>'coprocessor$1'

enable 'proc1'

18. Basic introduction of secondary index in HBase

Since the query capability of HBase is relatively weak, complex statistical requirements such as select name,salary,count(1),max(salary) from user group by name,salary order by salary are basically impossible, or very difficult, to implement. When we use HBase, we therefore generally adopt a secondary index scheme.

The primary index of HBase is the rowkey, and we can only retrieve data by rowkey. If we want to run combined queries on the columns of HBase column families, we need an HBase secondary index scheme to support multi-condition queries. Common schemes include:

  1. MapReduce scheme

  2. ITHBASE (indexed transitional HBase) scheme

  3. IHBASE (Index HBase) scheme

  4. HBase coprocessor scheme

  5. Solr+hbase scheme

  6. CCIndex (comprehensive clustering index) scheme

Common secondary indexes can be implemented in various other ways, such as Phoenix, solr or ES

19. HBase integration hue

1. Introduction to Hue

HUE=Hadoop User Experience

When there is no HUE, if we want to view the status of each component of the Hadoop ecosystem, we can use their webconsole addresses:

**HDFS: NameNode Webpage http://ip:50070**

**SecondaryNameNode Webpage: http://ip:50090**

**Yarn: http://ip:8088**

**HBase: http://ip:16010**

**Hive http://ip:9999/hwi/**

**Spark http://ip:8080**

It's certainly possible to check one by one, but... It's time-consuming and a little troublesome. HUE is the integration of these. You can view the status of all the above components and carry out some operations in one place of HUE.

Hue is an open source Apache Hadoop UI system, which evolved from Cloudera Desktop. Finally, Cloudera company contributed it to the Hadoop community of Apache foundation, which is implemented based on the Python Web framework Django.

By using Hue, we can interact with Hadoop cluster on the browser side Web console to analyze and process data, such as operating data on HDFS, running MapReduce Job, executing Hive SQL statement, browsing HBase database, etc.

HUE link

Hue's architecture

Core functions

  • SQL editor, support Hive, Impala, MySQL, Oracle, PostgreSQL, SparkSQL, Solr SQL, Phoenix
  • Various charts of search engine Solr
  • Friendly interface support for Spark and Hadoop
  • Support the scheduling system Apache Oozie, which can edit and view workflow

These functions provided by HUE are friendlier than the native interfaces provided by the various components of the Hadoop ecosystem, but some scenarios that require debugging may still need the native systems to dig deeper into the cause of an error.

When viewing Oozie workflow in HUE, you can also easily see the DAG diagram of the whole workflow. However, the DAG diagram has been removed in the latest version. You can only see the action list in workflow and the jump relationship between them. Those who want to see the DAG diagram can still use oozie's native interface system.

1. Access HDFS and file browsing

2. Through web debugging and development, hive and data result display

3. Query solr and result display, report generation

4. Debug and develop impala interactive SQL Query through web

5. spark debugging and development

7. oozie task development, monitoring, and workflow coordination and scheduling

8. Hbase data query and modification, data display

9. Hive metadata query

10. MapReduce task progress viewing and log tracking

11. Create and submit MapReduce, Streaming and Java job tasks

12. Development and debugging of Sqoop2

13. Browsing and editing of Zookeeper

14. Query and display of databases (MySQL, PostGres, SQlite, Oracle)

One sentence summary: Hue is a friendly interface integration framework, which can integrate various frameworks we have learned and will learn. One interface can view and execute all frameworks

2. Environmental preparation and installation of Hue

Hue can be installed in many ways, including rpm package, tar.gz package and cloudera manager. We use tar.gz package to install here

Step 1: Download dependent packages

The node03 server executes the following command to download dependent packages online

yum install ant asciidoc cyrus-sasl-devel cyrus-sasl-gssapi cyrus-sasl-plain gcc gcc-c++ krb5-devel libffi-devel libxml2-devel libxslt-devel make mysql mysql-devel openldap-devel python-devel sqlite-devel gmp-devel openssl-devel -y

Step 2: install and configure maven

To compile hue, you need to use maven to download some other jar packages. Here we can install maven on the node03 server

node03 execute the following command to install maven

wget http://repos.fedorapeople.org/repos/dchen/apache-maven/epel-apache-maven.repo -O /etc/yum.repos.d/epel-apache-maven.repo

sed -i s/\$releasever/6/g /etc/yum.repos.d/epel-apache-maven.repo

yum install -y apache-maven

mvn --version

Configure the download address of maven

vim /etc/maven/settings.xml

<mirror>

<id>alimaven</id>

<mirrorOf>central</mirrorOf>

<name>aliyun maven</name>

<url>http://maven.aliyun.com/nexus/content/groups/public/</url>

</mirror>

<mirror>

<id>ui</id>

<mirrorOf>central</mirrorOf>

<name>Human Readable Name for this Mirror.</name>

<url>http://uk.maven.org/maven2/</url>

</mirror>

<mirror>

<id>jboss-public-repository-group</id>

<mirrorOf>central</mirrorOf>

<name>JBoss Public Repository Group</name>

<url>http://repository.jboss.org/nexus/content/groups/public</url>

</mirror>

Step 3: add ordinary users to the linux operating system

For the installation of hue, you must add a normal user hue. Otherwise, an error is reported at startup. You can directly add a normal user to the node03 server

useradd hue

passwd hue

Step 4: download the hue compressed package and upload and unzip it

The node03 server executes the following command to download and unzip the installation package

cd /export/softwares

wget http://gethue.com/downloads/releases/4.0.1/hue-4.0.1.tgz

tar -zxf hue-4.0.1.tgz -C /export/servers/

Step 5: modify the configuration file

Modify hue's configuration file hue.ini

node03 executes the following command to modify hue's configuration file

cd /export/servers/hue-4.0.1/desktop/conf/

vim hue.ini

#General configuration

[desktop]

secret_key=jFE93j;2[290-eiw.KEiwN2s3['d;/.q[eIW^y#e=+Iei*@Mn<qW5o

http_host=node03.hadoop.com

time_zone=Asia/Shanghai

server_user=root

server_group=root

default_user=root

default_hdfs_superuser=root

#The configuration uses mysql as the storage database of hue, which is about 561 lines of hue.ini

[[database]]

engine=mysql

host=node03.hadoop.com

port=3306

user=root

password=123456

name=hue

Step 6: create a mysql database

Enter the mysql client and create the mysql database

mysql -uroot -p

Create the hue database

create database hue default character set utf8 default collate utf8_general_ci;

Step 7: compile hue

node03 executes the following command to compile

cd /export/servers/hue-4.0.1

make apps

Note: if the compilation fails, it needs to be compiled again. Compile several times more, and the network speed is fast enough to pass the compilation

make clean

make apps

Step 8: start the hue service and access the page

node03 executes the following command to start the service

cd /export/servers/hue-4.0.1

build/env/bin/supervisor

Page access

http://node03:8888

For the first access, you need to set the administrator user and password

The administrator user name and password here should, as far as possible, be consistent with the user name and password used for our hadoop installation. The user name and password used for the hadoop installation are root and 123456 respectively, so we log in for the first time as the root user with the password 123456.

After logging in, we find that the hue page reports an error. This error is mainly related to hive: because hue has not yet been configured to integrate with hive, we need to configure the integration of hue with hive and hadoop. Next, let's see how to integrate hue with hive and hadoop.

3. Integration of hue with other frameworks

3.1. HDFS and yarn integration of hue and hadoop

Step 1: change the core-site.xml configuration of all hadoop nodes

Remember to restart the hdfs and yarn clusters after changing the core-site.xml

Three machines change core-site.xml

<property>

<name>hadoop.proxyuser.root.hosts</name>

<value>*</value>

</property>

<property>

<name>hadoop.proxyuser.root.groups</name>

<value>*</value>

</property>

Step 2: change the hdfs-site.xml of all hadoop nodes

<property>

<name>dfs.webhdfs.enabled</name>

<value>true</value>

</property>

Step 3: restart hadoop cluster

Execute the following command on the node01 machine

cd /export/servers/hadoop-2.7.5

sbin/stop-dfs.sh

sbin/start-dfs.sh

sbin/stop-yarn.sh

sbin/start-yarn.sh

Step 4: stop the service of hue and continue to configure hue.ini

cd /export/servers/hue-4.0.1/desktop/conf

vim hue.ini

Configure our hue to integrate with hdfs

[[hdfs_clusters]]

[[[default]]]

fs_defaultfs=hdfs://node01.hadoop.com:8020

webhdfs_url=http://node01.hadoop.com:50070/webhdfs/v1

hadoop_hdfs_home=/export/servers/hadoop-2.7.5

hadoop_bin=/export/servers/hadoop-2.7.5/bin

hadoop_conf_dir=/export/servers/hadoop-2.7.5/etc/hadoop

Configure our hue to integrate with yarn

[[yarn_clusters]]

[[[default]]]

resourcemanager_host=node01

resourcemanager_port=8032

submit_to=True

resourcemanager_api_url=http://node01:8088

history_server_api_url=http://node01:19888

3.2. Configure hue and hive integration

If we need to configure the integration of hue and hive, we need to start hiveserver2 service of hive

Change hue's configuration hue.ini

Modify hue.ini

[beeswax]

hive_server_host=node03.hadoop.com

hive_server_port=10000

hive_conf_dir=/export/servers/apache-hive-2.1.0-bin/conf

server_conn_timeout=120

auth_username=root

auth_password=123456

[metastore]

#You can use hive to create database tables and other operations

enable_new_create_table=true

Start the hiveserver2 service of hive

Go to node03 and start hive's hiveserver2 service

cd /export/servers/apache-hive-2.1.0-bin

nohup bin/hive --service hiveserver2 &

Restart hue, and then you can operate hive through the browser page

3.5 integration of hue and HBase

Step 1: modify hue.ini

cd /export/servers/hue-4.0.1/desktop/conf

vim hue.ini

[hbase]

hbase_clusters=(Cluster|node01:9090)

hbase_conf_dir=/export/servers/hbase-2.0.0/conf

Step 2: start the thrift server service of hbase

The first machine starts the thriftserver of hbase by executing the following command

cd /export/servers/hbase-2.0.0

bin/hbase-daemon.sh start thrift

Step 3: start hue

The third machine executes the following command to start hue

cd /export/servers/hue-4.0.1/

build/env/bin/supervisor

Step 4: page access

http://node03:8888/hue/

20. HBase tuning

1. General optimization

1. The metadata backup of NameNode uses SSD

2. The metadata on the NameNode is backed up regularly, hourly or daily. If the data is extremely important, it can be backed up every 5 ~ 10 minutes. The backup can copy the metadata directory through the scheduled task.

3. Specify multiple metadata directories for NameNode, using dfs.name.dir or dfs.namenode.name.dir. One specifies the local disk and one specifies the network disk. This can provide redundancy and robustness of metadata to avoid failure.

4. Set dfs.namenode.name.dir.restore to true to allow you to attempt to restore the dfs.namenode.name.dir directory that failed before. Do this when creating a checkpoint. If multiple disks are set, it is recommended to allow it.

5. The NameNode node must be configured as a RAID1 (mirror disk) structure.

6. Supplement: what are Raid0, Raid0+1, Raid1 and Raid5

Standalone

The most common single disk storage method.

Cluster

Cluster storage is a storage method of distributing data to each node in the cluster, providing a single user interface and interface, so that users can easily use and manage all data in a unified way.

Hot swap

Users can remove and replace the hard disk without shutting down the system or cutting off the power supply, so as to improve the recovery ability, expansibility and flexibility of the system.

Raid0

Raid0 is the most powerful storage array of all raid types. Its working principle is to access continuous data on multiple disks. In this way, when data needs to be accessed, multiple disks can execute side by side, and each disk executes its own part of data requests, which significantly improves the overall access performance of the disk. However, it does not have fault tolerance and is suitable for desktop systems with low cost and low reliability.

Raid1

Also known as mirror disk, it mirrors the data of one disk to another disk, adopts mirror fault tolerance to improve reliability, and has the highest data redundancy capability in raid. When storing data, the data will be written into the mirror disk at the same time, and the read data will only be read from the working disk. In case of failure, the system will read data from the mirror disk, and then restore the correct data of the working disk. This array is highly reliable, but its capacity will be reduced by half. It is widely used in applications with strict data requirements, such as commercial finance, file management and other fields. Only one hard disk is allowed to fail.

Raid0+1

It combines Raid0 and Raid1 technologies, taking advantage of both: while the data is protected, strong storage performance is also provided. However, at least 4 hard disks are required, and only one disk failure is tolerated. It is a "three-high" approach: high performance, high reliability, high cost.

Raid5

RAID5 can be seen as a low-cost solution for Raid0+1. The array mode of cyclic even check independent access is adopted. The data and the corresponding parity information are distributed and stored on each disk constituting RAID5. When one of the disk data is damaged, use the remaining disk and corresponding parity information to recover / generate the lost data without affecting the data availability. At least 3 or more hard disks are required. It is suitable for operation with large amount of data. Array mode with slightly higher cost, strong storage and strong reliability.

There are other ways of RAID, please check it yourself.

7. Keep enough space in the NameNode log directory to help you find problems.

8. Because Hadoop is an IO intensive framework, try to improve the storage speed and throughput (similar to bit width).

2. Linux optimization

1. Enabling the read ahead cache of the file system can improve the reading speed

$ sudo blockdev --setra 32768 /dev/sda

(Tip: ra is short for readahead)

2. Turn off process sleep pool

$ sudo sysctl -w vm.swappiness=0

3. Adjust the ulimit upper limit. The default value is a relatively small number

$ ulimit -n    view the maximum number of files a user is allowed to open

$ ulimit -u    view the maximum number of user processes allowed

Modification:

$sudo vi /etc/security/limits.conf modify the limit on the number of open files

Add at the end:

* soft nofile 1024000

* hard nofile 1024000

hive - nofile 1024000

hive - nproc 1024000

$sudo vi /etc/security/limits.d/20-nproc.conf modify the limit on the number of processes opened by the user

Amend to read:

#* soft nproc 4096

#root soft nproc unlimited

* soft nproc 40960

root soft nproc unlimited

4. Turn on the time synchronization NTP of the cluster. Please refer to the previous document

5. Update the system patch (prompt: before updating the patch, please test the compatibility of the new version patch to the cluster nodes)

3. HDFS optimization (HDFS site. XML)

1. Ensure that RPC calls will have a large number of threads

Attribute: dfs.namenode.handler.count

Explanation: this attribute is the default number of threads of NameNode service. The default value is 10. It can be adjusted to 50 ~ 100 according to the available memory of the machine

Attribute: dfs.datanode.handler.count

Explanation: the default value of this attribute is 10, which is the number of processing threads of the DataNode. If the HDFS client program issues many read and write requests, it can be adjusted to 15~20. The larger the value, the more memory is consumed; do not set it too high. For general workloads, 5~10 is enough.

2. Adjustment of the number of copies

Attribute: dfs.replication

Explanation: if the amount of data is huge and not particularly important, it can be set to 2~3. If the data is very important, set it to 3~5.

3. Adjustment of file block size

Attribute: dfs.blocksize

Explanation: defines the block size. This attribute should be set according to the typical size of the files being stored. If most individual files are smaller than 100M, a 64M block size is recommended; for files larger than 100M or reaching the GB range, 256M is recommended. In general the value ranges between 64M and 256M.
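A combined hdfs-site.xml sketch for items 2 and 3; the replica count and block size below are examples only and must be chosen based on how important the data is and how large typical files are:

<!-- hdfs-site.xml (illustrative values) -->
<property>
<name>dfs.replication</name>
<value>3</value> <!-- 2~3 for huge, less critical data; 3~5 for critical data -->
</property>
<property>
<name>dfs.blocksize</name>
<value>134217728</value> <!-- 128M here; choose between 64M and 256M based on typical file size -->
</property>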

4. MapReduce optimization (mapred-site.xml)

1. Adjust the number of Job task service threads

Attribute: mapreduce.jobtracker.handler.count

Explanation: this attribute is the number of JobTracker handler threads for job tasks. The default value is 10; it can be adjusted to 50~100 depending on the machine's available memory.

2. Number of Http server worker threads

Attribute: mapreduce.tasktracker.http.threads

Explanation: define the number of HTTP server working threads. The default value is 40. For large clusters, it can be adjusted to 80 ~ 100

3. File sorting merge optimization

Attribute: mapreduce.task.io.sort.factor

Explanation: the number of data streams merged simultaneously during file sorting, which also defines the number of files opened at the same time. The default value is 10. If you increase this parameter, you can significantly reduce disk IO, that is, reduce the number of file reads.

4. Set task concurrency

Attribute: mapreduce.map.speculative

Explanation: this attribute controls whether speculative execution is enabled for map tasks. If there are many small tasks, setting it to true can noticeably speed up job completion, because slow task attempts are re-run on other nodes and the fastest copy wins. For tasks with very long run times or high per-attempt overhead, it is recommended to set it to false. The idea is similar to how a multi-source download accelerator fetches the same data from several places at once.
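The four settings above (items 1~4) all live in mapred-site.xml; a sketch with example values, which are assumptions rather than tuned recommendations:

<!-- mapred-site.xml (illustrative values) -->
<property>
<name>mapreduce.jobtracker.handler.count</name>
<value>50</value> <!-- default 10; 50~100 depending on available memory -->
</property>
<property>
<name>mapreduce.tasktracker.http.threads</name>
<value>80</value> <!-- default 40; 80~100 for large clusters -->
</property>
<property>
<name>mapreduce.task.io.sort.factor</name>
<value>50</value> <!-- default 10; larger values merge more streams at once and reduce disk IO -->
</property>
<property>
<name>mapreduce.map.speculative</name>
<value>true</value> <!-- enable speculative execution when there are many small tasks -->
</property>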

5. Compression of MR output data

Properties: mapreduce.map.output.compress, mapreduce.output.fileoutputformat.compress

Explanation: for large clusters, it is recommended to set the output of map reduce to compressed data, but not for small clusters.

6. Number of optimized Mapper and Reducer

Properties:

mapreduce.tasktracker.map.tasks.maximum

mapreduce.tasktracker.reduce.tasks.maximum

Explanation: the above two attributes set the maximum number of map and reduce tasks, respectively, that a single node (TaskTracker) can run at the same time.

When setting these two parameters, consider the number of CPU cores, the disks and the memory capacity. Suppose a node has an 8-core CPU and the job is very CPU-intensive; then the map count could be set to 4. If the job is not particularly CPU-intensive, the map count could be set to 40 and the reduce count to 20. After changing these values, watch whether tasks wait for a long time; if so, reduce the numbers to speed up execution. If the values are set too high, they cause a lot of context switching and data exchange between memory and disk. There is no standard value here; the choice has to be made from the business, the hardware configuration and experience.

Also, do not run too many MapReduce jobs at the same time; they will consume too much memory and tasks will run very slowly. Set a maximum MR task concurrency based on the number of CPU cores and the memory capacity, so that a task of a fixed data size can be held entirely in memory, avoiding frequent memory-disk exchanges, reducing disk IO and improving performance.

Approximate estimation formulas:

map = 2 + ⅔ × cpu_core

reduce = 2 + ⅓ × cpu_core
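Plugging a hypothetical 8-core node into these formulas gives roughly 2 + ⅔ × 8 ≈ 7 map slots and 2 + ⅓ × 8 ≈ 5 reduce slots; a mapred-site.xml sketch with those illustrative numbers:

<!-- mapred-site.xml (values derived from the formula for an assumed 8-core node) -->
<property>
<name>mapreduce.tasktracker.map.tasks.maximum</name>
<value>7</value>
</property>
<property>
<name>mapreduce.tasktracker.reduce.tasks.maximum</name>
<value>5</value>
</property>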

5. HBase optimization

1. Append content to HDFS file

Isn't appending content to HDFS files normally not allowed? Right, but HBase depends on append/sync support; see the setting below:

Attribute: dfs.support.append

Files: hdfs-site.xml, hbase-site.xml

Explanation: enabling HDFS append/sync support lets HDFS cooperate with HBase's data synchronization and persistence. The default value is true.

2. Optimize the maximum number of file openings allowed for DataNode

Attribute: dfs.datanode.max.transfer.threads

File: hdfs-site.xml

Explanation: HBase generally operates on a large number of files at once. Set this to 4096 or higher depending on the cluster size and the amount of data activity. Default: 4096.

3. Optimize latency for data operations with high latency

Attribute: dfs.image.transfer.timeout

File: hdfs-site.xml

Explanation: if the delay is very high for a data operation and the socket needs to wait longer, it is recommended to set this value to a larger value (60000 milliseconds by default) to ensure that the socket will not be timed out.
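A hdfs-site.xml sketch for items 2 and 3; the numbers are examples in line with the suggestions above, not fixed recommendations:

<!-- hdfs-site.xml (illustrative values) -->
<property>
<name>dfs.datanode.max.transfer.threads</name>
<value>8192</value> <!-- default 4096; raise for large clusters or heavy HBase workloads -->
</property>
<property>
<name>dfs.image.transfer.timeout</name>
<value>120000</value> <!-- default 60000 ms; increase for high-latency operations -->
</property>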

4. Optimize data write efficiency

Properties:

mapreduce.map.output.compress

mapreduce.map.output.compress.codec

File: mapred-site.xml

Explanation: enabling these two settings can greatly improve file write efficiency and reduce write time. Set the first property to true and the second to org.apache.hadoop.io.compress.GzipCodec.
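A mapred-site.xml sketch that enables map output compression with the codec named above:

<!-- mapred-site.xml -->
<property>
<name>mapreduce.map.output.compress</name>
<value>true</value>
</property>
<property>
<name>mapreduce.map.output.compress.codec</name>
<value>org.apache.hadoop.io.compress.GzipCodec</value>
</property>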

5. Optimize DataNode storage

Attribute: dfs.datanode.failed.volumes.tolerated

File: hdfs-site.xml

Explanation: the default value is 0, which means that when one disk in a DataNode fails, the whole DataNode is considered to be down. If it is changed to 1, a single disk failure only causes the affected data to be re-replicated to other healthy DataNodes, while the current DataNode keeps working.
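A hdfs-site.xml sketch that tolerates one failed volume per DataNode, as described above:

<!-- hdfs-site.xml -->
<property>
<name>dfs.datanode.failed.volumes.tolerated</name>
<value>1</value> <!-- default 0: any disk failure marks the whole DataNode as down -->
</property>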

6. Set the number of RPC listeners

Attribute: hbase.regionserver.handler.count

File: hbase-site.xml

Explanation: the default value is 30. It is used to specify the number of RPC listeners. It can be adjusted according to the number of requests from the client. This value is increased when there are many read-write requests.

7. Optimize HStore file size

Attribute: hbase.hregion.max.filesize

File: hbase-site.xml

Explanation: the default value is 10737418240 (10GB). If you need to run MR tasks over HBase, you can reduce this value, because one region corresponds to one map task; if a single region is too large, the map task takes too long. This value means that once a region's store files reach this size, the region will be split in two.

8. Optimize hbase client cache

Attribute: hbase.client.write.buffer

File: hbase-site.xml

Explanation: specifies the HBase client write buffer. Increasing this value reduces the number of RPC calls but consumes more client memory; decreasing it does the opposite. Generally, set a reasonable buffer size to cut down the number of RPCs.

9. Specifies the number of rows obtained by scan.next scanning HBase

Attribute: hbase.client.scanner.caching

File: hbase-site.xml

Explanation: used to specify the default number of rows obtained by the scan.next method. The larger the value, the greater the memory consumption.
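A combined hbase-site.xml sketch for items 6~9 above; every value below is an illustrative assumption and must be tuned to your workload:

<!-- hbase-site.xml (illustrative values) -->
<property>
<name>hbase.regionserver.handler.count</name>
<value>100</value> <!-- default 30; raise when there are many read/write requests -->
</property>
<property>
<name>hbase.hregion.max.filesize</name>
<value>5368709120</value> <!-- 5G here; smaller regions mean shorter map tasks for MR over HBase -->
</property>
<property>
<name>hbase.client.write.buffer</name>
<value>5242880</value> <!-- 5M client write buffer; larger buffer = fewer RPCs, more memory -->
</property>
<property>
<name>hbase.client.scanner.caching</name>
<value>500</value> <!-- rows fetched per RPC and cached for scanner.next(); larger = more memory -->
</property>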

6. Memory optimization

HBase needs a lot of memory, since tables can be cached in memory. Generally, about 70% of the available memory is given to HBase's Java heap. However, do not allocate an extremely large heap: if a GC pause lasts too long, the RegionServer becomes unavailable for a long time. Generally 16~48G of heap is enough. Also note that if the framework takes so much memory that the operating system runs short, the framework itself will be dragged down together with the system services.

7. JVM optimization

File involved: hbase-env.sh

1. Parallel GC

Parameter: -XX:+UseParallelGC

Explanation: turn on parallel GC

2. Number of threads simultaneously processing garbage collection

Parameter: -XX:ParallelGCThreads=N, where N is the number of CPU cores minus 1

Explanation: this property sets the number of threads that process garbage collection at the same time.

3. Disable manual GC

Parameter: -XX:+DisableExplicitGC

Explanation: prevents application code from triggering a GC manually via System.gc()

8. Zookeeper optimization

1. Optimize Zookeeper session timeout

Parameter: zookeeper.session.timeout

File: hbase-site.xml

Explanation: in hbase-site.xml, set zookeeper.session.timeout to 30 seconds or less to bound failure detection (20~30 seconds is a good start; the default in recent HBase versions is 90 seconds). This value directly determines the maximum time it takes the master to discover that a server is down. If the value is too small, then when HBase is writing a large amount of data and a GC pause occurs, the RegionServer may be temporarily unresponsive, fail to send heartbeats to ZooKeeper in time, and end up being treated as dead and shut down. As a rule of thumb, a cluster of about 20 nodes should be equipped with 5 ZooKeeper nodes.
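A hbase-site.xml sketch that sets the session timeout to the 30 seconds suggested above (an assumed value; tune it together with your GC settings):

<!-- hbase-site.xml -->
<property>
<name>zookeeper.session.timeout</name>
<value>30000</value> <!-- milliseconds; too small a value plus long GC pauses can get a RegionServer expired -->
</property>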
