Hadoop learning - HBase installation, command-line operations and Java operations

Before using HBase, install ZooKeeper first.

I wrote about that earlier: https://www.cnblogs.com/wpbing/p/11309761.html

 

Let's briefly introduce HBase.

HBase is a database that provides real-time random reads and writes of data.

It is a NoSQL database with no fixed schema. It only supports fairly simple queries; things like join queries across multiple tables are very hard for it (or at least for me).

This is the first time I've come across this kind of NoSQL, and its table structure is quite different from a relational one. In short:

each row has a row key (something like a primary key),

and beyond that you just define one or more column families.

Within each column family,

every key/value pair is called a cell, and a cell can keep multiple versions of its value, not just one.
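To make that concrete, here is a rough sketch (illustration only, not how HBase physically stores anything) of the logical model as nested sorted maps: row key, then column family, then column, then timestamped versions of the value.

import java.util.NavigableMap;
import java.util.TreeMap;

// Illustration only: the *logical* shape of an HBase table as nested sorted maps.
// rowkey -> column family -> column qualifier -> timestamp -> value (raw bytes)
public class LogicalModelSketch {
    public static void main(String[] args) {
        NavigableMap<Long, byte[]> versions = new TreeMap<Long, byte[]>();
        versions.put(System.currentTimeMillis(), "zhangsan".getBytes()); // one version of the value

        NavigableMap<String, NavigableMap<Long, byte[]>> columns = new TreeMap<String, NavigableMap<Long, byte[]>>();
        columns.put("username", versions); // column "username" inside a column family

        NavigableMap<String, NavigableMap<String, NavigableMap<Long, byte[]>>> families =
                new TreeMap<String, NavigableMap<String, NavigableMap<Long, byte[]>>>();
        families.put("base_info", columns); // column family "base_info"

        NavigableMap<String, NavigableMap<String, NavigableMap<String, NavigableMap<Long, byte[]>>>> table =
                new TreeMap<String, NavigableMap<String, NavigableMap<String, NavigableMap<Long, byte[]>>>>();
        table.put("001", families); // row key "001"

        System.out.println(table); // every level is kept in sorted order, just like HBase keeps its data sorted
    }
}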

 

 

 

The table model of HBase is different from that of a relational database:

- an HBase table has no fixed field definitions;

- in an HBase table, each row stores a number of key/value pairs;

- an HBase table is partitioned into column families, and you specify which column family each kv is inserted into;

- an HBase table is physically stored by column family: the data of different column families is kept in different files;

- each row in an HBase table has a fixed row key, and row keys cannot repeat within a table;

- the data in HBase -- row keys, keys and values -- are all of type byte[]; HBase does not maintain data types for you (a small example follows this list);

- HBase has poor support for transactions.
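Since everything is byte[], the client converts values to and from real types itself, typically with the Bytes utility class from the hbase-client library. A tiny sketch (assumes hbase-client is on the classpath):

import org.apache.hadoop.hbase.util.Bytes;

// HBase stores everything as raw bytes; the client is responsible for the type conversions.
public class BytesSketch {
    public static void main(String[] args) {
        byte[] ageBytes = Bytes.toBytes(18);          // int -> byte[]
        byte[] nameBytes = Bytes.toBytes("zhangsan"); // String -> byte[]

        int age = Bytes.toInt(ageBytes);              // byte[] -> int; the caller must know the original type
        String name = Bytes.toString(nameBytes);      // byte[] -> String

        System.out.println(name + " is " + age);
    }
}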

 

Table data of HBase is stored in the HDFS file system.

HBase is a distributed system:

there is one management role, the HMaster (usually two are run, one active and one standby),

and the data node role, the HRegionServer (many of them, depending on data volume).

The master takes care of assigning regions and other administrative work,

and each region server manages the region data of the tables assigned to it.

A region server is responsible for a range of each table's rowkeys. ZooKeeper is used to detect when a region server goes down, and the master then decides who takes over its regions.
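If you want to see these roles from code, the Admin API can report the active master, backup masters and live region servers. A sketch against the 1.x client API used later in this post (the hdp-0x host names are just my cluster's):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.ClusterStatus;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.ServerName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

// Sketch: list the active master, backup masters and region servers (HBase 1.x client API).
public class ClusterRolesSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        conf.set("hbase.zookeeper.quorum", "hdp-01:2181,hdp-02:2181,hdp-03:2181,hdp-04:2181");
        Connection conn = ConnectionFactory.createConnection(conf);
        Admin admin = conn.getAdmin();

        ClusterStatus status = admin.getClusterStatus();
        System.out.println("active master : " + status.getMaster());
        System.out.println("backup masters: " + status.getBackupMasters());
        for (ServerName rs : status.getServers()) {          // live region servers
            System.out.println("region server : " + rs);
        }

        admin.close();
        conn.close();
    }
}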

 

This is the general working mechanism of HBase.

How does the client know which server a piece of data is on? It first looks up an index table, the hbase:meta table.

And where is that table?

The location of this index table is recorded in ZooKeeper.

So the client first asks ZooKeeper which region server hosts hbase:meta, then reads hbase:meta from that server to find out where the data it wants actually lives.
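You normally never do this lookup yourself, but you can inspect the same region-to-server mapping with a RegionLocator (or with scan 'hbase:meta' in the shell). A sketch against the 1.x client API; user_info is the table created later in this post:

import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HRegionLocation;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.RegionLocator;
import org.apache.hadoop.hbase.util.Bytes;

// Sketch: show which region server holds which rowkey range of a table (HBase 1.x client API).
public class RegionLookupSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        conf.set("hbase.zookeeper.quorum", "hdp-01:2181,hdp-02:2181,hdp-03:2181,hdp-04:2181");
        Connection conn = ConnectionFactory.createConnection(conf);

        RegionLocator locator = conn.getRegionLocator(TableName.valueOf("user_info"));
        List<HRegionLocation> locations = locator.getAllRegionLocations(); // resolved via hbase:meta
        for (HRegionLocation loc : locations) {
            System.out.println("rowkey range ["
                    + Bytes.toStringBinary(loc.getRegionInfo().getStartKey()) + ", "
                    + Bytes.toStringBinary(loc.getRegionInfo().getEndKey()) + ") -> "
                    + loc.getServerName());
        }

        locator.close();
        conn.close();
    }
}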

 

 

 

To use HBase you need your own Hadoop cluster with HDFS running normally, plus a working ZooKeeper. That's all.

Installation is simple.

Unzip the hbase installation package.

Modify hbase-env.sh:

export JAVA_HOME=/root/apps/jdk1.7.0_67

export HBASE_MANAGES_ZK=false

 

Modify hbase-site.xml

         <configuration>

<!-- specify the path where hbase stores data on HDFS -->

        <property>

                <name>hbase.rootdir</name>

                <value>hdfs://hdp-01:9000/hbase</value>

        </property>

<!-- specify that hbase runs in distributed mode -->

        <property>

                <name>hbase.cluster.distributed</name>

                <value>true</value>

        </property>

<!-- specify the zk addresses; separate multiple addresses with "," -->

        <property>

                <name>hbase.zookeeper.quorum</name>

                <value>hdp-01:2181,hdp-02:2181,hdp-03:2181,hdp-04:2181</value>

        </property>

         </configuration>

 

 

Modify regionservers

hdp-01

hdp-02

hdp-03

hdp-04

bin/start-hbase.sh

After startup, you can also pick any machine in the cluster and start a standby master on it:

 

bin/hbase-daemon.sh start master

The newly started master will be in the backup state

If errors are reported, check the logs for the cause, and pay special attention to time errors: the clocks of all nodes need to be kept in sync.

 

 

HBase's web UI is exposed on port 16010.

As mentioned above, a standby master can be started at any time after the cluster is up, on any machine:

bin/hbase-daemon.sh start master

Then try visiting the web UI (port 16010 on the master's host) in a browser.

 

 

The hbase:meta system table is the index of the data: it records which rowkey range of each table is stored on which region server.

3. Start the command line client of hbase

bin/hbase shell

hbase> list      // list tables
hbase> status    // view cluster status
hbase> version   // view the version

 

1.1. HBASE table model

The table model of hbase is quite different from that of MySQL and other relational databases.

In hbase's table model there is the concept of a row, but no concept of fixed fields.

Each row holds key/value pairs, and the number of key/value pairs can vary from row to row.

1.1.1. Key points of hbase table model:

1. A table with a table name

2. A table can be divided into multiple column families (the data of different column families will be stored in different files)

3. Each row in the table has a row key, and row keys cannot be repeated in the table

4. Each pair of kv data in the table is called a cell

5. hbase can store multiple historical versions of data (the number of historical versions can be configured)

6. Because the whole table can hold a large amount of data, it is split horizontally into several regions (identified by rowkey range), and the data of different regions is also stored in different files (a pre-split example follows this list)

 

7. hbase stores inserted data in sorted order:

Point 1: rows are sorted by row key;

Point 2: the kv pairs within a row are sorted first by column family and then by key (column name).
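As mentioned in point 6 above, a table is split into regions by rowkey range. A hedged sketch of pre-splitting a table into regions at creation time, using the same 1.x API as the DDL code further down (the table name and split points are made up for illustration):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.util.Bytes;

// Sketch: create a table pre-split into 3 regions by rowkey range: (-inf,"100"), ["100","200"), ["200",+inf)
public class PreSplitSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        conf.set("hbase.zookeeper.quorum", "hdp-01:2181,hdp-02:2181,hdp-03:2181,hdp-04:2181");
        Connection conn = ConnectionFactory.createConnection(conf);
        Admin admin = conn.getAdmin();

        HTableDescriptor desc = new HTableDescriptor(TableName.valueOf("t_presplit_demo"));
        desc.addFamily(new HColumnDescriptor("base_info"));

        byte[][] splitKeys = new byte[][] { Bytes.toBytes("100"), Bytes.toBytes("200") };
        admin.createTable(desc, splitKeys); // each resulting region holds one rowkey range

        admin.close();
        conn.close();
    }
}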

 

1.1. hbase command line client operation

1.1.1.1. Table building:

create 't_user_info','base_info','extra_info'

(table name, column family name, column family name)

 

 

1.1.1.2. Insert data:

hbase(main):011:0> put 't_user_info','001','base_info:username','zhangsan'

0 row(s) in 0.2420 seconds

 

hbase(main):012:0> put 't_user_info','001','base_info:age','18'

0 row(s) in 0.0140 seconds

 

hbase(main):013:0> put 't_user_info','001','base_info:sex','female'

0 row(s) in 0.0070 seconds

 

hbase(main):014:0> put 't_user_info','001','extra_info:career','it'

0 row(s) in 0.0090 seconds

 

hbase(main):015:0> put 't_user_info','002','extra_info:career','actoress'

0 row(s) in 0.0090 seconds

 

hbase(main):016:0> put 't_user_info','002','base_info:username','liuyifei'

0 row(s) in 0.0060 seconds

 

 

1.1.1.3. Query data, method 1: scan

hbase(main):017:0> scan 't_user_info'

ROW                               COLUMN+CELL                                                                                    

 001                              column=base_info:age, timestamp=1496567924507, value=18                                        

 001                              column=base_info:sex, timestamp=1496567934669, value=female                                    

 001                              column=base_info:username, timestamp=1496567889554, value=zhangsan                             

 001                              column=extra_info:career, timestamp=1496567963992, value=it                                    

 002                              column=base_info:username, timestamp=1496568034187, value=liuyifei                             

 002                              column=extra_info:career, timestamp=1496568008631, value=actoress                              

2 row(s) in 0.0420 seconds

 

1.1.1.4. Query data, method 2: get a single row

hbase(main):020:0> get 't_user_info','001'

COLUMN                            CELL                                                                                           

 base_info:age                    timestamp=1496568160192, value=19                                                               

 base_info:sex                    timestamp=1496567934669, value=female                                                          

 base_info:username               timestamp=1496567889554, value=zhangsan                                                        

 extra_info:career                timestamp=1496567963992, value=it                                                              

4 row(s) in 0.0770 seconds

 

1.1.1.5. Delete a single kv

hbase(main):021:0> delete 't_user_info','001','base_info:sex'

0 row(s) in 0.0390 seconds

 

Delete entire row of data:

hbase(main):024:0> deleteall 't_user_info','001'

0 row(s) in 0.0090 seconds

 

hbase(main):025:0> get 't_user_info','001'

COLUMN                            CELL                                                                                           

0 row(s) in 0.0110 seconds

 

1.1.1.6. Delete the whole table:

hbase(main):028:0> disable 't_user_info'

0 row(s) in 2.3640 seconds

 

hbase(main):029:0> drop 't_user_info'

0 row(s) in 1.2950 seconds

 

hbase(main):030:0> list

TABLE                                                                                                                            

0 row(s) in 0.0130 seconds

 

=> []

 

1.1. An important feature of HBase -- sorting (by row key)

hbase automatically sorts the data inserted into it and stores it in that order:

Sorting rules: first by row key, then by column family name, then by column (key) name, all in lexicographic (dictionary) order.

 

This feature of HBase has a lot to do with query efficiency.

For example: a table that stores user information, including name, home province, age, occupation, etc.

Then, in a business system, you often need to:

query all users in a given province;

and often, query all users with a given surname in a given province.

Idea: if the users of the same province are stored contiguously in HBase's storage files, and the users with the same surname within a province are also stored contiguously, then both of the queries above become much more efficient!!!

Method: encode the query conditions into the rowkey.
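A minimal sketch of that idea, assuming a hypothetical rowkey layout province_surname_userid (the table name, helper and layout here are made up for illustration): because rows are stored sorted by rowkey, all users of one province, and all users with one surname inside that province, sit next to each other, so a single prefix scan answers both queries.

import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.filter.PrefixFilter;
import org.apache.hadoop.hbase.util.Bytes;

// Sketch: encode the query conditions into the rowkey: province_surname_userid
public class RowkeyDesignSketch {

    // e.g. buildRowkey("hebei", "zhang", "0001") -> "hebei_zhang_0001"
    static String buildRowkey(String province, String surname, String userId) {
        return province + "_" + surname + "_" + userId;
    }

    // all users of one province, or of one surname within one province, form a contiguous rowkey range
    static void scanByPrefix(Connection conn, String prefix) throws Exception {
        Table table = conn.getTable(TableName.valueOf("t_user_by_region")); // hypothetical table
        Scan scan = new Scan(Bytes.toBytes(prefix));              // start scanning at the prefix
        scan.setFilter(new PrefixFilter(Bytes.toBytes(prefix)));  // stop returning rows once they no longer match it
        ResultScanner scanner = table.getScanner(scan);
        for (Result r : scanner) {
            System.out.println(Bytes.toString(r.getRow()));
        }
        scanner.close();
        table.close();
    }
}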

 

After we create a table and insert data, we would expect to be able to see the data files in HDFS. But....

There are no data files there at first, yet the data can still be queried. So where does it live? It lives in memory, in a space called the memstore, because memory is faster: the hot data, i.e. the data just written or accessed, is kept there first. But what if the machine goes down at that moment -- is the data lost? No. While writing data into the memstore, HBase also writes a log (the WAL) into a log directory on HDFS.

When the memstore is full, its contents are flushed to HDFS as files.

Try it: after you stop HBase you will find the data files in HDFS.
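You can also force a flush instead of waiting for the memstore to fill up or for a shutdown; a small sketch using the 1.x Admin API (flush 'user_info' in the hbase shell does the same thing):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

// Sketch: force the memstore of a table to be flushed to files on HDFS (HBase 1.x Admin API).
public class FlushSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        conf.set("hbase.zookeeper.quorum", "hdp-01:2181,hdp-02:2181,hdp-03:2181,hdp-04:2181");
        Connection conn = ConnectionFactory.createConnection(conf);
        Admin admin = conn.getAdmin();

        admin.flush(TableName.valueOf("user_info")); // after this, data files show up under /hbase on HDFS

        admin.close();
        conn.close();
    }
}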

 

What a Bloom filter is for: judging whether a piece of data has appeared before.

The principle of a Bloom filter: each piece of data is hashed into a few bit positions (0/1),

and a large bit array records those positions for every piece of data, with the bits of different data overlapping in the same array. For example, one value sets bits 1 and 3, another sets bits 1 and 4; after adding both, bits 1, 3 and 4 are set. If a queried value hashes to bits 1 and 5, and bit 5 is not set, we can be sure that value never appeared.

So a Bloom filter can definitively rule out data that has never appeared,

but when it says data has appeared, that may be a false positive: the data may in fact never have appeared.
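A toy version of the idea, only to show the principle (a fixed bit array plus a couple of hash positions; this has nothing to do with HBase's actual implementation):

import java.util.BitSet;

// Toy Bloom filter: it can say "definitely not seen" with certainty,
// but "probably seen" may be a false positive.
public class BloomSketch {
    private final BitSet bits = new BitSet(1024);

    private int[] positions(String value) {
        int h1 = Math.abs(value.hashCode() % 1024);
        int h2 = Math.abs((value.hashCode() * 31 + 7) % 1024); // a second, independent-ish hash position
        return new int[] { h1, h2 };
    }

    public void add(String value) {
        for (int p : positions(value)) {
            bits.set(p);
        }
    }

    public boolean mightContain(String value) {
        for (int p : positions(value)) {
            if (!bits.get(p)) {
                return false; // a 0 bit means this value was definitely never added
            }
        }
        return true; // all bits set: probably added (could be a false positive)
    }

    public static void main(String[] args) {
        BloomSketch bloom = new BloomSketch();
        bloom.add("rowkey-001");
        System.out.println(bloom.mightContain("rowkey-001")); // true
        System.out.println(bloom.mightContain("rowkey-999")); // almost certainly false
    }
}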

 

How this is used in HBase: a column family of a table managed by a region server is really stored on HDFS, under an HDFS directory, and there is usually more than one file for a column family. When the data of a column family changes, HBase writes new files rather than modifying files on HDFS in place; and even without changes, a column family with many keys and values will end up spread over several files in that directory.

Every one of these files carries a Bloom filter built from the kv data it contains. When you query a piece of data, HBase first checks it against each file's Bloom filter and only searches the files that might contain it (enabling a Bloom filter on a column family from the client is shown in the testAlterTable example below).

 

Some Java API examples

 

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.regionserver.BloomType;
import org.junit.Before;
import org.junit.Test;


/**
 *  Steps:
 *  1. build a connection
 *  2. get the table DDL tool (Admin) from the connection
 *  3. admin.createTable(table descriptor);
 *  4. admin.disableTable(table name);
 *  5. admin.deleteTable(table name);
 *  6. admin.modifyTable(table name, table descriptor);
 *
 * @author hunter.d
 *
 */
public class HbaseClientDDL {
    Connection conn = null;
    
    @Before
    public void getConn() throws Exception{
        // Build a connection object
        Configuration conf = HBaseConfiguration.create(); // automatically loads hbase-site.xml if it is on the classpath
        conf.set("hbase.zookeeper.quorum", "hdp-01:2181,hdp-02:2181,hdp-03:2181,hdp-04:2181");
        
        conn = ConnectionFactory.createConnection(conf);
    }
    
    
    
    /**
     * DDL
     * @throws Exception 
     */
    @Test
    public void testCreateTable() throws Exception{

        // get a DDL operator (Admin)
        Admin admin = conn.getAdmin();
        
        // Create a table definition description object
        HTableDescriptor hTableDescriptor = new HTableDescriptor(TableName.valueOf("user_info"));
        
        // Create a column family definition description object
        HColumnDescriptor hColumnDescriptor_1 = new HColumnDescriptor("base_info");
        hColumnDescriptor_1.setMaxVersions(3); // set the max number of versions kept in this column family; the default is 1
        
        HColumnDescriptor hColumnDescriptor_2 = new HColumnDescriptor("extra_info");
        
        // Place column family definition information objects into table definition objects
        hTableDescriptor.addFamily(hColumnDescriptor_1);
        hTableDescriptor.addFamily(hColumnDescriptor_2);
        
        
        // use the DDL operator (admin) to create the table
        admin.createTable(hTableDescriptor);
        
        // Close connection
        admin.close();
        conn.close();
        
    }
    
    
    /**
     * Delete table
     * @throws Exception 
     */
    @Test
    public void testDropTable() throws Exception{
        
        Admin admin = conn.getAdmin();
        
        // disable the table first
        admin.disableTable(TableName.valueOf("user_info"));
        // Delete table
        admin.deleteTable(TableName.valueOf("user_info"));
        
        
        admin.close();
        conn.close();
    }
    
    // Modify the table definition -- add a column family
    @Test
    public void testAlterTable() throws Exception{
        
        Admin admin = conn.getAdmin();
        
        // Fetch old table definition information
        HTableDescriptor tableDescriptor = admin.getTableDescriptor(TableName.valueOf("user_info"));
        
        
        // Construct a new definition of column family
        HColumnDescriptor hColumnDescriptor = new HColumnDescriptor("other_info");
        hColumnDescriptor.setBloomFilterType(BloomType.ROWCOL); // set the Bloom filter type for this column family
        
        // Adding a column family definition to a table definition object
        tableDescriptor.addFamily(hColumnDescriptor);
        
        
        // submit the modified table definition through admin
        admin.modifyTable(TableName.valueOf("user_info"), tableDescriptor);
        
        
        admin.close();
        conn.close();
        
    }
    
    
    /**
     * DML -- inserting, deleting and querying data is done in the HbaseClientDML class below
     */
    
    

}

 

import java.util.ArrayList;
import java.util.Iterator;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.CellScanner;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;
import org.junit.Before;
import org.junit.Test;

public class HbaseClientDML {
    Connection conn = null;
    
    @Before
    public void getConn() throws Exception{
        // Build a connection object
        Configuration conf = HBaseConfiguration.create(); // automatically loads hbase-site.xml if it is on the classpath
        conf.set("hbase.zookeeper.quorum", "hdp-01:2181,hdp-02:2181,hdp-03:2181");
        
        conn = ConnectionFactory.createConnection(conf);
    }
    
    
    /**
     * Insert data
     * (an update is just another put that overwrites the old value)
     * @throws Exception 
     */
    @Test
    public void testPut() throws Exception{
        
        // get the Table object used for DML operations
        Table table = conn.getTable(TableName.valueOf("user_info"));
        
        // wrap the data to insert in a Put object (one Put corresponds to exactly one rowkey)
        Put put = new Put(Bytes.toBytes("001"));
        put.addColumn(Bytes.toBytes("base_info"), Bytes.toBytes("username"), Bytes.toBytes("Zhang San"));
        put.addColumn(Bytes.toBytes("base_info"), Bytes.toBytes("age"), Bytes.toBytes("18"));
        put.addColumn(Bytes.toBytes("extra_info"), Bytes.toBytes("addr"), Bytes.toBytes("Beijing"));
        
        
        Put put2 = new Put(Bytes.toBytes("002"));
        put2.addColumn(Bytes.toBytes("base_info"), Bytes.toBytes("username"), Bytes.toBytes("Li Si"));
        put2.addColumn(Bytes.toBytes("base_info"), Bytes.toBytes("age"), Bytes.toBytes("28"));
        put2.addColumn(Bytes.toBytes("extra_info"), Bytes.toBytes("addr"), Bytes.toBytes("Shanghai"));
    
        
        ArrayList<Put> puts = new ArrayList<>();
        puts.add(put);
        puts.add(put2);
        
        
        // insert the batch
        table.put(puts);
        
        table.close();
        conn.close();
        
    }
    
    
    /**
     * Insert a large amount of data in a loop
     * @throws Exception 
     */
    @Test
    public void testManyPuts() throws Exception{
        
        Table table = conn.getTable(TableName.valueOf("user_info"));
        ArrayList<Put> puts = new ArrayList<>();
        
        for(int i=0;i<100000;i++){
            Put put = new Put(Bytes.toBytes(""+i));
            put.addColumn(Bytes.toBytes("base_info"), Bytes.toBytes("username"), Bytes.toBytes("Zhang San"+i));
            put.addColumn(Bytes.toBytes("base_info"), Bytes.toBytes("age"), Bytes.toBytes((18+i)+""));
            put.addColumn(Bytes.toBytes("extra_info"), Bytes.toBytes("addr"), Bytes.toBytes("Beijing"));
            
            puts.add(put);
        }
        
        table.put(puts);
        
        table.close();
        conn.close();
    }
    
    /**
     * Delete
     * @throws Exception 
     */
    @Test
    public void testDelete() throws Exception{
        Table table = conn.getTable(TableName.valueOf("user_info"));
        
        // Construct an object to encapsulate the data information to be deleted
        Delete delete1 = new Delete(Bytes.toBytes("001"));  // no column specified: deletes the whole row "001"
        
        Delete delete2 = new Delete(Bytes.toBytes("002"));
        delete2.addColumn(Bytes.toBytes("extra_info"), Bytes.toBytes("addr"));  // deletes only this column of row "002"
        
        ArrayList<Delete> dels = new ArrayList<>();
        dels.add(delete1);
        dels.add(delete2);
        
        table.delete(dels);
        
        
        table.close();
        conn.close();
    }
    
    /**
     * Get a single row
     * @throws Exception 
     */
    @Test
    public void testGet() throws Exception{
        
        Table table = conn.getTable(TableName.valueOf("user_info"));
        
        Get get = new Get("001".getBytes());
        
        Result result = table.get(get);
        
        // take the value of one specified key (column) out of the result
        byte[] value = result.getValue("base_info".getBytes(), "age".getBytes());
        System.out.println(new String(value));
        
        System.out.println("-------------------------");
        
        // iterate over all kv cells of this row
        CellScanner cellScanner = result.cellScanner();
        while(cellScanner.advance()){
            Cell cell = cellScanner.current();
            
            byte[] rowArray = cell.getRowArray();             // backing array holding this kv's row key
            byte[] familyArray = cell.getFamilyArray();       // backing array holding the column family name
            byte[] qualifierArray = cell.getQualifierArray(); // backing array holding the column name
            byte[] valueArray = cell.getValueArray();         // backing array holding the value
            
            System.out.println("Row key: "+new String(rowArray,cell.getRowOffset(),cell.getRowLength()));
            System.out.println("Family name: "+new String(familyArray,cell.getFamilyOffset(),cell.getFamilyLength()));
            System.out.println("Column names: "+new String(qualifierArray,cell.getQualifierOffset(),cell.getQualifierLength()));
            System.out.println("value: "+new String(valueArray,cell.getValueOffset(),cell.getValueLength()));
            
        }
        
        table.close();
        conn.close();
        
    }
    
    
    /**
     * Query data by row key range
     * @throws Exception 
     */
    @Test
    public void testScan() throws Exception{
        
        Table table = conn.getTable(TableName.valueOf("user_info"));
        
        // a scan includes the start row key but excludes the stop row key; to make an end row key inclusive, append a low invisible byte to it (here "\001")
        Scan scan = new Scan("10".getBytes(), "10000\001".getBytes());
        
        ResultScanner scanner = table.getScanner(scan);
        
        Iterator<Result> iterator = scanner.iterator();
        
        while(iterator.hasNext()){
            
            Result result = iterator.next();
            // iterate over all kv cells of this row
            CellScanner cellScanner = result.cellScanner();
            while(cellScanner.advance()){
                Cell cell = cellScanner.current();
                
                byte[] rowArray = cell.getRowArray();             // backing array holding this kv's row key
                byte[] familyArray = cell.getFamilyArray();       // backing array holding the column family name
                byte[] qualifierArray = cell.getQualifierArray(); // backing array holding the column name
                byte[] valueArray = cell.getValueArray();         // backing array holding the value
                
                System.out.println("Row key: "+new String(rowArray,cell.getRowOffset(),cell.getRowLength()));
                System.out.println("Family name: "+new String(familyArray,cell.getFamilyOffset(),cell.getFamilyLength()));
                System.out.println("Column names: "+new String(qualifierArray,cell.getQualifierOffset(),cell.getQualifierLength()));
                System.out.println("value: "+new String(valueArray,cell.getValueOffset(),cell.getValueLength()));
            }
            System.out.println("----------------------");
        }
        scanner.close();
        table.close();
        conn.close();
    }

}

Tags: HBase Hadoop Apache Zookeeper
