1. Purpose of this document
This document first uses an HBase snapshot to export the full historical data and restore it on a new HBase cluster. It then modifies the original ExportSnapshot class so that only the files that changed between two snapshots are exported, which gives incremental backup and restore of HBase across clusters.
- Test environment
1. CDP 7.1.4 with Kerberos enabled and HBase 2.2.3
2. CDP 7.1.6 with Kerberos enabled and HBase 2.2.3
3. All operations are performed as the user ldapuser1
2. Operation steps
2.1 Generate a test table on the CDP 7.1.4 cluster
2.1.1 Use the HBase pe (PerformanceEvaluation) tool to generate a 10 GB table
hbase org.apache.hadoop.hbase.PerformanceEvaluation --compress=SNAPPY --size=10 sequentialWrite 10
The first attempt fails because the user lacks permission to create the table.
Grant the permission through a Ranger policy, then run the command again.
2.1.2 Check the size of the test table on HDFS
hadoop fs -du -h /hbase/data/default/
Because TestTable was created with SNAPPY compression, it occupies only 2.5 GB on HDFS.
2.1.3 Total row count of TestTable: 10485760 rows
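The row count is consistent with how PerformanceEvaluation sizes its workload. The sketch below assumes the tool's stock default of 1024 * 1024 rows per GB requested, which is not stated in this document; `pe_total_rows` is a hypothetical helper name used only for the arithmetic.

```shell
# Rows that pe is expected to write for a given --size value (in GB).
# Assumption: PerformanceEvaluation default of 1024 * 1024 rows per GB.
pe_total_rows() {
  size_gb="$1"
  echo $(( size_gb * 1024 * 1024 ))
}

# --size=10 as used in step 2.1.1
pe_total_rows 10   # prints 10485760
```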
2.2 Full backup and restore of HBase across clusters through snapshots
2.2.1 Create a snapshot of TestTable
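A snapshot is normally taken with the HBase shell `snapshot` command. The sketch below assumes the snapshot name TestTable-snapshot1 that the export step uses; `snapshot_stmt` is a hypothetical helper that only builds the shell statement, and the line that actually sends it to the cluster is left commented out.

```shell
# Build the HBase shell statement that snapshots a table.
# $1: table name, $2: snapshot name
snapshot_stmt() {
  printf "snapshot '%s', '%s'\n" "$1" "$2"
}

# Print the statement for this walkthrough.
snapshot_stmt TestTable TestTable-snapshot1

# To execute it, pipe the statement into a non-interactive HBase shell:
# snapshot_stmt TestTable TestTable-snapshot1 | hbase shell -n
```

The same helper covers the second snapshot in step 2.3.2 by passing TestTable-snapshot2 as the snapshot name.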
2.2.2 Export the TestTable snapshot data
On the command line, use the ExportSnapshot tool that ships with HBase to export the snapshot:
hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot -snapshot TestTable-snapshot1 -copy-to hdfs://cdp02:8020/tmp/hbasebackup/TestTable-snapshot1
View the exported snapshot data:
hadoop fs -ls /tmp/hbasebackup/TestTable-snapshot1
hadoop fs -du -h /tmp/hbasebackup/TestTable-snapshot1
As the listing shows, exporting a snapshot copies the snapshot metadata and all data files that the snapshot references into the .hbase-snapshot and archive directories under the target directory.
2.2.3 Copy the snapshot data to the CDP 7.1.6 cluster
Because both clusters use Kerberos authentication, mutual trust must be established between them before running distcp (contact the cluster operations staff for this).
hadoop distcp hdfs://cdp02:8020/tmp/hbasebackup/TestTable-snapshot1/.hbase-snapshot/TestTable-snapshot1 hdfs://cdh3.macro.com:8020/hbase/.hbase-snapshot
hadoop distcp hdfs://cdp02:8020/tmp/hbasebackup/TestTable-snapshot1/archive/data/default/TestTable hdfs://cdh3.macro.com:8020/hbase/archive/data/default
The copy fails at first with an insufficient-permission error.
Grant the required permission through a Ranger policy on the CDP 7.1.6 cluster, then run the copy again.
View the snapshot data copied to the CDP 7.1.6 cluster:
hdfs dfs -ls hdfs://cdh3.macro.com:8020/hbase/.hbase-snapshot
hdfs dfs -ls hdfs://cdh3.macro.com:8020/hbase/archive/data/default
hdfs dfs -du -h hdfs://cdh3.macro.com:8020/hbase/
2.2.4 Restore the TestTable table from the TestTable-snapshot1 snapshot
Grant the required permissions to the user first; otherwise the restore fails with an insufficient-permission error.
list_snapshots
restore_snapshot 'TestTable-snapshot1'
2.2.5 Verify that the restored table data matches the data at the time of the snapshot
The table restored from the snapshot has the same row count as at snapshot time, and the table content matches.
2.3 Incremental backup and restore of HBase across clusters through snapshots
2.3.1 Modify the TestTable data
In the hbase shell, modify one existing row and add one new row:
put 'TestTable','11111111111111111111111111','info0:AppendTest','1111'
put 'TestTable','00000000000000000000000000','info0:AppendTest','00000'
TestTable now contains 10485761 rows.
2.3.2 Create a second snapshot of TestTable
2.3.3 Export the incremental data of the second snapshot
This step exports the incremental data between the TestTable-snapshot1 and TestTable-snapshot2 snapshots. The ExportSnapshot tool that ships with HBase has no incremental export mode, so its source code was modified to export only the data that differs between the two snapshots.
GitHub source: https://github.com/javaxsky/hbaseexport
Download the compiled jar packages and place balancer.jar and hbase-export-1.0-SNAPSHOT.jar in the directory "/opt/cloudera/parcels/CDH/lib/hbase/lib".
Run the following command to export the incremental data between the two snapshots to HDFS:
hbase org.hadoop.hbase.dataExport.ExportSnapshot -snapshot TestTable-snapshot2 -copy-to hdfs://cdp02:8020/tmp/hbasebackup/snapshot2-snapshot1/ -snapshot-old TestTable-snapshot1
View the exported data directory:
hadoop fs -du -h /tmp/hbasebackup/snapshot2-snapshot1
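The principle behind the incremental export can be sketched as a set difference over the file lists of the two snapshots. This only illustrates the idea and is not the code in the repository above; the helper name `incremental_files` and the one-file-name-per-line list format are assumptions made for the example.

```shell
# Print the files referenced by the new snapshot but not by the old one.
# $1: text file listing the old snapshot's data files, one name per line
# $2: text file listing the new snapshot's data files, one name per line
incremental_files() {
  old_list="$1"
  new_list="$2"
  if [ -s "$old_list" ]; then
    # -F fixed strings, -x whole-line match, -v keep lines with no match.
    # grep exits with 1 when nothing differs; treat that as success.
    grep -F -x -v -f "$old_list" "$new_list" || [ $? -eq 1 ]
  else
    # Empty old list: every file in the new snapshot has to be exported.
    cat "$new_list"
  fi
}

# Small demonstration with made-up file names.
tmp=$(mktemp -d)
printf '%s\n' hfile-a hfile-b > "$tmp/snapshot1.txt"
printf '%s\n' hfile-a hfile-b hfile-c > "$tmp/snapshot2.txt"
incremental_files "$tmp/snapshot1.txt" "$tmp/snapshot2.txt"   # prints hfile-c
rm -rf "$tmp"
```

HFiles are immutable once written, so new or changed data always lands in files with new names. That is why comparing names is enough here and no content comparison is needed.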
2.3.4 Copy the exported snapshot files to the CDP 7.1.6 cluster
hadoop distcp hdfs://cdp02:8020/tmp/hbasebackup/snapshot2-snapshot1/.hbase-snapshot/TestTable-snapshot2 hdfs://cdh3.macro.com:8020/hbase/.hbase-snapshot
hadoop distcp hdfs://cdp02:8020/tmp/hbasebackup/snapshot2-snapshot1/archive/data/default/TestTable hdfs://cdh3.macro.com:8020/hbase/archive/data/default
View the data directory after importing the incremental data:
hdfs dfs -du -h hdfs://cdh3.macro.com:8020/hbase/
2.3.5 Restore TestTable from the snapshot on the command line and verify the data
disable 'TestTable'
list_snapshots
restore_snapshot 'TestTable-snapshot2'
enable 'TestTable'
get 'TestTable','00000000000000000000000000'
count 'TestTable'
3. Summary
An HBase snapshot records only metadata; it does not copy data.
ExportSnapshot runs at the HDFS level, so it places no additional load on the HBase Master and RegionServer services.
The rewritten ExportSnapshot implements incremental export by comparing the file lists of the two snapshots, so only the files that differ are exported.
Exporting with ExportSnapshot does not inflate the data: the exported size is essentially the same as that of the original HBase table with SNAPPY compression enabled.
When modifying the ExportSnapshot class that ships with HBase, start from the source code of the matching HBase version, because the package path of ExportSnapshot has changed between versions.
When transferring snapshot files across clusters, it is recommended to use BDR, a Cloudera Enterprise feature, to copy the files between clusters.