0864-7.1.6 - How to migrate data across CDP clusters with HBase snapshots

1. Purpose of this document

This document first uses an HBase snapshot to export the full historical data and restore it on a new HBase cluster. It then modifies the stock ExportSnapshot class so that only the files that changed between two snapshots are exported, which gives incremental backup and restore of HBase across clusters.

  • Test environment

1. CDP 7.1.4 with Kerberos enabled, HBase 2.2.3

2. CDP 7.1.6 with Kerberos enabled, HBase 2.2.3

3. All operations are performed as user ldapuser1

2. Operation steps

2.1 Generate a test table on the CDP 7.1.4 cluster

2.1.1 Use the HBase pe (PerformanceEvaluation) tool to generate a 10 GB table

hbase org.apache.hadoop.hbase.PerformanceEvaluation --compress=SNAPPY --size=10 sequentialWrite 10

The command reports an error if the user does not have permission to create the table.

The permission is granted through the Ranger configuration.

2.1.2 Check the size of the test table on HDFS

hadoop fs -du -h /hbase/data/default/

Because TestTable was created with SNAPPY compression, it occupies only 2.5 GB on HDFS.

2.1.3 Count the rows in TestTable (10485760 rows in total)

count 'TestTable'
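
The row count can be cross-checked with simple arithmetic. The sketch below assumes that PerformanceEvaluation writes 1024 * 1024 rows per GB at its default row size; that figure is an assumption chosen because it matches the count reported above, not something stated in this document.

```shell
# Assumption: the pe tool writes 1024 * 1024 rows per GB at its default row size.
rows_per_gb=$((1024 * 1024))
size_gb=10   # value passed as --size=10

expected_rows=$((size_gb * rows_per_gb))
echo "$expected_rows"   # prints 10485760
```

If the count returned by the HBase shell differs from this number, the table was probably not fully written.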

2.2 Full HBase data backup and restore across clusters through snapshots

2.2.1 Create a snapshot of TestTable

snapshot 'TestTable','TestTable-snapshot1'

2.2.2 Export the TestTable snapshot data

On the command line, use the ExportSnapshot tool that ships with HBase to export the snapshot

hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot -snapshot TestTable-snapshot1 -copy-to hdfs://cdp02:8020/tmp/hbasebackup/TestTable-snapshot1

View the exported snapshot data

hadoop fs -ls /tmp/hbasebackup/TestTable-snapshot1
hadoop fs -du -h /tmp/hbasebackup/TestTable-snapshot1

You can see that exporting a snapshot writes the snapshot metadata and all data files referenced by the snapshot to the .hbase-snapshot and archive directories under the specified target directory.

2.2.3 Copy the snapshot data to the CDP 7.1.6 cluster

Because both clusters use Kerberos authentication, mutual trust must be set up between them before the distcp command can be used (contact the cluster operations staff if needed).

hadoop distcp hdfs://cdp02:8020/tmp/hbasebackup/TestTable-snapshot1/.hbase-snapshot/TestTable-snapshot1 hdfs://cdh3.macro.com:8020/hbase/.hbase-snapshot
hadoop distcp hdfs://cdp02:8020/tmp/hbasebackup/TestTable-snapshot1/archive/data/default/TestTable hdfs://cdh3.macro.com:8020/hbase/archive/data/default

The copy fails with an insufficient-permission error if the user lacks access on the target cluster.

The permission is granted through the Ranger configuration of the CDP 7.1.6 cluster.

View the snapshot data copied to the CDP 7.1.6 cluster

hdfs dfs -ls hdfs://cdh3.macro.com:8020/hbase/.hbase-snapshot
hdfs dfs -ls hdfs://cdh3.macro.com:8020/hbase/archive/data/default
hdfs dfs -du -h hdfs://cdh3.macro.com:8020/hbase/

2.2.4 Restore the TestTable table from the TestTable-snapshot1 snapshot

Grant the required permissions first; otherwise the restore fails with an insufficient-permission error.

list_snapshots
restore_snapshot 'TestTable-snapshot1'

2.2.5 Verify that the restored table matches the data at snapshot time

The table restored from the snapshot has the same row count as the original table at the time of the snapshot, and the table contents match.

2.3 Incremental HBase data backup and restore across clusters through snapshots

2.3.1 Modify the TestTable data

In the hbase shell, update one existing row and add one new row

put 'TestTable','11111111111111111111111111','info0:AppendTest','1111'
put 'TestTable','00000000000000000000000000','info0:AppendTest','00000'

TestTable now contains 10485761 rows

count 'TestTable'

2.3.2 Create a second TestTable snapshot

snapshot 'TestTable','TestTable-snapshot2'

2.3.3 Export the incremental data of the second snapshot

This step exports only the data that changed between TestTable-snapshot1 and TestTable-snapshot2. The ExportSnapshot tool that ships with HBase has no incremental export mode, so its source code is modified here to export only the difference between the two snapshots.
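
The idea behind the incremental export can be sketched with ordinary shell tools: given the list of files referenced by each snapshot, keep only the entries that appear in the newer list. This is a minimal illustration of the file-list comparison, not the modified ExportSnapshot code itself. The function name snapshot_delta and the file names in the lists are hypothetical.

```shell
# Print the entries that appear in the new list but not in the old one.
# Both arguments are plain-text files with one store-file path per line.
snapshot_delta() {
  delta_old_sorted="$(mktemp)"
  delta_new_sorted="$(mktemp)"
  # comm requires sorted input; LC_ALL=C keeps sort and comm consistent.
  LC_ALL=C sort -u "$1" > "$delta_old_sorted"
  LC_ALL=C sort -u "$2" > "$delta_new_sorted"
  # -13 suppresses lines unique to the old list and lines common to both.
  LC_ALL=C comm -13 "$delta_old_sorted" "$delta_new_sorted"
  rm -f "$delta_old_sorted" "$delta_new_sorted"
}

# Hypothetical file lists for illustration only (not real HFile names).
workdir="$(mktemp -d)"
printf '%s\n' 'region-a/info0/hfile-001' 'region-b/info0/hfile-002' \
  > "$workdir/snapshot1.txt"
printf '%s\n' 'region-a/info0/hfile-001' 'region-b/info0/hfile-002' \
  'region-b/info0/hfile-003' > "$workdir/snapshot2.txt"

snapshot_delta "$workdir/snapshot1.txt" "$workdir/snapshot2.txt"
# prints region-b/info0/hfile-003
```

Only the files in this difference need to be copied, which is why the incremental export is much smaller than a full export when few files have changed.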

GitHub source address: https://github.com/javaxsky/hbaseexport

You can download the compiled jar packages and put balancer.jar and hbase-export-1.0-SNAPSHOT.jar in the directory "/opt/cloudera/parcels/CDH/lib/hbase/lib".

Run the following command to export the incremental data between the two snapshots to HDFS

hbase org.hadoop.hbase.dataExport.ExportSnapshot -snapshot TestTable-snapshot2 -copy-to hdfs://cdp02:8020/tmp/hbasebackup/snapshot2-snapshot1/ -snapshot-old TestTable-snapshot1

View the exported data directory

hadoop fs -du -h /tmp/hbasebackup/snapshot2-snapshot1

2.3.4 Copy the exported snapshot files to the CDP 7.1.6 cluster

hadoop distcp hdfs://cdp02:8020/tmp/hbasebackup/snapshot2-snapshot1/.hbase-snapshot/TestTable-snapshot2 hdfs://cdh3.macro.com:8020/hbase/.hbase-snapshot
hadoop distcp hdfs://cdp02:8020/tmp/hbasebackup/snapshot2-snapshot1/archive/data/default/TestTable hdfs://cdh3.macro.com:8020/hbase/archive/data/default
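
The copy steps in 2.2.3 and 2.3.4 follow the same pattern: one distcp for the snapshot metadata and one for the archived data files. The sketch below only prints the two commands for a given export so they can be reviewed before running. The function name print_distcp_commands is hypothetical, and it assumes the table lives in the default namespace, as in this test.

```shell
# Print (do not run) the two distcp commands for one exported snapshot.
# Arguments: export directory on the source cluster, HBase root directory
# on the target cluster, snapshot name, table name.
print_distcp_commands() {
  dc_export_dir="$1"
  dc_target_root="$2"
  dc_snapshot="$3"
  dc_table="$4"
  # Snapshot metadata goes under .hbase-snapshot on the target.
  echo "hadoop distcp ${dc_export_dir}/.hbase-snapshot/${dc_snapshot} ${dc_target_root}/.hbase-snapshot"
  # Data files go under the archive directory of the default namespace.
  echo "hadoop distcp ${dc_export_dir}/archive/data/default/${dc_table} ${dc_target_root}/archive/data/default"
}

print_distcp_commands \
  hdfs://cdp02:8020/tmp/hbasebackup/snapshot2-snapshot1 \
  hdfs://cdh3.macro.com:8020/hbase \
  TestTable-snapshot2 TestTable
```

With the arguments shown, the printed lines are the same two commands used in this step.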

View the data directory after the incremental data has been copied

hdfs dfs -du -h hdfs://cdh3.macro.com:8020/hbase/

2.3.5 Restore TestTable from the snapshot on the command line and verify the data

disable 'TestTable'
list_snapshots
restore_snapshot 'TestTable-snapshot2'
enable 'TestTable'
get 'TestTable','00000000000000000000000000'
count 'TestTable'

3. Summary

An HBase snapshot only records metadata; it does not copy any data.

Exporting an HBase snapshot is performed at the HDFS level, so it places no additional load on the HBase Master and RegionServer services.

The rewritten ExportSnapshot implements incremental export by comparing the file lists of the two snapshots, so only the files that differ need to be exported.

Exporting snapshot data with ExportSnapshot does not inflate the data: the exported size is essentially the same as that of the original HBase table with compression enabled.

When modifying the ExportSnapshot class that ships with HBase, obtain the source code that matches your HBase version. The package path of ExportSnapshot differs between versions.

When transferring snapshot files across clusters, it is recommended to use BDR, a Cloudera Enterprise feature, to copy the files between clusters.

Posted on Tue, 02 Nov 2021 10:28:43 -0400 by PhilGDUK