Start University Data Again - HBase Article - Day 55: Brief Talk about Bloom Filter, HBase Read and Write, HBase HA, and MapReduce Reading and Writing HBase Data
Brief Talk about Bloom Filter
Summary:
The Bloom Filter was proposed by Bloom in 1970. It is essentially a long binary vector together with a series of random mapping (hash) functions. A Bloom Filter can be used to test whether an element is in a set. Its advantage is that its space efficiency and query time are far better than those of general algorithms; its disadvantages are a certain false positive rate and the difficulty of deleting elements.
In computer science we often trade time for space or space for time, improving one dimension at the expense of the other. The Bloom Filter introduces a third factor beyond time and space: the error rate. When a Bloom Filter is used to determine whether an element belongs to a set, there is a certain error rate: it may mistake an element that does not belong to the set for one that does (a false positive), but it will never mistake an element that does belong to the set for one that does not (no false negatives). By tolerating this small number of errors, the Bloom Filter saves a great deal of storage space.
Its usage is easy to understand. Take HBase as an example: we already know that rowkeys are stored in HFiles. To query a rowkey across a series of HFiles, we can use a Bloom Filter to quickly determine whether the rowkey could be in a given HFile, thereby filtering out most of the HFiles and reducing the number of blocks that need to be scanned.
Diagram explanation:
The BloomFilter is critical to HBase's random read performance. For get operations and some scan operations it can rule out HFiles that cannot contain the target data, reducing the number of actual IO operations and improving random read performance. Here is a brief introduction: a Bloom Filter filters with a bit array in which every bit is initially 0, as shown in the following figure:
Suppose there is a set S = {x1, x2, ..., xn}. The Bloom Filter uses k independent hash functions to map each element of the set to a position in the range {1, ..., m}, and sets the bit at each mapped position in the bit array to 1. For example, if element x1 is mapped to position 8 by one of the hash functions, the 8th bit of the bit array is set to 1. In the following figure, set S has only two elements, x and y, which are mapped by three hash functions to positions (0, 3, 6) and (4, 7, 10) respectively, and the corresponding bits are set to 1:
Now, to determine whether another element is in the set, map it with the same three hash functions and check the bits at the corresponding positions: if any of them is 0, the element is definitely not in the set; otherwise it may be in the set. The following figure shows that z is definitely not in the set {x, y}:
As we can see from the above, a Bloom Filter has two important parameters (a minimal sketch of how they work together follows the list below):
- Number of hash functions
- Size of bit array
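To make these two parameters concrete, here is a minimal, self-contained sketch of a Bloom filter in Java. It is illustrative only and is not HBase's internal implementation; the values m = 64 and k = 3 and the double-hashing scheme built on hashCode are assumptions chosen for brevity.

```java
import java.util.BitSet;

// A minimal Bloom filter sketch: an m-bit array plus k hash functions.
public class SimpleBloomFilter {
    private final BitSet bits; // the bit array, all zeros initially
    private final int m;       // size of the bit array
    private final int k;       // number of hash functions

    public SimpleBloomFilter(int m, int k) {
        this.bits = new BitSet(m);
        this.m = m;
        this.k = k;
    }

    // Derive k hash values from two base hashes (double hashing) - an assumption for brevity.
    private int hash(String element, int i) {
        int h1 = element.hashCode();
        int h2 = h1 >>> 16;
        return Math.floorMod(h1 + i * h2, m);
    }

    // Set the k mapped bits to 1 when an element is added.
    public void add(String element) {
        for (int i = 0; i < k; i++) {
            bits.set(hash(element, i));
        }
    }

    // If any mapped bit is 0, the element is definitely absent;
    // if all are 1, it may be present (false positives are possible).
    public boolean mightContain(String element) {
        for (int i = 0; i < k; i++) {
            if (!bits.get(hash(element, i))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        SimpleBloomFilter filter = new SimpleBloomFilter(64, 3);
        filter.add("x");
        filter.add("y");
        System.out.println(filter.mightContain("x")); // true
        System.out.println(filter.mightContain("z")); // very likely false
    }
}
```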
The Role of Bloom Filter in HBase
Blocks related to the Bloom Filter in an HFile:
- Scanned Block Section (read while scanning the HFile): Bloom Block
- Load-on-open-section (loaded into memory when the RegionServer opens the HFile): BloomFilter Meta Block, Bloom Index Block
- Bloom Block: the Bloom data block, which stores the Bloom filter's bit array
- Bloom Index Block: the index of the Bloom data blocks
- BloomFilter Meta Block: metadata about the Bloom data blocks from the HFile's perspective, such as their size and number
Each HFile in HBase has a corresponding bit array. When a KeyValue is written to the HFile, its rowkey is first passed through several hash functions and the corresponding bits in the array are set to 1. When a get request arrives, the requested rowkey is hashed in the same way; if any of the corresponding bits is 0, the data requested by the get is not in this HFile.
The Bloom Blocks in an HFile store the bit arrays described above. When an HFile is large it contains many Data Blocks and many KeyValues, so many rowkeys need to be mapped into the bit array, which makes the bit array, and therefore the Bloom Block, very large. The Bloom Index Block exists to solve this problem: an HFile can contain multiple Bloom Blocks (bit arrays), each covering a contiguous range of rowkeys. When a rowkey is queried, the Bloom Index Block (already in memory) is used to locate the right Bloom Block, which is then loaded into memory to do the filtering.
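Bloom filters in HBase are enabled per column family. The following is a minimal sketch using the HBase 1.x Java client; the table name, column family, and ZooKeeper quorum are just the example values used elsewhere in this article, and the table is assumed not to exist yet.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.regionserver.BloomType;

public class CreateTableWithBloom {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        conf.set("hbase.zookeeper.quorum", "master:2181,node1:2181,node2:2181");

        try (Connection conn = ConnectionFactory.createConnection(conf);
             Admin admin = conn.getAdmin()) {
            HTableDescriptor table = new HTableDescriptor(TableName.valueOf("students"));
            HColumnDescriptor cf = new HColumnDescriptor("info");
            // ROW builds the Bloom filter over rowkeys;
            // ROWCOL would index rowkey + column qualifier instead.
            cf.setBloomFilterType(BloomType.ROW);
            table.addFamily(cf);
            admin.createTable(table);
        }
    }
}
```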
Reading and Writing of Hbase
HBase read-write process:
Writing process:
First, a write request is initiated by the client (HBase shell or Java API). The client connects to ZooKeeper and looks up the location of the meta table, which records the range of rowkeys stored in each region of each table; suppose the meta table is on node1. The client then connects to node1 and compares the rowkey of the data to be written against the meta data to find the RegionServer that hosts the target region, say node2. The client connects to the RegionServer on node2 and writes the data (a Put object): the data is first written to the HLog (WAL) on that RegionServer, then to the MemStore of the target Region, and the request returns as soon as the MemStore write completes. The whole write involves no HMaster. When the data in a MemStore reaches 128 MB, a flush is triggered and a StoreFile is written out.
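From the client's perspective, the whole chain above (ZooKeeper → meta → target RegionServer) is hidden behind the connection object. A minimal write sketch with the HBase 1.x client API might look like this; the rowkey and value are hypothetical, and the table, column family, and quorum are the ones used elsewhere in this article.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class PutExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        conf.set("hbase.zookeeper.quorum", "master:2181,node1:2181,node2:2181");

        // The connection looks up meta via ZooKeeper and routes the Put
        // to the RegionServer that owns this rowkey's region.
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("students"))) {
            Put put = new Put(Bytes.toBytes("1500100001")); // hypothetical rowkey
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("clazz"), Bytes.toBytes("class01"));
            table.put(put); // written to the RegionServer's HLog, then its MemStore
        }
    }
}
```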
Reading process:
The read path is similar to the write path. Whether reading or writing, the client combines ZooKeeper and the meta table with the rowkey being read or written to locate the RegionServer that owns that rowkey, say node1. The client connects to the RegionServer on node1, which looks for the data first in the BlockCache, then in the MemStore if it is not found, and finally in the HFiles. If the column family is set with IN_MEMORY=true, the data that is found is cached. To speed up subsequent read and write requests, the client caches the location of the meta table after the first lookup.
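The read path is just as transparent to the client: a single Get triggers the BlockCache → MemStore → HFile lookup described above on the RegionServer. A minimal sketch, under the same assumptions as the write example:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class GetExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        conf.set("hbase.zookeeper.quorum", "master:2181,node1:2181,node2:2181");

        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("students"))) {
            // The RegionServer checks BlockCache, then MemStore, then HFiles
            // (Bloom filters let it skip HFiles that cannot contain this rowkey).
            Result result = table.get(new Get(Bytes.toBytes("1500100001"))); // hypothetical rowkey
            String clazz = Bytes.toString(result.getValue(Bytes.toBytes("info"), Bytes.toBytes("clazz")));
            System.out.println(clazz);
        }
    }
}
```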
Hbase's HA (High Availability)
HMaster HA startup
Start another HMaster on node1 or node2: run cd /usr/local/soft/hbase-1.4.6/bin and then ./hbase-daemon.sh start master
Mapreduce reads and writes Hbase data
Note:
If running the code reports that the main class or a dependency cannot be found,
add the following plugins to pom.xml so that the package phase builds a jar with its dependencies included.
```xml
<build>
    <plugins>
        <!-- Java compiler -->
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-compiler-plugin</artifactId>
            <version>3.1</version>
            <configuration>
                <source>1.8</source>
                <target>1.8</target>
            </configuration>
        </plugin>
        <!-- Jar-with-dependencies plugin -->
        <plugin>
            <artifactId>maven-assembly-plugin</artifactId>
            <configuration>
                <descriptorRefs>
                    <descriptorRef>jar-with-dependencies</descriptorRef>
                </descriptorRefs>
            </configuration>
            <executions>
                <execution>
                    <id>make-assembly</id>
                    <phase>package</phase>
                    <goals>
                        <goal>single</goal>
                    </goals>
                </execution>
            </executions>
        </plugin>
    </plugins>
</build>
```
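With these plugins in place, running the usual mvn clean package should produce a *-jar-with-dependencies.jar under target/ (the exact name depends on your artifactId and version); that is the jar submitted with hadoop jar in the examples below.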
Read data from Hbase to HDFS
```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;

public class MRReadHBase {
    // Read the HBase students table and count the number of students in each class
    public static class MyMapper extends TableMapper<Text, IntWritable> {
        @Override
        protected void map(ImmutableBytesWritable key, Result value,
                           Mapper<ImmutableBytesWritable, Result, Text, IntWritable>.Context context)
                throws IOException, InterruptedException {
            // The key passed into the map task is the rowkey
            String rowkey = Bytes.toString(key.get());
            // The value passed into the map task is one row of HBase data;
            // the rowkey can also be obtained through getRow
            byte[] row = value.getRow();
            String clazz = Bytes.toString(value.getValue("info".getBytes(), "clazz".getBytes()));
            context.write(new Text(clazz), new IntWritable(1));
        }
    }

    public static class MyReduce extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values,
                              Reducer<Text, IntWritable, Text, IntWritable>.Context context)
                throws IOException, InterruptedException {
            int cnt = 0;
            for (IntWritable value : values) {
                cnt++;
            }
            context.write(key, new IntWritable(cnt));
        }
    }

    public static void main(String[] args) throws IOException, InterruptedException, ClassNotFoundException {
        Configuration conf = new Configuration();
        conf.set("hbase.zookeeper.quorum", "master:2181,node1:2181,node2:2181");

        Job job = Job.getInstance(conf);
        job.setJobName("MRReadHBase");
        job.setJarByClass(MRReadHBase.class);

        // Configure the Map task using TableMapReduceUtil
        TableMapReduceUtil.initTableMapperJob(
                "students"
                , new Scan()
                , MyMapper.class
                , Text.class
                , IntWritable.class
                , job
        );

        // Configure the Reduce task
        job.setReducerClass(MyReduce.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Configure the output path
        FileOutputFormat.setOutputPath(job, new Path("/mrHBase01"));

        job.waitForCompletion(true);
    }
}
```
Read data from HDFS and write it to Hbase
```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.client.Mutation;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableReducer;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

import java.io.IOException;

// Read the score.txt file, calculate the total score for each student and write the result to HBase
public class MRWriteHBase {
    public static class MyMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        @Override
        protected void map(LongWritable key, Text value,
                           Mapper<LongWritable, Text, Text, IntWritable>.Context context)
                throws IOException, InterruptedException {
            String[] splits = value.toString().split(",");
            String rowkey = splits[0];
            int score = Integer.parseInt(splits[2]);
            context.write(new Text(rowkey), new IntWritable(score));
        }
    }

    public static class WriteHBase extends TableReducer<Text, IntWritable, NullWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values,
                              Reducer<Text, IntWritable, NullWritable, Mutation>.Context context)
                throws IOException, InterruptedException {
            int sum = 0; // total score
            for (IntWritable value : values) {
                sum += value.get();
            }
            // Use the map output key (the student id) as the rowkey;
            // Bytes.toBytes(key.toString()) avoids the padding bytes returned by Text.getBytes()
            Put put = new Put(Bytes.toBytes(key.toString()));
            put.addColumn("cf1".getBytes(), "ss".getBytes(), (sum + "").getBytes());
            context.write(NullWritable.get(), put);
        }
    }

    public static void main(String[] args) throws IOException, InterruptedException, ClassNotFoundException {
        Configuration conf = new Configuration();
        conf.set("hbase.zookeeper.quorum", "master:2181,node1:2181,node2:2181");

        Job job = Job.getInstance(conf);
        job.setJobName("MRWriteHBase");
        job.setJarByClass(MRWriteHBase.class);

        // Configure the Map task
        job.setMapperClass(MyMapper.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);

        // Configure the Reduce task using TableMapReduceUtil
        TableMapReduceUtil.initTableReducerJob(
                "sum_score"
                , WriteHBase.class
                , job
        );

        // Configure the input path
        FileInputFormat.addInputPath(job, new Path("/mrHBaseInput1"));

        job.waitForCompletion(true);
    }

    /** Steps to execute after writing the code
     * Create the table in HBase: create 'sum_score','cf1'
     * Upload score.txt to /mrHBaseInput1
     * Package the project and upload the jar
     * hadoop jar HBase-1.0-jar-with-dependencies.jar com.shujia.Demo6MRWriteHBase
     */
}
```
Read data from Hbase to Hbase
```java
package com.tiand7;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.mapreduce.TableReducer;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;

import java.io.IOException;

// Read the name column of the students table and write it to another HBase table
public class MRfromHbase {
    public static class Mymap extends TableMapper<Text, Text> {
        @Override
        protected void map(ImmutableBytesWritable key, Result value, Context context)
                throws IOException, InterruptedException {
            String rowkey = Bytes.toString(key.get());
            String name = Bytes.toString(value.getValue("info".getBytes(), "name".getBytes()));
            context.write(new Text(rowkey), new Text(name));
        }
    }

    public static class MyReduce extends TableReducer<Text, Text, NullWritable> {
        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            // Bytes.toBytes(...) avoids the padding bytes returned by Text.getBytes()
            Put put = new Put(Bytes.toBytes(key.toString()));
            for (Text value : values) {
                put.addColumn("info".getBytes(), "name".getBytes(), Bytes.toBytes(value.toString()));
            }
            context.write(NullWritable.get(), put);
        }
    }

    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        Configuration conf = new Configuration();
        conf.set("hbase.zookeeper.quorum", "master:2181,node1:2181,node2:2181");

        Job job = Job.getInstance(conf);
        job.setJobName("MRfromHbase");
        job.setJarByClass(MRfromHbase.class);

        // map
        TableMapReduceUtil.initTableMapperJob("students", new Scan(), Mymap.class, Text.class, Text.class, job);
        // reduce
        TableMapReduceUtil.initTableReducerJob("nametip", MyReduce.class, job);

        job.waitForCompletion(true);
    }
}
```
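As with the HDFS-to-HBase job above, the output table presumably needs to exist before the job runs (for example create 'nametip','info' in the HBase shell), and the job is packaged and submitted with hadoop jar in the same way.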
Previous Chapter - Hbase Chapter - Day 54: Introduction to Hbase, shell, and filters
Next Chapter - To be updated