Redis series - bloom filter

When we use Redis as a cache, one problem we need to consider is cache penetration: requests for keys that exist in neither the cache nor the backing store, so every such request falls through to the database. There are several kinds of cache penetration, and each calls for a different solution; the bloom filter is one of them. A bloom filter is mainly used to determine whether an element belongs to a large set. Once the set grows large enough, storing every element and searching through them consumes a lot of storage and performs poorly. This is exactly the scenario the bloom filter is designed for.
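
To make the idea concrete, here is a minimal sketch of how a filter can sit in front of the cache. The class name PenetrationGuard is hypothetical, and the maps and the Predicate are in-memory stand-ins (assumptions for illustration) for Redis, the database, and a bf.exists call; none of this is RedisBloom API.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;
import java.util.function.Predicate;

// Hypothetical sketch: all names and in-memory stand-ins are assumptions.
final class PenetrationGuard {
    private final Predicate<String> mightContain;               // stand-in for bf.exists
    private final Map<String, String> cache = new HashMap<>();  // stand-in for the Redis cache
    private final Map<String, String> database;                 // stand-in for the backing store
    int databaseHits = 0;                                       // counts lookups that reach the database

    PenetrationGuard(Predicate<String> mightContain, Map<String, String> database) {
        this.mightContain = mightContain;
        this.database = database;
    }

    Optional<String> get(String key) {
        // The filter says "definitely absent": skip both cache and database.
        if (!mightContain.test(key)) {
            return Optional.empty();
        }
        String cached = cache.get(key);
        if (cached != null) {
            return Optional.of(cached);
        }
        // Cache miss for a key that may exist: query the database once.
        databaseHits++;
        String value = database.get(key);
        if (value != null) {
            cache.put(key, value);
        }
        return Optional.ofNullable(value);
    }
}
```

With this arrangement, a request for a key the filter has never seen returns immediately, and only keys that may exist are allowed through to the cache and the database.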

Version Description: This article is based on Redis-5.0.8 and RedisBloom-2.0.0.

1. Preparation

Redis-5.0.8 official address: Redis-5.0.8
RedisBloom-2.0.0 official address: RedisBloom-2.0.0
After downloading the RedisBloom source package, decompress it and run make in the root directory of the decompressed source to generate redisbloom.so.

unzip RedisBloom-2.0.0.zip
cd RedisBloom-2.0.0
make

Configure an alias for the startup command:

vim ~/.bashrc

Add the alias configuration:

alias redis-server="/usr/local/src/redis-5.0.8/src/redis-server /usr/local/src/redis-5.0.8/redis.conf --loadmodule /usr/local/src/RedisBloom-2.0.0/redisbloom.so"

Reload the shell configuration:

source ~/.bashrc

Start Redis with the bloom filter module loaded:

redis-server

The bloom filter in Redis provides two basic commands, bf.add and bf.exists, along with bf.madd for adding several elements at once. For example:

127.0.0.1:6379> del users
(integer) 1
127.0.0.1:6379> bf.madd users u001 u002 u003 u004
1) (integer) 1
2) (integer) 1
3) (integer) 1
4) (integer) 1
127.0.0.1:6379> bf.madd users u007
1) (integer) 1
127.0.0.1:6379> bf.exists users u009
(integer) 0
127.0.0.1:6379> bf.exists users u007
(integer) 1

In addition, the bloom filter has two important parameters: the misjudgment (false positive) rate and the expected capacity. They can be specified with bf.reserve when the filter is created; calling bf.reserve on a key that already exists returns an error. For example:

127.0.0.1:6379> bf.reserve books 0.01 10000
OK
127.0.0.1:6379> bf.add books java
(integer) 1
127.0.0.1:6379> bf.reserve books 0.01 10000
(error) ERR item exists

2. Testing

Here we use the Redisson client to test. The test creates a bloom filter with a given allowable error rate and expected size, generates a set of random strings, puts half of them into the filter, and uses the other half to measure the error rate. Example code:

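import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.TimeUnit;

import org.redisson.Redisson;
import org.redisson.api.RBloomFilter;
import org.redisson.api.RedissonClient;
import org.redisson.config.Config;

public class BloomFilterMain {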
    static final String host = "127.0.0.1";
    static final int port = 6379;
    static final String pass = "abcd@1234";
    static final String address = "redis://" + host + ":" + port;

    public static void main(String[] args) {
        // Test with Redisson as the client
        redission();
    }

    private static void redission() {
        Config config = new Config();
        config.useSingleServer()
                .setAddress(address)
                .setPassword(pass);

        RedissonClient redisson = Redisson.create(config);

        RBloomFilter<String> strings = redisson.getBloomFilter("strings");

        System.out.println("strings.delete(): " + strings.delete());
        // The specified expected capacity is 10000, and the misjudgment rate is 0.01, i.e. 1%
        System.out.println("strings.tryInit(): " + strings.tryInit(10000L, 0.01D));

        // Generate 20000 random strings of length 12.
        // Half of them are put into the filter; the other half is used to test the error rate.
        List<String> total = StringGenerator.genStrings(20000, 12);
        List<String> ins = total.subList(0, total.size() / 2);
        List<String> outs = total.subList(total.size() / 2, total.size());

        for (int i = 0; i < ins.size(); i++) {
            String str = ins.get(i);
            System.out.println(str + " added: " + strings.add(str));
        }

        // Count strings that were never added but are reported as present
        int falseCnt = 0;
        for (int i = 0; i < outs.size(); i++) {
            if (strings.contains(outs.get(i))) {
                falseCnt++;
                System.out.println("false: " + outs.get(i));
            }
        }

        // Misjudgment (false positive) rate
        System.out.println("false probe: " + ((double) falseCnt / outs.size()));

        redisson.shutdown(1L, 1L, TimeUnit.SECONDS);
    }

    // A random string generator that generates a specified number of strings of a specified length
    private static class StringGenerator {
        private static String chars;

        static {
            StringBuilder charsBuilder = new StringBuilder();
            for (int i = 0; i < 26; i++) {
                charsBuilder.append((char) ('a' + i));
            }
            chars = charsBuilder.toString();
        }

        public static List<String> genStrings(int count, int len) {
            List<String> r = new ArrayList<>();

            for (int i = 0; i < count; i++) {
                StringBuilder builder = new StringBuilder();
                for (int j = 0; j < len; j++) {
                    int index = ThreadLocalRandom.current().nextInt(chars.length());
                    builder.append(chars.charAt(index));
                }
                r.add(builder.toString());
            }

            return r;
        }
    }
}

Result of the run:

... ...
false probe: 0.0222

3. Principle

The bloom filter records the fingerprint of each element in a bit array. When an element is added, the filter applies a set of hash functions to it, takes each hash value modulo the length of the bit array to get a slot, and sets the bit in that slot to 1. To check whether an element is contained, the filter hashes it with the same set of hash functions, takes each hash value modulo the length of the bit array to get a set of slots, and checks whether all of those bits are 1. If they are, the element may be in the filter; otherwise the element is definitely not in the filter.
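
The add and lookup steps described above can be sketched in a few lines of Java. This is an illustrative toy, not the RedisBloom or Redisson implementation: the class name SimpleBloomFilter is hypothetical, and the seeded FNV-1a hash is simply an assumed way to obtain several hash functions.

```java
import java.nio.charset.StandardCharsets;
import java.util.BitSet;

// Illustrative sketch only; names and the hash family are assumptions.
final class SimpleBloomFilter {
    private final BitSet bits;
    private final int size;       // length of the bit array
    private final int hashCount;  // number of hash functions

    SimpleBloomFilter(int size, int hashCount) {
        this.bits = new BitSet(size);
        this.size = size;
        this.hashCount = hashCount;
    }

    // Seeded FNV-1a hash, reduced modulo the bit array length to get a slot.
    private int slot(String item, int seed) {
        int h = 0x811C9DC5 ^ (seed * 0x9E3779B9);
        for (byte b : item.getBytes(StandardCharsets.UTF_8)) {
            h ^= (b & 0xFF);
            h *= 16777619;
        }
        return Math.floorMod(h, size);
    }

    // Set the bit in every slot computed for the item.
    void add(String item) {
        for (int i = 0; i < hashCount; i++) {
            bits.set(slot(item, i));
        }
    }

    // True means "may be present"; false means "definitely not present".
    boolean mightContain(String item) {
        for (int i = 0; i < hashCount; i++) {
            if (!bits.get(slot(item, i))) {
                return false;
            }
        }
        return true;
    }
}
```

An element that has been added always passes mightContain, because all of its bits were set by add; an element that was never added passes only if other elements happen to have set every one of its slots.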

There are two notable features of the bloom filter:

  1. If an element is judged to exist, it may or may not actually exist;
  2. If an element is judged not to exist, it definitely does not exist.

Referring to the figure below, suppose that two elements, k1 and k2, are put into a filter that uses three hash functions to compute each element's fingerprint and has a bit array of length 16. For k1, the filter computes slots 4, 5 and 9 to record its fingerprint; for k2, it computes slots 7, 9 and 14. Now suppose we check whether an element k3 exists, and this set of hash functions maps it to slots 4, 9 and 14. All three bits are already 1, so the filter reports that k3 exists, although those bits were actually set by the fingerprints of k1 and k2. This demonstrates the first point.
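
The k1, k2, k3 example can be replayed directly in code. In the sketch below the slot numbers are hard-coded from the example instead of being computed by real hash functions; the class name FalsePositiveDemo and the extra element k4 (slots 2, 5 and 9) are hypothetical additions used only to show the "definitely absent" case.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch: slots are fixed by hand, not produced by hash functions.
final class FalsePositiveDemo {
    static final int SIZE = 16; // length of the bit array in the example
    static final Map<String, int[]> SLOTS = new HashMap<>();

    static {
        SLOTS.put("k1", new int[] {4, 5, 9});
        SLOTS.put("k2", new int[] {7, 9, 14});
        SLOTS.put("k3", new int[] {4, 9, 14});
        SLOTS.put("k4", new int[] {2, 5, 9}); // hypothetical element, not in the original example
    }

    private final BitSet bits = new BitSet(SIZE);

    void add(String key) {
        for (int s : SLOTS.get(key)) {
            bits.set(s);
        }
    }

    boolean mightContain(String key) {
        for (int s : SLOTS.get(key)) {
            if (!bits.get(s)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        FalsePositiveDemo filter = new FalsePositiveDemo();
        filter.add("k1");
        filter.add("k2");
        // k3 was never added, but slots 4, 9 and 14 are all set: false positive.
        System.out.println("k3 might exist: " + filter.mightContain("k3"));
        // k4 needs slot 2, which nobody set: definitely absent.
        System.out.println("k4 might exist: " + filter.mightContain("k4"));
    }
}
```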

From the above example, we can see that making the bloom filter more accurate requires more slots (a larger bit array) and a suitable number of hash functions. However, a larger bit array means more memory overhead, and more hash functions mean more computation per operation; beyond a certain point, extra hash functions also fill the bit array faster and raise the error rate again. In practice, the bloom filter calculates an appropriate number of slots and hash functions from the expected capacity and the allowable error rate. We can also use bloom-calculator to calculate the memory consumption and the number of hash functions for a given expected capacity and allowed misjudgment rate.
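
As a rough guide to what such a calculation involves, the sketch below applies the standard textbook formulas: for expected capacity n and error rate p, the bit count is m = -n ln(p) / (ln 2)^2 and the number of hash functions is k = (m / n) ln 2. The class name BloomSizing is hypothetical, and RedisBloom, Redisson, or any particular calculator may round or size things differently.

```java
// Illustrative sketch of the textbook sizing formulas; not taken from any library.
final class BloomSizing {
    // m = -n * ln(p) / (ln 2)^2, rounded up to a whole number of bits
    static long optimalBits(long expectedItems, double errorRate) {
        double ln2 = Math.log(2);
        return (long) Math.ceil(-expectedItems * Math.log(errorRate) / (ln2 * ln2));
    }

    // k = (m / n) * ln 2, rounded to the nearest integer and at least 1
    static int optimalHashCount(long expectedItems, long bits) {
        return Math.max(1, (int) Math.round((double) bits / expectedItems * Math.log(2)));
    }

    public static void main(String[] args) {
        long n = 10000;   // expected capacity used in the test above
        double p = 0.01;  // allowed misjudgment rate used in the test above
        long m = optimalBits(n, p);
        int k = optimalHashCount(n, m);
        System.out.println("bits: " + m + ", bytes: " + (m + 7) / 8 + ", hash functions: " + k);
    }
}
```

For the parameters used in the test (capacity 10000, error rate 0.01), these formulas give 95851 bits, or about 12 KB, and 7 hash functions.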

Posted on Sat, 27 Jun 2020 02:42:46 -0400 by methodlessman