BitMap implementation principle

In Java, an int occupies 32 bits (4 bytes). An array such as new int[32] therefore takes 32 * 32 bits of memory to hold 32 numbers. If we instead let each bit of an int mark the presence of one number, a single int is enough to represent 32 numbers. With large amounts of data, this saves a great deal of memory.

Specific ideas:

Since an int takes 4 bytes, that is, 4 * 8 = 32 bits, we only need an int array of length 1 + N/32, for example int[] tmp = new int[1 + N/32], where N is the upper bound of the values to be stored. Each element of tmp occupies 32 bits and can therefore cover 32 consecutive decimal numbers, which gives the following BitMap table:

tmp[0]: can represent 0 ~ 31

tmp[1]: can represent 32 ~ 63

tmp[2]: can represent 64 ~ 95

.......

Next, let's see how to map a decimal number to its corresponding bit:

Assume the 4 billion ints are 6, 3, 8, 32, 36, .... Then 3, 6 and 8 set bits 3, 6 and 8 of tmp[0], while 32 and 36 set bits 0 and 4 of tmp[1].

How do we determine which element of the tmp array a given int belongs to? Simply divide it by 32 and keep the integer part. For example, 8 / 32 = 0, so 8 lives in tmp[0]. And how do we know which of the 32 bits of tmp[0] stands for 8? Take the number modulo 32. For example, 8 % 32 = 8, so the integer 8 is stored in bit 8 of tmp[0], counting from the right and starting at 0.
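The division and modulo steps above can be sketched in a few lines of Java. This is only an illustration: the class name BitLocationDemo and its helper methods are made up for this example and do not come from the source code below, and wordsNeeded rounds up instead of using the 1 + N/32 rule.

```java
// Minimal sketch: locating a number inside an int-based bitmap.
// BitLocationDemo and its method names are illustrative assumptions.
class BitLocationDemo {

    // Index of the array element that holds n, same as n / 32 for n >= 0.
    static int wordIndex(int n) {
        return n >> 5;
    }

    // Position of n inside that element, counted from the right starting at 0,
    // same as n % 32 for n >= 0.
    static int bitOffset(int n) {
        return n & 31;
    }

    // Number of ints needed to cover the values 0 .. count-1 (rounded up).
    static int wordsNeeded(int count) {
        return (count + 31) >>> 5;
    }

    public static void main(String[] args) {
        int[] samples = {6, 3, 8, 32, 36};
        int[] tmp = new int[wordsNeeded(64)]; // two ints cover 0 .. 63
        for (int n : samples) {
            tmp[wordIndex(n)] |= 1 << bitOffset(n);
            System.out.println(n + " -> tmp[" + wordIndex(n) + "], bit " + bitOffset(n));
        }
        System.out.println("tmp[0] = " + Integer.toBinaryString(tmp[0]));
        System.out.println("tmp[1] = " + Integer.toBinaryString(tmp[1]));
    }
}
```

With these five sample values, tmp[0] ends up as binary 101001000 (bits 3, 6 and 8) and tmp[1] as binary 10001 (bits 0 and 4).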

BitMap source code

public class BitMap2 {
    private final long length;
    // One bit per value; an instance field so that separate bitmaps do not share storage
    private final int[] bitsMap;
    // BIT_VALUE[i] has only bit i set, i.e. BIT_VALUE[i] == 1 << i
    private static final int[] BIT_VALUE = {0x00000001, 0x00000002, 0x00000004, 0x00000008, 0x00000010, 0x00000020,
            0x00000040, 0x00000080, 0x00000100, 0x00000200, 0x00000400, 0x00000800, 0x00001000, 0x00002000, 0x00004000,
            0x00008000, 0x00010000, 0x00020000, 0x00040000, 0x00080000, 0x00100000, 0x00200000, 0x00400000, 0x00800000,
            0x01000000, 0x02000000, 0x04000000, 0x08000000, 0x10000000, 0x20000000, 0x40000000, 0x80000000};

    public BitMap2(long length) {
        this.length = length;
        /*
         * Compute the required array size from the length:
         * when length % 32 == 0, size = length / 32
         * when length % 32 > 0,  size = length / 32 + 1
         */
        bitsMap = new int[(int) (length >> 5) + ((length & 31) > 0 ? 1 : 0)];
    }

    /**
     * @param n the value to set, 0 <= n < length
     */
    public void setN(long n) {
        if (n < 0 || n >= length) {
            throw new IllegalArgumentException("value " + n + " is out of range!");
        }
        // Index of the array element that holds n, equivalent to n / 32.
        // Shift the long first and cast afterwards so that large values are not truncated.
        int index = (int) (n >> 5);
        // Offset of n inside that element (the remainder), equivalent to n % 32
        int offset = (int) (n & 31);
        /*
         * Equivalent to
         * int bits = bitsMap[index];
         * bitsMap[index] = bits | BIT_VALUE[offset];
         * For example, when n=3, set bit 3 to 1 (counting from 0; bitsMap[0] represents 0-31, one bit per number, from right to left)
         * bitsMap[0] = 00000000 00000000 00000000 00000000 | 00000000 00000000 00000000 00001000 = 00000000 00000000 00000000 00001000
         * Namely: bitsMap[0] = 0 | 0x00000008 = 8
         *
         * For example, when n=4, set bit 4 to 1
         * bitsMap[0] = 00000000 00000000 00000000 00001000 | 00000000 00000000 00000000 00010000 = 00000000 00000000 00000000 00011000
         * Namely: bitsMap[0] = 8 | 0x00000010 = 24
         */
        bitsMap[index] |= BIT_VALUE[offset];
    }

    /**
     * Check whether the value n exists
     * @return 1: exists, 0: does not exist
     */
    public int isExist(long n) {
        if (n < 0 || n >= length) {
            throw new IllegalArgumentException("value " + n + " is out of range!");
        }
        int index = (int) (n >> 5);
        int offset = (int) (n & 31);
        int bits = bitsMap[index];
        // System.out.println("n="+n+",index="+index+",offset="+offset+",bits="+Integer.toBinaryString(bitsMap[index]));
        // Isolate the bit, then shift it down to position 0 so the result is 0 or 1
        return (bits & BIT_VALUE[offset]) >>> offset;
    }
}

BitMap application

1. A small variant of BitMap: 2-BitMap. Consider this scenario: find the integers that appear only once among 300 million integers, when memory is too small to hold all 300 million integers.

For this scenario we can use a 2-BitMap, that is, allocate 2 bits for each integer and use the combinations of 0 and 1 to encode different states. For example, 00 means the integer has not appeared, 01 means it has appeared once, and 11 means it has appeared more than once, so we can tell repeated and non-repeated integers apart. The memory required is twice that of a normal BitMap: assuming the values fall in the range 0 to 300 million, it is 300 million * 2 / 8 / 1024 / 1024 ≈ 71.5 MB.

The specific process is as follows:

Scan the 300 million integers and build the BitMap as you go: for each integer, check its position in the BitMap; if it is 00, change it to 01; if it is 01, change it to 11; if it is 11, leave it unchanged. Once all 300 million integers have been scanned, the BitMap is complete. Finally, walk the BitMap and output the integers whose bits are 01, which are the non-repeated ones (those marked 11 are the repeated ones).
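The state transitions 00 -> 01 -> 11 can be sketched as follows. This is a minimal illustration under assumptions, not the article's implementation: the class TwoBitMap, its methods and the tiny capacity used in main are all made up for the example, and the values are assumed to be non-negative and smaller than the capacity.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of a bitmap that spends 2 bits per value.
// TwoBitMap and its method names are illustrative assumptions.
class TwoBitMap {
    private static final int UNSEEN = 0b00;
    private static final int ONCE = 0b01;
    private static final int MANY = 0b11;

    private final int[] words;  // each int holds 16 two-bit states
    private final int capacity; // values 0 .. capacity-1 are supported

    TwoBitMap(int capacity) {
        this.capacity = capacity;
        this.words = new int[(capacity + 15) >>> 4];
    }

    // Read the 2-bit state stored for n.
    private int state(int n) {
        return (words[n >>> 4] >>> ((n & 15) << 1)) & 0b11;
    }

    // Record one occurrence of n: 00 -> 01, 01 -> 11, 11 stays 11.
    void add(int n) {
        if (n < 0 || n >= capacity) {
            throw new IllegalArgumentException("value " + n + " is out of range");
        }
        int shift = (n & 15) << 1;
        int current = state(n);
        if (current == UNSEEN) {
            words[n >>> 4] |= ONCE << shift;
        } else if (current == ONCE) {
            words[n >>> 4] |= MANY << shift;
        }
    }

    // Collect every value whose state is 01, i.e. seen exactly once.
    List<Integer> seenOnce() {
        List<Integer> result = new ArrayList<>();
        for (int n = 0; n < capacity; n++) {
            if (state(n) == ONCE) {
                result.add(n);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        TwoBitMap map = new TwoBitMap(100);
        for (int value : new int[] {5, 7, 5, 9, 7, 7, 42}) {
            map.add(value);
        }
        System.out.println(map.seenOnce()); // values that appeared exactly once
    }
}
```

Because 01 and 11 differ only in the high bit, the 01 -> 11 transition can be done with a plain OR and never needs to clear a bit.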

2. Phone number deduplication. A file is known to contain some phone numbers, each 8 digits long. Count the number of distinct numbers.

The largest 8-digit number is 99,999,999, so about 100 million bits are needed, which is roughly 12 MB. (That is, each number from 0 to 99,999,999 corresponds to one bit, so 100 million bits / 8 = 12.5 million bytes, and a small amount of memory of about 12 MB is enough to represent every 8-digit phone number.)
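As a rough sketch of this counting approach, the example below swaps in java.util.BitSet from the Java standard library rather than the BitMap2 class above, purely to keep the example short. The class name PhoneNumberCounter and the sample numbers are assumptions made for illustration, and the phone numbers are assumed to have been parsed into ints already.

```java
import java.util.BitSet;

// Sketch: count distinct 8-digit phone numbers with one bit per possible number.
// PhoneNumberCounter is an illustrative name; BitSet stands in for a hand-written bitmap.
class PhoneNumberCounter {
    // 8-digit numbers span 00000000 .. 99999999, i.e. 100 million possible values.
    static final int RANGE = 100_000_000;

    static int countDistinct(int[] numbers) {
        BitSet seen = new BitSet(RANGE);
        for (int number : numbers) {
            if (number < 0 || number >= RANGE) {
                throw new IllegalArgumentException("not an 8-digit number: " + number);
            }
            seen.set(number); // setting the same bit twice has no further effect
        }
        return seen.cardinality(); // number of bits set = number of distinct values
    }

    public static void main(String[] args) {
        // One bit per possible number: 100,000,000 / 8 = 12,500,000 bytes.
        System.out.println("bitmap payload: " + (RANGE / 8) + " bytes");

        int[] numbers = {12345678, 87654321, 12345678, 99999999};
        System.out.println("distinct numbers: " + countDistinct(numbers));
    }
}
```

In a real run the numbers would be streamed from the file one at a time, so only the bitmap itself has to fit in memory.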

BitMap problem

BitMap solves many interview problems and is also used in many real systems, so it is a useful technique to know.

But BitMap also has some limitations, and other BitMap-based algorithms exist to address them.

  • Data collision. For example, when strings are mapped onto a BitMap, different strings can collide on the same bit. You can consider a Bloom Filter for this problem: a Bloom Filter uses multiple hash functions to reduce the probability of collisions.

  • Data is sparse. For example, to store the three values (10, 8887983, 93452134), we need to build a BitMap with a length of 99999999, yet only three values are actually stored, so a great deal of space is wasted. This problem can be addressed by introducing Roaring BitMap.
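To make the Bloom Filter idea from the first point more concrete, here is a minimal sketch built on java.util.BitSet. It rests on several assumptions: the class SimpleBloomFilter is made up for this example, the k bit positions are derived from two hashes (String.hashCode and an FNV-1a style hash) instead of k truly independent hash functions, and the size and hash count are arbitrary. A production filter would choose those parameters from the expected number of elements and the acceptable false-positive rate.

```java
import java.util.BitSet;

// Sketch of a Bloom filter for strings. All names are illustrative assumptions.
class SimpleBloomFilter {
    private final BitSet bits;
    private final int size;      // number of bits in the filter
    private final int hashCount; // number of bit positions set per element

    SimpleBloomFilter(int size, int hashCount) {
        this.bits = new BitSet(size);
        this.size = size;
        this.hashCount = hashCount;
    }

    // FNV-1a style hash over the characters, used as the second hash.
    private static int secondHash(String s) {
        int h = 0x811C9DC5;
        for (int i = 0; i < s.length(); i++) {
            h ^= s.charAt(i);
            h *= 16777619;
        }
        return h;
    }

    // i-th bit position for s, derived as h1 + i * h2 and reduced to 0 .. size-1.
    private int position(String s, int i) {
        return Math.floorMod(s.hashCode() + i * secondHash(s), size);
    }

    void add(String s) {
        for (int i = 0; i < hashCount; i++) {
            bits.set(position(s, i));
        }
    }

    // false means definitely absent; true means possibly present (false positives can occur).
    boolean mightContain(String s) {
        for (int i = 0; i < hashCount; i++) {
            if (!bits.get(position(s, i))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        SimpleBloomFilter filter = new SimpleBloomFilter(1024, 3);
        filter.add("alice");
        filter.add("bob");
        System.out.println(filter.mightContain("alice")); // true: added elements are always reported
        System.out.println(filter.mightContain("carol")); // usually false, but a false positive is possible
    }
}
```

An element is reported as present only if all of its bit positions are set, so a single colliding bit is no longer enough to produce a wrong answer, which is how the extra hash functions reduce the collision probability.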

Tags: Java

Posted on Tue, 09 Jun 2020 00:07:18 -0400 by mguili