009 - Generation methods for distributed IDs

  • Purpose
Uniquely identify data and messages in a distributed cluster system.
  • Goals
  1. Global uniqueness: duplicate IDs are not allowed. Since the ID is the sole identifier, this is the most basic requirement.
  2. Trend increasing: MySQL's InnoDB engine uses a clustered index, and most RDBMSs store index data in B-tree structures, so primary keys should be roughly ordered to keep write performance high.
  3. Monotonic increase: the next ID must be greater than the previous one. This is needed for special cases such as transaction version numbers, incremental IM messages, and sorting.
  4. Information security: if IDs are consecutive, malicious users can scrape data easily by requesting the URLs in order. For order numbers it is even more dangerous, because competitors can read off the daily order volume directly. Some scenarios therefore require IDs to be irregular.
  5. Embedded timestamp: it is best for a distributed ID to contain a timestamp, so that its generation time can be read off quickly during development.
  6. High availability: when a user requests a distributed ID, the server must be able to create one in 99.999% of cases.
  7. Low latency: when a user requests a distributed ID, the server must create it quickly.
  8. High QPS: if users send 100,000 ID requests at the same moment, the server must withstand the load and create all 100,000 IDs at once.
  • UUID
  1. Definition: universally unique identifier (UUID). It is a 128-bit value, usually written as an unordered string of 32 hexadecimal digits. According to the standard developed by the Open Software Foundation (OSF), a UUID is generated from inputs such as the Ethernet card (MAC) address, a nanosecond-level timestamp, a chip ID code, and random numbers.
  2. Format: 8-4-4-4-12, 36 characters in total (32 hexadecimal digits and four hyphens), for example a23e4567-e79b-12d3-a456-426655440001 (xxxxxxxx-xxxx-Mxxx-Nxxx-xxxxxxxxxxxx). M can be 1, 2, 3, 4, or 5, denoting five different versions:
Version 1: 0001. Based on time and MAC address. Using the MAC address ensures uniqueness, but it also exposes the MAC address, so privacy is weak.
Version 2: 0010. DCE security UUID. The specification does not describe this version in detail, so there is no specific implementation.
Version 3: 0011. Based on a namespace (MD5). The user specifies a namespace and a string, and the UUID is generated through an MD5 hash. The string itself needs to be unique within the namespace.
Version 4: 0100. Based on random numbers. Although it is random, the probability of a collision is negligible, so this version is used frequently.
Version 5: 0101. Based on a namespace (SHA-1). Similar to version 3, but the hash function is SHA-1.

The highest bits of N encode the UUID variant. Variants exist for compatibility with older UUIDs and to allow for future changes. Several variants are known. Because the UUIDs currently in use are variant 1, N can only be 8, 9, a, or b (binary 1000, 1001, 1010, and 1011 respectively).

variant 0: 0xxx.  Reserved for backward compatibility.
variant 1: 10xx.  Currently in use.
variant 2: 110x.  Reserved for earlier Microsoft GUIDs.
variant 3: 111x.  Reserved for future expansion. Currently not used.
  3. Use scenario

Because a UUID is unique, it can identify an individual object. For example, Alibaba Cloud uses a UUID as the unique ID of each SMS message. It is not well suited to serve as a distributed ID, however, for the following reasons:

 - A distributed ID is generally used as the primary key, and MySQL officially recommends keeping the primary key as short as possible. A UUID is long, so it is not recommended.
 - The primary key is indexed, and MySQL implements indexes as B+ trees. Every insert modifies the B+ tree underlying the index. Because UUIDs are unordered, each insert can force a large reorganization of the primary key's B+ tree, which hurts performance.
 - Information insecurity: generating UUIDs from the MAC address can leak that address. This weakness was used to locate the author of the Melissa virus.
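
The version digit M, the variant bits of N, and the 36-character length described above can be checked with the JDK's java.util.UUID class. The following is a minimal sketch; the class name UuidFieldsSketch and the name "order:42" are made up for illustration. Note that Java's variant() method uses its own numbering and returns 2 for the 10xx layout that this article calls variant 1.

```java
import java.nio.charset.StandardCharsets;
import java.util.UUID;

class UuidFieldsSketch {
    /** Summarizes the version digit (M), Java's variant code, and the text length of a UUID. */
    static String describe(UUID id) {
        return "version=" + id.version()
                + " variant=" + id.variant()
                + " length=" + id.toString().length();
    }

    public static void main(String[] args) {
        // The sample UUID from the format description: M = 1, N = a (binary 1010).
        UUID sample = UUID.fromString("a23e4567-e79b-12d3-a456-426655440001");
        System.out.println(describe(sample));

        // Version 4: random-based.
        System.out.println(describe(UUID.randomUUID()));

        // Version 3: MD5 hash of a name; "order:42" is a made-up example name.
        UUID named = UUID.nameUUIDFromBytes("order:42".getBytes(StandardCharsets.UTF_8));
        System.out.println(describe(named));
    }
}
```

Every UUID prints a length of 36 characters, which is the size concern raised in the first point above when the value is stored as a string primary key.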
  • MySQL database auto increment ID

For a single MySQL instance, setting the field to auto_increment is enough. For a distributed database, the start value and step size must be set on each machine according to the number of machines.

Take a distributed database cluster of N machines as an example. Number the servers 1 to N, set their start values to 1, 2, 3, ..., N, and set the step size to N. Server 1 then produces 1, N+1, 2N+1, ...; server 2 produces 2, N+2, 2N+2, ...; and server N produces N, 2N, 3N, .... In this way, IDs across the cluster auto-increment without repetition.

Server 1
auto-increment-increment = N
auto-increment-offset = 1

Server 2
auto-increment-increment = N
auto-increment-offset = 2

...

Server N
auto-increment-increment = N
auto-increment-offset = N
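
To see why a start value of k and a step of N never collide, the sequences can be simulated without a database. This is only an illustrative sketch: the class AutoIncrementSketch, the method idsFor, and the cluster size of 3 are assumptions, and no MySQL server is involved.

```java
import java.util.ArrayList;
import java.util.List;

class AutoIncrementSketch {
    /** First `count` IDs of a server that starts at `offset` and advances by `increment`. */
    static List<Long> idsFor(long offset, long increment, int count) {
        List<Long> ids = new ArrayList<>();
        for (int k = 0; k < count; k++) {
            ids.add(offset + k * increment);
        }
        return ids;
    }

    public static void main(String[] args) {
        int n = 3; // hypothetical cluster size
        for (int server = 1; server <= n; server++) {
            // Server k uses offset k and increment n.
            System.out.println("server " + server + ": " + idsFor(server, n, 4));
        }
        // server 1: [1, 4, 7, 10]
        // server 2: [2, 5, 8, 11]
        // server 3: [3, 6, 9, 12]
    }
}
```

Changing n changes every server's sequence, which is why adding machines later requires reconfiguring the whole cluster.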

However, database auto-increment IDs are not well suited to distributed ID generation, for the following reasons:

1. They hinder horizontal expansion of the cluster. When machines must be added as business volume grows, the start values and step sizes have to be reconfigured, which is very troublesome.
2. Every ID requires a read and a write against the database, which does not meet the low-latency and high-QPS goals of a distributed ID; under high concurrency, fetching IDs from the database severely hurts performance.
3. The IDs are auto-incremented and consecutive, so they are easy to guess and attack, and security is not guaranteed.
  • Redis generates distributed ID

On a single machine, Redis's atomic INCR operation is all that is needed, and the generated IDs increase continuously. In production, however, Redis is generally deployed as a cluster, so this section focuses on generating distributed IDs in a distributed cluster environment.

We use 64 binary bits to represent an ID, with the following structure.

The first bit indicates a positive number, the next 41 bits hold the time in milliseconds, the next 12 bits hold the node number, and the last 10 bits hold the serial number within the millisecond. From this structure we can calculate that 41 bits of milliseconds cover about 69 years, the cluster can have up to 4096 nodes, and each node can generate 1024 IDs per millisecond. In Java, the ID can be assembled as follows:

/**
 * Builds an ID from the current time in milliseconds, the node number, and the serial number.
 *
 * @param miliSecond current time in milliseconds
 * @param shardId    node number (0-4095)
 * @param seq        serial number within the millisecond on this node (0-1023)
 * @return the generated ID
 */
public static long generateId(long miliSecond, long shardId, long seq) {
	return (miliSecond << (12 + 10)) | (shardId << 10) | seq;
}
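
The layout can be checked by packing an ID and extracting the three fields again. The sketch below is an illustration under assumptions: the class name RedisIdLayoutSketch and the helpers millisOf, shardOf, seqOf, and yearsCovered are hypothetical names, and the sample values reuse the millisecond 1592902770010 and node 99 from this section's example.

```java
class RedisIdLayoutSketch {
    static final int NODE_BITS = 12;
    static final int SEQ_BITS = 10;
    static final long NODE_MASK = (1L << NODE_BITS) - 1; // 4095
    static final long SEQ_MASK = (1L << SEQ_BITS) - 1;   // 1023

    /** Same packing as generateId above: timestamp, then node number, then serial number. */
    static long generateId(long miliSecond, long shardId, long seq) {
        return (miliSecond << (NODE_BITS + SEQ_BITS)) | (shardId << SEQ_BITS) | seq;
    }

    static long millisOf(long id) { return id >>> (NODE_BITS + SEQ_BITS); }

    static long shardOf(long id) { return (id >>> SEQ_BITS) & NODE_MASK; }

    static long seqOf(long id) { return id & SEQ_MASK; }

    /** Whole 365-day years that fit into a millisecond counter of the given bit width. */
    static long yearsCovered(int timestampBits) {
        long millisPerYear = 1000L * 60 * 60 * 24 * 365;
        return (1L << timestampBits) / millisPerYear;
    }

    public static void main(String[] args) {
        long id = generateId(1592902770010L, 99, 7);
        System.out.println("id     = " + id);
        System.out.println("millis = " + millisOf(id)); // 1592902770010
        System.out.println("shard  = " + shardOf(id));  // 99
        System.out.println("seq    = " + seqOf(id));    // 7
        System.out.println("years covered by 41 bits = " + yearsCovered(41));
    }
}
```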

The seq parameter can be obtained with Redis's INCR command. When the same node generates several IDs in the same millisecond (for example, the current millisecond is 1592902770010 and the node number is 99), the key can be set to 99:1592902770010, and the return value of INCR on that key gives the seq value of the node for that millisecond, which must stay within the range 0-1023 (INCR returns 1 on the first call for a key, so subtract 1 to start the range at 0). If several Redis operations need to be atomic, they can be placed in a single Lua script and executed through EVAL or EVALSHA. Both the miliSecond parameter and the seq parameter are returned by Redis, which keeps the time consistent across nodes.
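
The key scheme can be sketched without a Redis server. In the code below, the incr method is an in-memory stand-in that only mimics the documented behavior of INCR (a missing key counts as 0, and the value after adding 1 is returned); it is not a Redis client, and the class name IncrSequenceSketch and the -1 return value for an exhausted range are assumptions made for illustration.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

class IncrSequenceSketch {
    // In-memory stand-in for Redis counters. This is NOT a Redis client.
    private final ConcurrentHashMap<String, AtomicLong> counters = new ConcurrentHashMap<>();

    /** Mimics INCR: a missing key starts at 0, and the value after incrementing is returned. */
    long incr(String key) {
        return counters.computeIfAbsent(key, k -> new AtomicLong()).incrementAndGet();
    }

    /**
     * Builds an ID for the given node and millisecond.
     * Returns -1 (an assumption of this sketch) when the 1024 serial numbers are used up.
     */
    long nextId(long miliSecond, long shardId) {
        String key = shardId + ":" + miliSecond; // e.g. "99:1592902770010"
        long seq = incr(key) - 1;                // INCR starts at 1, the serial number at 0
        if (seq > 1023) {
            return -1;
        }
        return (miliSecond << (12 + 10)) | (shardId << 10) | seq;
    }

    public static void main(String[] args) {
        IncrSequenceSketch sketch = new IncrSequenceSketch();
        long first = sketch.nextId(1592902770010L, 99);
        long second = sketch.nextId(1592902770010L, 99);
        System.out.println(first + " seq=" + (first & 1023));
        System.out.println(second + " seq=" + (second & 1023));
    }
}
```

With a real Redis deployment, the per-millisecond keys would also need an expiry so that they do not accumulate; that is not modeled here.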

Generating distributed IDs with a Redis cluster far outperforms database auto-increment IDs. The disadvantages are:

1. Horizontal expansion of the cluster is still troublesome.
2. It adds a dependency on Redis. Introducing Redis, and then maintaining it and keeping it highly available, carries a cost (if the system already uses Redis, this cost is negligible).
  • Snowflake algorithm


The ID generated by the snowflake algorithm has roughly the same structure as the Redis distributed ID above, except that the node number takes 10 bits (which can also be split into a 5-bit data center number and a 5-bit machine number) and the per-millisecond serial number of each node takes 12 bits. The number of nodes shrinks and the serial number range grows, so the total number of IDs the whole cluster can generate per millisecond stays the same.

import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Random;
import java.util.concurrent.atomic.AtomicLongArray;

/**
 * Snowflake algorithm implementation.
 * The 64-bit ID is laid out as follows:
 *  the first bit is 0, indicating a positive number
 *  the next 41 bits hold the timestamp in milliseconds
 *  the next 10 bits hold the machine number
 *  the last 12 bits hold the serial number
 *
 *  To guard against the clock being set back, the code keeps a ring of recent IDs.
 */
public class SnowflakeTest {

    /**
     * Number of bits occupied by the machine number
     */
    private static final long MACHINE_BITS = 10;

    /**
     * Number of bits occupied by the serial number
     */
    private static final long SEQUENCE_BITS = 12;

    /**
     * Number of bits the timestamp is shifted left in the ID: machine bits + serial number bits
     */
    private static final long TIMESTAMP_SHIFT_COUNT = MACHINE_BITS + SEQUENCE_BITS;

    /**
     * Number of bits the machine number is shifted left in the ID: serial number bits
     */
    private static final long MACHINE_ID_SHIFT_COUNT = SEQUENCE_BITS;

    /**
     * Machine number mask
     */
    private static final long MACHINE_MASK = ~(-1L << MACHINE_BITS);

    /**
     * Serial number mask
     */
    private static final long SEQUENCE_MASK = ~(-1L << SEQUENCE_BITS);

    /**
     * Start timestamp
     */
    private static long START_THE_WORLD_MILLIS;

    /**
     * Machine number
     */
    private long machineId;


    /**
     * ID ring of size 200.
     * Each slot keeps the last ID generated for one millisecond, covering the most recent 200 milliseconds,
     * so the previous ID is still available when the clock goes back.
     * This is the key to handling clock rollback, and it also reduces contention
     * between threads when the millisecond changes.
     */
    private AtomicLongArray idCycle = new AtomicLongArray(200);


    static {
        // 2020-01-01 00:00:00 (UTC+8)
        START_THE_WORLD_MILLIS = 1577808000000L;
    }

    /**
     * Initializes the generator by choosing the local machineId.
     *
     * @throws Exception if the machineId is out of range
     */
    public void init() throws Exception {
        if (machineId == 0L) {
            // Chosen at random for now; a real deployment should assign a unique number to each machine
            Random random = new Random();
            random.setSeed(System.currentTimeMillis());

            // The bound of nextInt is exclusive, so add 1 to make 1023 reachable
            machineId = random.nextInt((int) MACHINE_MASK + 1);
        }
        // The machineId must lie within the range the mask allows
        if (machineId < 0L || machineId > MACHINE_MASK) {
            throw new Exception("the machine id is out of range, it must be between 0 and 1023");
        }
    }

    /**
     * Generate distributed ID
     */
    public long genID() {
        do {
            // Current timestamp: the current time in milliseconds minus START_THE_WORLD_MILLIS
            long timestamp = System.currentTimeMillis() - START_THE_WORLD_MILLIS;

            // Index of the current time in idCycle, used to fetch the previous ID stored in that slot
            int index = (int) (timestamp % idCycle.length());

            long idInCycle = idCycle.get(index);

            // Recover the timestamp of that previous ID
            long timestampInCycle = idInCycle >> TIMESTAMP_SHIFT_COUNT;

            // If the slot is empty, or its timestamp is older than the current time, start a new ID with serial number 0
            if (idInCycle == 0 || timestampInCycle < timestamp) {
                long id = timestamp << TIMESTAMP_SHIFT_COUNT | machineId << MACHINE_ID_SHIFT_COUNT;
                // CAS ensures that no two threads obtain the same ID here
                if (idCycle.compareAndSet(index, idInCycle, id)) {
                    return id;
                }
            }

            // Equal timestamps mean another ID within the same millisecond;
            // a current timestamp smaller than the one in idCycle means the clock was set back
            if (timestampInCycle >= timestamp) {
                long sequence = idInCycle & SEQUENCE_MASK;
                if (sequence >= SEQUENCE_MASK) {
                    // The serial numbers for this slot are used up, so retry until the clock reaches another millisecond
                    System.out.println("over sequence mask :" + sequence);
                    continue;
                }
                long id = idInCycle + 1L;

                // CAS ensures that no two threads obtain the same ID here
                if (idCycle.compareAndSet(index, idInCycle, id)) {
                    return id;
                }
            }
        } while (true);
    }

    /**
     * Extracts the machine number from a distributed ID.
     *
     * @param id the distributed ID
     * @return the number of the machine that generated the ID
     */
    public static long getMachineId(long id) {
        return (id >> MACHINE_ID_SHIFT_COUNT) & MACHINE_MASK;
    }

    /**
     * Extracts the serial number from a distributed ID.
     *
     * @param id the distributed ID
     * @return the serial number within the millisecond
     */
    public static long getSequence(long id) {
        return id & SEQUENCE_MASK;
    }

    /**
     * Extracts the generation time from a distributed ID.
     *
     * @param id the distributed ID
     * @return the generation time in milliseconds since the Unix epoch
     */
    public static long getTimestamp(long id) {
        return (id >>> TIMESTAMP_SHIFT_COUNT) + START_THE_WORLD_MILLIS;
    }
}

Test code

    // Test code (these methods belong inside the SnowflakeTest class)
    public static void main(String[] args) {
        SnowflakeTest test = new SnowflakeTest();

        try {
            test.init();
        } catch (Exception e) {
            e.printStackTrace();
        }
        // 10 threads each loop 5 times, generating 50 IDs in total
        for (int i = 0; i < 10; i++) {
            new Thread(() -> {
                for (int j = 0; j < 5; j++) {
                    parseId(test.genID());
                }
            }, "thread_name" + i).start();
        }

    }

    private static void parseId(long id) {
        long miliSecond = getTimestamp(id);
        long machineId = getMachineId(id);
        long seq = getSequence(id);

        String date = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss S").format(new Date(miliSecond));


        System.out.println(Thread.currentThread().getName()+" [ID:"+id+"] [MachineID:"+machineId+"] [Sequence:"+seq+"] [Date:"+date+"]");
    }

Because the snowflake algorithm depends on the system clock, a clock that is set back could cause duplicate IDs; the ID ring is used to solve this problem.
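
The effect of the ID ring under a clock rollback can be reproduced with a condensed variant of the generator whose clock is injected. This is a sketch under assumptions, not the listing above: the class name RingRollbackSketch, the LongSupplier clock parameter, and the fixed test times are all hypothetical, and the clock returns milliseconds since the custom epoch.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.atomic.AtomicLongArray;
import java.util.function.LongSupplier;

class RingRollbackSketch {
    private static final int SEQUENCE_BITS = 12;
    private static final int TIMESTAMP_SHIFT = 10 + SEQUENCE_BITS; // machine bits + serial number bits
    private static final long SEQUENCE_MASK = ~(-1L << SEQUENCE_BITS);

    private final AtomicLongArray ring = new AtomicLongArray(200);
    private final LongSupplier clock; // hypothetical injectable clock, used to simulate a rollback
    private final long machineId;

    RingRollbackSketch(LongSupplier clock, long machineId) {
        this.clock = clock;
        this.machineId = machineId;
    }

    long genId() {
        while (true) {
            long now = clock.getAsLong();
            int index = (int) (now % ring.length());
            long last = ring.get(index);
            long lastTimestamp = last >> TIMESTAMP_SHIFT;
            long next;
            if (last == 0 || lastTimestamp < now) {
                // Slot is empty or stale: start this millisecond at serial number 0.
                next = (now << TIMESTAMP_SHIFT) | (machineId << SEQUENCE_BITS);
            } else if ((last & SEQUENCE_MASK) < SEQUENCE_MASK) {
                // Same millisecond, or the clock went back: continue after the slot's last ID.
                next = last + 1;
            } else {
                continue; // serial numbers used up; retry until the clock reaches another slot
            }
            if (ring.compareAndSet(index, last, next)) {
                return next;
            }
        }
    }

    public static void main(String[] args) {
        long[] now = {995L};
        RingRollbackSketch gen = new RingRollbackSketch(() -> now[0], 5L);
        Set<Long> seen = new HashSet<>();
        seen.add(gen.genId());
        seen.add(gen.genId());  // serial numbers 0 and 1 at time 995
        now[0] = 1000L;
        seen.add(gen.genId());
        seen.add(gen.genId());  // serial numbers 0 and 1 at time 1000
        now[0] = 995L;          // the clock is set back by 5 ms
        seen.add(gen.genId());
        seen.add(gen.genId());  // continues at time 995 with serial numbers 2 and 3
        System.out.println("unique IDs: " + seen.size()); // 6
    }
}
```

After the rollback the generator finds the last ID issued for millisecond 995 in its slot and continues from it, so no ID is handed out twice. The IDs are no longer strictly increasing across the rollback, however.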

Reduced snowflake algorithm


As shown in the figure above, the ID can be built from 53 bits: the first bit indicates a positive number, the next 33 bits hold the time in seconds (note: seconds, not milliseconds), the next 4 bits hold the node number, and the last 15 bits hold the serial number of the IDs each node generates per second. This is enough for small and medium-sized companies, and the layout can be adjusted to actual business needs. The only difference in ID generation is the number of bits each part is shifted to the left. For the 53-bit ID above, the current system time in seconds is shifted left by 19 bits and the node number is shifted left by 15 bits.
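
The shifts for the 53-bit layout can be written out in the same style as the earlier generateId method. The following is a minimal sketch; the class name CompactIdSketch, its helper methods, and the sample values are assumptions chosen for illustration.

```java
class CompactIdSketch {
    static final int NODE_BITS = 4;
    static final int SEQ_BITS = 15;
    static final long NODE_MASK = (1L << NODE_BITS) - 1; // 15
    static final long SEQ_MASK = (1L << SEQ_BITS) - 1;   // 32767

    /** Seconds shifted left by 19 bits, node number by 15 bits, serial number in the low bits. */
    static long generateId(long seconds, long nodeId, long seq) {
        return (seconds << (NODE_BITS + SEQ_BITS)) | (nodeId << SEQ_BITS) | seq;
    }

    static long secondsOf(long id) { return id >>> (NODE_BITS + SEQ_BITS); }

    static long nodeOf(long id) { return (id >>> SEQ_BITS) & NODE_MASK; }

    static long seqOf(long id) { return id & SEQ_MASK; }

    public static void main(String[] args) {
        // Sample values: a time in seconds, node 9, and serial number 12345.
        long id = generateId(1592902770L, 9, 12345);
        System.out.println("id      = " + id);
        System.out.println("seconds = " + secondsOf(id)); // 1592902770
        System.out.println("node    = " + nodeOf(id));    // 9
        System.out.println("seq     = " + seqOf(id));     // 12345
    }
}
```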

IDs generated by the snowflake algorithm do not depend on any third-party component, so the algorithm is convenient to use and extend.

  • Summary

To sum up, measured against the goals listed for a distributed ID generation algorithm, the snowflake algorithm is the most suitable choice. Clock rollback can be handled with the ID ring described above. The ID ring is not persisted, so it starts empty after a system restart; no better solution is known at present. Discussion is welcome.

Tags: Redis Database Java MySQL

Posted on Sun, 28 Jun 2020 01:06:53 -0400 by plzhelpme