Research on MySQL coding

preface

It was in 2002, and MySQL 4.1 began to support UTF-8 coding RFC 2279 It is stipulated that the code uses 1 ~ 6 bytes to store a single character, and MySQL developers checked it in September of the same year Source code The UTF-8 encoding is changed from the maximum 6 bytes to 3 bytes, so there is the current UTF8MB3. As we all know, UTF8MB3 cannot store SMP (supplementary multi lingual plane) characters, and MySQL was later based on RFC 3629 standard Added UTF8MB4. At this time, we only need to set the encoding type of table and column to UTF8MB4. However, in fact, the encoding type specified through SET NAMES will also affect the transmission of SMP characters when establishing a MySQL connection. This paper mainly analyzes the logic specified by the code when connecting to MySQL

Connector source code analysis

The following versions are based on MySQL connector Java dependency package 5.1.46 and 5.1.47. Other versions may have different implementations.

The reason for the above problem is that when MySQL instantiates a Connector, it will pass a Set UTF8MB4_ Whether indexes contains the variable serverCharsetIndex to obtain a Boolean variable useutf8mb4. In the subsequent execution of SET NAMES, it will judge whether to use UTF8MB4 or UTF8 through the useutf8mb4 variable. Next, let's follow the code to see how these variables are assigned.

mysql-connector-java 5.1.46

  • CharsetMapping

    When the CharsetMapping class is initialized, its internal static code blocks are executed

    public static final int MAP_SIZE = 2048;
    ......
    // Save MySQL codes and sorting rules
    Collation[] collation = new Collation[MAP_SIZE];
    ......
    collation[33] = new Collation(33, "utf8_general_ci", 1, MYSQL_CHARSET_NAME_utf8);
    ......
    collation[45] = new Collation(45, "utf8mb4_general_ci", 1, MYSQL_CHARSET_NAME_utf8mb4);
    collation[46] = new Collation(46, "utf8mb4_bin", 0, MYSQL_CHARSET_NAME_utf8mb4);
    ......
    collation[255] = new Collation(255, "utf8mb4_0900_ai_ci", 0, "utf8mb4");
    ......
    Set<Integer> tempUTF8MB4Indexes = new HashSet<Integer>();
    Collation notUsedCollation = new Collation(0, COLLATION_NOT_DEFINED, 0, NOT_USED);
    for (int i = 1; i < MAP_SIZE; i++) {
      Collation coll = collation[i] != null ? collation[i] : notUsedCollation;
      COLLATION_INDEX_TO_COLLATION_NAME[i] = coll.collationName;
      COLLATION_INDEX_TO_CHARSET[i] = coll.mysqlCharset;
      String charsetName = coll.mysqlCharset.charsetName;
      ......
      // Find the character set name 'utf8mb4' and save the corresponding table below to tempUTF8MB4Indexes
      if (charsetName.equals(MYSQL_CHARSET_NAME_utf8mb4)) {
        tempUTF8MB4Indexes.add(i);
      }
    }
    CHARSET_NAME_TO_COLLATION_INDEX = Collections.unmodifiableMap(charsetNameToCollationIndexMap);
    UTF8MB4_INDEXES = Collections.unmodifiableSet(tempUTF8MB4Indexes);
    Copy code

    Collision class

    class Collation {
      // subscript
      public final int index;
      public final String collationName;
      public final int priority;
      public final MysqlCharset mysqlCharset;
    
      public Collation(int index, String collationName, int priority, String charsetName) {
        this.index = index;
        this.collationName = collationName;
        this.priority = priority;
        this.mysqlCharset = CharsetMapping.CHARSET_NAME_TO_CHARSET.get(charsetName);
      }
    }
    Copy code

    First, an array of type collision will be initialized, and multiple collision instances will be created to save MySQL character Set (Charset) and Collation (collision). Next, we will traverse the array to find the subscript with the character Set name of UTF8MB4, and finally generate a read-only Set and assign it to UTF8MB4_INDEXES variable. At this time, UTF8MB4_INDEXES has all subscripts with the character Set name UTF8MB4.

  • MysqlIO

    When a MySQL connection is created, a TCP connection will be established with MySQL Server through MysqlIO class.

    When the TCP connection is established successfully, MySQL Server will return a Greeting package.

    At this point, call the doHandshake method in the class:

    void doHandshake(String user, String password, String database) throws SQLException {
      // Read the first packet
      this.checkPacketSequence = false;
      this.readPacketSequence = 0;
      Buffer buf = readPacket();
      // Get the protocol version
      this.protocolVersion = buf.readByte();
      ......
      this.serverCapabilities = 0;
      // read capability flags (lower 2 bytes)
      if (buf.getPosition() < buf.getBufLength()) {
        this.serverCapabilities = buf.readInt();
      }
      // When version
      if ((versionMeetsMinimum(4, 1, 1) || ((this.protocolVersion > 9) && (this.serverCapabilities&CLIENT_PROTOCOL_41) != 0))) {
        // read character set (1 byte)
        this.serverCharsetIndex = buf.readByte() & 0xff;
        ......
      }
      ......
    }
    Copy code

    Calling the readPacket method will get the Greeting package data returned by MySQL Server. One byte represents serverCharsetIndex (as shown in the figure above, serverCharsetIndex is 21 after f7 and 33 when converted to hexadecimal), and conduct & operation with 0xff to ensure that its value will not be greater than 255.

  • ConnectionImpl

    Similarly, when a MySQL connection is created, the configureClientCharacterSet method in ConnectionImpl will be called to set the client character encoding type.

    private boolean configureClientCharacterSet(boolean dontCheckServerMatch) throws SQLException {
      // From MySQL attribute characterEncoding
      String realJavaEncoding = getEncoding();
      boolean characterSetAlreadyConfigured = false;
      if (versionMeetsMinimum(4, 1, 0)) {
        characterSetAlreadyConfigured = true;
        // Set the variable useUnicode to true and from the MySQL property useUnicode
        setUseUnicode(true);
        configureCharsetProperties();
        // we need to do this again to grab this for versions > 4.1.0
        realJavaEncoding = getEncoding();
        ......
        // Whether to use unicode. MySQL version > = 4.1.0 or set the property useUnicode=true.
        // Returns true
        if (getUseUnicode()) {
          // Setting the property characterEncoding is not null
          if (realJavaEncoding != null) {
            if (realJavaEncoding.equalsIgnoreCase("UTF-8") || realJavaEncoding.equalsIgnoreCase("UTF8")) {
              // Is MySQL version > = 5.5.2? This version supports utf8mb4
              boolean utf8mb4Supported = versionMeetsMinimum(5, 5, 2);
              // UTF8MB4_ Does indexes contain serverCharsetIndex
              boolean useutf8mb4 = utf8mb4Supported && f(CharsetMapping.UTF8MB4_INDEXES.contains(this.io.serverCharsetIndex));
              // The default is false unless the useOldUTF8Behavior property is set
              if (!getUseOldUTF8Behavior()) {
                // The characterSetNamesMatches method mainly judges characters_ set_ client,character_ set_ The values of the two properties, connection
                if (dontCheckServerMatch || !characterSetNamesMatches("utf8") || (utf8mb4Supported && !characterSetNamesMatches("utf8mb4"))) {
                  execSQL(null, "SET NAMES " + (useutf8mb4 ? "utf8mb4" : "utf8"), -1, null, DEFAULT_RESULT_SET_TYPE,DEFAULT_RESULT_SET_CONCURRENCY, false, this.database, null, false);
                  this.serverVariables.put("character_set_client", useutf8mb4 ? "utf8mb4" : "utf8");
                  this.serverVariables.put("character_set_connection", useutf8mb4 ? "utf8mb4" : "utf8");
                }
              }
            }
          }
        }
      }
      ......
    }
    Copy code

    Let's take a look at the value serverCharsetIndex

    According to the returned result of the packet, the value of serverCharsetIndex should be 33. In the collision array, the character set corresponding to subscript 33 is UTF8, so the subscript must not exist in utf8mb4_ In the indexes set. The value of the variable useutf8mb4 is false. When the execSQL method is called later, the complete SQL should be SET NAMES utf8.

mysql-connector-java 5.1.47

5.1.47 is slightly different from 5.1.46 in the source code implementation. Let's take a look at its changes to the configureClientCharacterSet method.

private boolean configureClientCharacterSet(boolean dontCheckServerMatch) throws SQLException {
  ......
  // Change 1
  // Get the value of the connectioncollision property through getconnectioncollision(). If it is not empty, it will enter the loop
  if (!getUseOldUTF8Behavior() && !StringUtils.isNullOrEmpty(getConnectionCollation())) {
    for (int i = 1; i < CharsetMapping.COLLATION_INDEX_TO_COLLATION_NAME.length; i++) {
      // Collation_ INDEX_ TO_ Collation_ The name collection holds the collationName in the Collation object, such as' utf8mb4 '_ general_ ci'
      if (CharsetMapping.COLLATION_INDEX_TO_COLLATION_NAME[i].equals(getConnectionCollation())) {
        connectionCollationSuffix = " COLLATE " + CharsetMapping.COLLATION_INDEX_TO_COLLATION_NAME[i];
        // Find the charsetName corresponding to the collision and assign it to connectionCollationCharset
        connectionCollationCharset = CharsetMapping.COLLATION_INDEX_TO_CHARSET[i].charsetName;
        realJavaEncoding = CharsetMapping.getJavaEncodingForCollationIndex(i);
      }
    }
  }
  ......
  if (getUseUnicode()) {
    if (realJavaEncoding != null) {
      if (realJavaEncoding.equalsIgnoreCase("UTF-8") || realJavaEncoding.equalsIgnoreCase("UTF8")) {
        boolean utf8mb4Supported = versionMeetsMinimum(5, 5, 2);
        // Change 2
        String utf8CharsetName = connectionCollationSuffix.length() > 0 ? connectionCollationCharset : (utf8mb4Supported ? "utf8mb4" : "utf8");
        if (!getUseOldUTF8Behavior()) {
          if (dontCheckServerMatch || !characterSetNamesMatches("utf8") || (utf8mb4Supported && !characterSetNamesMatches("utf8mb4")) ||(connectionCollationSuffix.length() > 0 && !getConnectionCollation().equalsIgnoreCase(this.serverVariables.get("collation_server")))) {
            execSQL(null, "SET NAMES " + utf8CharsetName + connectionCollationSuffix, -1, null, DEFAULT_RESULT_SET_TYPE,DEFAULT_RESULT_SET_CONCURRENCY, false, this.database, null, false);
            this.serverVariables.put("character_set_client", utf8CharsetName);
            this.serverVariables.put("character_set_connection", utf8CharsetName);
          }
        }
      }
    }
  }
  ......
}
Copy code

From the code above, we can see that it no longer depends on UTF8MB4_INDEXES makes judgment based on the following two points:

  1. Whether connectionCollationSuffix has a value. If yes, the default code is connectionCollationCharset.
  2. Whether the value of the utf8mb4Supported variable is true (that is, whether the MySQL version is > = 5.5.2). If it is true, the code UTF8MB4 is used.

summary

Without changing other settings, we can upgrade the MySQL connector java package to support UTF8MB4. Can UTF8MB4 be supported without upgrading dependent packages? The answer is yes!

Set the property com.mysql.jdbc.faultInjection.serverCharsetIndex

Through the above source code, we can see that the com.mysql.jdbc.faultInjection.serverCharsetIndex property is obtained in the configureClientCharacterSet method and assigned to serverCharsetIndex. Then we only need to set the value of this property to the subscript value of UTF8MB4. For example, 45, 46, 224... (this attribute is only used for testing. We recommend that you do not use this method)

Character of MySQL Server_ set_ Server is set to utf8mb4

In fact, the serverCharsetIndex in the code is based on character_ set_ The value of the server attribute is mapped from the character of MySQL Server_ set_ Change the server to UTF8MB4 and then grab the package

At this time, the value of serverCharsetIndex should be ff, which is converted to hexadecimal 255. collation[255] = new Collation(255, "utf8mb4_0900_ai_ci", 0, "utf8mb4"); The character set corresponding to 255 is utf8mb4.

Finally, this article mainly focuses on the problem of 'MySQL connection setting encoding to utf8 causes SMP character insertion failure'. Please understand if there is any ambiguity:)

Tags: Java Database MySQL Back-end Programmer

Posted on Tue, 07 Dec 2021 02:36:02 -0500 by misterguru