Comparison rules under utf8 character set

Preface:

In mysql, utf8 and utf8mb4 are commonly used character sets. The two character sets are similar. Utf8 is an alias of utf8mb3, so later mentioned in MySQL that utf8 means using 1-3 bytes to represent a character. If you use 4 bytes to encode a character, such as storing some emoji expressions, you need to use utf8mb4. In fact, each character set corresponds to several comparison rules (which can also be translated into collation or collation rules. In English, COLLATE). In the same character set, using different comparison rules will affect the comparison and sorting of character fields. Taking utf8 as an example, this paper introduces the differences of several commonly used comparison rules.

1. Overview of comparison rules under utf8

Let's first look at all the comparison rules supported under the utf8 character set:

mysql> SHOW COLLATION LIKE 'utf8\_%';
+--------------------------+---------+-----+---------+----------+---------+
| Collation                | Charset | Id  | Default | Compiled | Sortlen |
+--------------------------+---------+-----+---------+----------+---------+
| utf8_general_ci          | utf8    |  33 | Yes     | Yes      |       1 |
| utf8_bin                 | utf8    |  83 |         | Yes      |       1 |
| utf8_unicode_ci          | utf8    | 192 |         | Yes      |       8 |
| utf8_icelandic_ci        | utf8    | 193 |         | Yes      |       8 |
| utf8_latvian_ci          | utf8    | 194 |         | Yes      |       8 |
| utf8_romanian_ci         | utf8    | 195 |         | Yes      |       8 |
| utf8_slovenian_ci        | utf8    | 196 |         | Yes      |       8 |
| utf8_polish_ci           | utf8    | 197 |         | Yes      |       8 |
| utf8_estonian_ci         | utf8    | 198 |         | Yes      |       8 |
| utf8_spanish_ci          | utf8    | 199 |         | Yes      |       8 |
| utf8_swedish_ci          | utf8    | 200 |         | Yes      |       8 |
| utf8_turkish_ci          | utf8    | 201 |         | Yes      |       8 |
| utf8_czech_ci            | utf8    | 202 |         | Yes      |       8 |
| utf8_danish_ci           | utf8    | 203 |         | Yes      |       8 |
| utf8_lithuanian_ci       | utf8    | 204 |         | Yes      |       8 |
| utf8_slovak_ci           | utf8    | 205 |         | Yes      |       8 |
| utf8_spanish2_ci         | utf8    | 206 |         | Yes      |       8 |
| utf8_roman_ci            | utf8    | 207 |         | Yes      |       8 |
| utf8_persian_ci          | utf8    | 208 |         | Yes      |       8 |
| utf8_esperanto_ci        | utf8    | 209 |         | Yes      |       8 |
| utf8_hungarian_ci        | utf8    | 210 |         | Yes      |       8 |
| utf8_sinhala_ci          | utf8    | 211 |         | Yes      |       8 |
| utf8_german2_ci          | utf8    | 212 |         | Yes      |       8 |
| utf8_croatian_ci         | utf8    | 213 |         | Yes      |       8 |
| utf8_unicode_520_ci      | utf8    | 214 |         | Yes      |       8 |
| utf8_vietnamese_ci       | utf8    | 215 |         | Yes      |       8 |
| utf8_general_mysql500_ci | utf8    | 223 |         | Yes      |       1 |
+--------------------------+---------+-----+---------+----------+---------+

The naming of these comparison rules is quite regular. The specific rules are as follows:

  • The comparison rule name begins with the name of the character set it is associated with. As shown in the above figure, the comparison rule names of query results start with utf8.
  • It is followed closely by which language the comparison rule mainly applies to. For example, utf8 ﹣ Polish ﹣ CI represents the comparison of rules in Polish, utf8 ﹣ Spanish ﹣ CI is the comparison of rules in Spanish, and UTF8 ﹣ general ﹣ CI is a general comparison rule.
  • Name suffix means whether the comparison rule distinguishes accent and case in the language. The specific values can be used are as follows:
Suffix transitive verb describe
_ai accent insensitive Accent insensitive
_as accent sensitive Stress discrimination
_ci case insensitive Case insensitive
_cs case sensitive Case sensitive
_bin binary Binary comparison

For example, the comparison rule of utf8 ﹣ general ﹣ ci ends with ci, indicating that it is not case sensitive
Each character set has a Default comparison rule. The value of the Default column in the return result of show collection is YES, which is the Default comparison rule of the character set. For example, the Default comparison rule of utf8 character set is utf8 ﹣ general ﹣ CI.

Comparison rules can act on four levels: server level, database level, table level, and column level. The comparison rules at the server level are controlled by the collation ﹣ server parameter. If no explicit comparison rules are specified when creating databases, tables, and columns, the comparison rules at the previous level will be inherited. The following is an example statement for creating and modifying comparison rules for libraries, tables, and columns:

#Create database specify comparison rules modify comparison rules of database
 CREATE DATABASE database name
    [[DEFAULT] CHARACTER SET name]
    [[DEFAULT] COLLATE comparison rule name];

ALTER DATABASE database name
    [[DEFAULT] CHARACTER SET name]
    [[DEFAULT] COLLATE comparison rule name];

#Specify comparison rules when creating tables modify comparison rules for tables
 CREATE TABLE table name (column information)
    [[DEFAULT] CHARACTER SET name]
    [COLLATE comparison rule name]]

ALTER TABLE table name
    [[DEFAULT] CHARACTER SET name]
    [COLLATE comparison rule name]

#Specify the comparison rules of columns when creating modify the comparison rules of columns
 CREATE TABLE table name(
    Column name string type [CHARACTER SET character set name] [COLLATE comparison rule name],
    Other columns...
);

ALTER TABLE table name MODIFY column name string type [CHARACTER SET character set name] [COLLATE comparison rule name];

2. Comparison of several comparison rules

The default comparison rule in utf8 character set is utf8 ﹣ general ﹣ CI. There are three kinds of comparison rules used in daily life: utf8 ﹣ general ﹣ Ci, utf8 ﹣ Unicode ﹣ Ci, utf8 ﹣ bin. Other comparison rules are rarely used. Let's briefly understand the similarities and differences of these three comparison rules.

First of all, utf8 bin compares all characters directly as binary strings, and then compares them from the highest to the lowest. So it's obviously case sensitive. Utf8 ﹣ general ﹣ Ci and UTF8 ﹣ Unicode ﹣ CI are not case sensitive.

There is no substantial difference between utf8 ﹣ general ﹣ Ci and UTF8 ﹣ Unicode ﹣ CI in Chinese and English. The main feature of utf8 ﹣ Unicode ﹣ CI is its support for extension, that is, when one letter is regarded as equal to other letter combinations. For example, 'Di' is equal to 'ss' in German and some other languages. Utf8? General? CI is a legacy comparison rule and does not support extension. It can only compare characters one by one. This means that the utf8 ﹣ general ﹣ CI comparison rules are fast, but less accurate than utf8 ﹣ Unicode ﹣ CI.

Let's practice the similarities and differences of the following three comparison rules:

# Create tables to specify columns as different comparison rules
mysql> create table utf8_test (
    -> col_general varchar(20) CHARACTER SET utf8 COLLATE utf8_general_ci,
    -> col_unicode varchar(20) CHARACTER SET utf8 COLLATE utf8_unicode_ci,
    -> col_bin varchar(20) CHARACTER SET utf8 COLLATE utf8_bin
    -> ) ENGINE=InnoDB DEFAULT CHARSET=utf8;

 # Insert data each row data each field value is the same
 mysql> select * from utf8_test;
+-------------+-------------+---------+
| col_general | col_unicode | col_bin |
+-------------+-------------+---------+
| MySQL       | MySQL       | MySQL   |
| mysql       | mysql       | mysql   |
| A           | A           | A       |
| a           | a           | a       |
| aA          | aA          | aA      |
+-------------+-------------+---------+

# Query validation utf8 bin comparison rule is case sensitive
mysql> select * from utf8_test where col_general='mysql';
+-------------+-------------+---------+
| col_general | col_unicode | col_bin |
+-------------+-------------+---------+
| MySQL       | MySQL       | MySQL   |
| mysql       | mysql       | mysql   |
+-------------+-------------+---------+
mysql> select * from utf8_test where col_unicode='MySQL';
+-------------+-------------+---------+
| col_general | col_unicode | col_bin |
+-------------+-------------+---------+
| MySQL       | MySQL       | MySQL   |
| mysql       | mysql       | mysql   |
+-------------+-------------+---------+
mysql> select * from utf8_test where col_bin='mysql';       
+-------------+-------------+---------+
| col_general | col_unicode | col_bin |
+-------------+-------------+---------+
| mysql       | mysql       | mysql   |
+-------------+-------------+---------+

# Sorting finds that different sorting rules affect the order
mysql> select * from utf8_test order by col_general;
+-------------+-------------+---------+
| col_general | col_unicode | col_bin |
+-------------+-------------+---------+
| A           | A           | A       |
| a           | a           | a       |
| aA          | aA          | aA      |
| MySQL       | MySQL       | MySQL   |
| mysql       | mysql       | mysql   |
+-------------+-------------+---------+
mysql> select * from utf8_test order by col_bin;
+-------------+-------------+---------+
| col_general | col_unicode | col_bin |
+-------------+-------------+---------+
| A           | A           | A       |
| MySQL       | MySQL       | MySQL   |
| a           | a           | a       |
| aA          | aA          | aA      |
| mysql       | mysql       | mysql   |
+-------------+-------------+---------+

3. Selection and suggestions on comparison rules

For MySQL version 5.7, it is generally recommended to change the character set to utf8, and select the default utf8 "general" Ci as the comparison rule. Utf8 ﹣ general ﹣ CI is faster than utf8 ﹣ Unicode ﹣ Ci, but if your application has German, French or Russian, it is recommended to use utf8 ﹣ Unicode ﹣ CI. If a table or column field needs to be case sensitive, you can specify the table or field to use the utf8 bin comparison rule separately

Finally, the main contents of this paper are summarized in the way of mind map

Tags: MySQL Database emoji less

Posted on Fri, 27 Mar 2020 14:05:15 -0400 by Il-Belti