A simple search engine based on Mysql

preface

Some time ago, because of the project requirements, we need to search chat records according to keywords. Isn't this the function of a search engine?
So I first thought of ElasticSearch distributed search engine, but for some reasons, the company's server resources are relatively tight, there is no additional machine to deploy a set of ElasticSearch service, and the online time is also tight, and the amount of data is not large. Then I thought of Mysql full-text index.

brief introduction

In fact, Mysql has long supported full-text indexing, but it has only supported English retrieval. Since version 5.7.6, Mysql has built-in ngram full-text parser to support Chinese, Japanese and Korean word segmentation.

Mysql full-text index adopts the principle of inverted index. In inverted index, the keyword is the primary key, and each keyword corresponds to a series of files. This keyword appears in these files. In this way, when the user searches for a keyword, the sorter locates the keyword in the inverted index, and can immediately find all the files containing the keyword.

This article tests that based on MySQL version 8.0, the database engine adopts InnoDB

ngram full text parser

ngram is a sequence of N consecutive words in a text. ngram full-text parser can segment text, and each word is a continuous sequence of n words. For example, use ngram full-text parser to segment "Hello pretty boy":

n=1: 'you', 'good', 'Pretty', 'Son' 
n=2: 'Hello', 'How beautiful', 'handsome young man' 
n=3: 'You are beautiful', 'What a pretty boy' 
n=4: 'Hello, pretty boy'

Using global variable ngram in MySQL_ token_ Size to configure the size of n in ngram. Its value range is 1 to 10, and the default value is 2. Usually ngram_token_size is set to the minimum number of words to query. If you need to search for words, put ngram_ token_ Set size to 1. When the default value is 2, the search word will not get any results. Because Chinese words are at least two Chinese characters, the default value of 2 is recommended.

You can view the default NGram of Mysql through the following command_ token_ Size:

show variables like 'ngram_token_size'

There are two ways to set the global variable NGram_ token_ Value of size:

1. When starting the mysqld command, specify:

mysqld --ngram_token_size=2

2. Modify the Mysql configuration file my.ini and add a line of parameters at the end:

ngram_token_size=2

Create full-text index

1. Create full-text index when creating table

CREATE TABLE `article` (
  `id` bigint NOT NULL,
  `url` varchar(1024) COLLATE utf8mb4_general_ci NOT NULL DEFAULT '',
  `title` varchar(256) COLLATE utf8mb4_general_ci NOT NULL DEFAULT '',
  `source` varchar(32) COLLATE utf8mb4_general_ci DEFAULT '',
  `keywords` varchar(32) COLLATE utf8mb4_general_ci DEFAULT NULL,
  `publish_time` timestamp NULL DEFAULT NULL,
  PRIMARY KEY (`id`),
  FULLTEXT KEY `title_index` (`title`) WITH PARSER `ngram`
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_general_ci;

2. Through alter table

ALTER TABLE article ADD FULLTEXT INDEX title_index(title) WITH PARSER ngram;

3. create index

CREATE FULLTEXT INDEX title_index ON article (title) WITH PARSER ngram;

Retrieval method

1. Natural language retrieval (NATURAL LANGUAGE MODE)

Natural language mode is MySQL's default full-text retrieval mode. Natural language patterns cannot use operators, and cannot specify complex queries such as keywords that must appear or must not appear.

Example

select * from article where MATCH(title) AGAINST ('Beijing Tourism' IN NATURAL LANGUAGE MODE);

// No mode is specified. The natural language mode is used by default
select * from article where MATCH(title) AGAINST ('Beijing Tourism');

It can be seen that under this mode, the content containing "Beijing" or "tourism" can be searched according to the "Beijing Tourism" search, because it is divided into two keywords according to natural language.

The results returned in the above example will be sorted automatically according to the matching degree. The high matching degree is in the front, and the matching degree is a non negative floating-point number.

Example

// View matching
select * , MATCH(title) AGAINST ('Beijing Tourism') as score from article where MATCH(title) AGAINST ('Beijing Tourism' IN NATURAL LANGUAGE MODE);

2. BOOLEAN MODE

Boolean retrieval mode can use operators, which can support complex queries that specify whether keywords must or cannot appear, or whether the weight of keywords is high or low.

Example

// No operator
// Include "appointment" or "Introduction"
select * from article where MATCH(title) AGAINST ('Dating strategy' IN BOOLEAN MODE);

// Using operators
// Must include "appointment", can include "Introduction"
select * from article where MATCH(title) AGAINST ('+Dating strategy' IN BOOLEAN MODE);

More operator examples:

'Dating strategy' 
No operator, meaning or, contains either "appointment" or "Introduction"

'+Appointment +Introduction'
Must contain both words

'+Dating strategy'
"Appointment" must be included, but if "Introduction" is also included, the matching degree is higher.

'+Appointment -Introduction'
Must include appointment and cannot include introduction.

'+Appointment ~Introduction'
"Appointment" must be included, but if "strategy" is also included, the matching degree is lower than that of records without "strategy".

'+Appointment +(>Introduction <skill)'
The query must contain records of "appointment" and "strategy" or "appointment" and "skill", but the matching degree of "appointment strategy" is higher than that of "dating skill".

'Appointment*'
The query contains records that begin with appointment.

'"Dating strategy"'
Use double quotation marks to enclose the words to be searched. The effect is similar to like '%Dating strategy%',
For example, "Introduction to dating" will be matched, while "Introduction to dating" will not be matched.

Compare with Like

Compared with like query, full-text index has the following advantages:

  • like only performs fuzzy matching, but the full-text index provides some syntactic and semantic query functions. It will segment the string to be queried, which depends on the thesaurus of Mysql.
  • Full text indexing can set the minimum and maximum length of words and words to be ignored, which can be set.
  • Using the full-text index to query a string in a column will return the matching degree, which can be understood as the number of matching keywords, which is a floating-point number.

Moreover, the performance of full-text retrieval is better than that of like query

The following tests are conducted with data of about 50w:

// like query
select * from article where title like '%Beijing%';

// Full text index query
select * from article where MATCH(title) AGAINST ('Beijing' IN BOOLEAN MODE);

It can be seen that the like query is 1.536s and the full-text index query is 0.094s, which is about 16 times faster.

summary

Full text indexing can search quickly, but it also has the overhead of maintaining the index. The larger the field length, the larger the full-text index created, which will affect the throughput of DML statements. When the amount of data is small, full-text indexing can be used for search, which is simple and convenient. However, if the amount of data is large, it is recommended to use a special search engine ElasticSearch to do this.

Tags: MySQL ElasticSearch Programmer search engine

Posted on Fri, 10 Sep 2021 03:37:31 -0400 by jahwobbler