Introduction to and use of Elasticsearch analyzers

What is an analyzer

In our last article we talked about the inverted index. For example, when we search for "bright moon", the inverted index can lead us to the ancient poem "Jing Ye Si" ("Quiet Night Thoughts"), because terms like "bright moon" and "moonlight" were produced from the poem's text by the analyzer. My understanding is that the analyzer tags each document with the terms it contains, and those tags are what let us find the matching document later. Without proper word segmentation, a search for "new Huawei mobile phone" might fail to find the document "Huawei new mobile phone" simply because the word order differs.

Components of an analyzer

  • Character filter
    Before a piece of text is segmented, the raw data should be cleaned up first, for example by stripping html tags; all we care about is the text inside them. For example, for <p>I'm a handsome guy</p>, only "I'm a handsome guy" needs to reach the tokenizer.
  • Tokenizer
    The character filter hands the cleaned content [I'm a handsome guy] to the tokenizer, and the tokenizer splits it into tokens, for example
    I
    am
    a
    handsome guy
    ah

But some of these tokens are not keywords at all, and searching with them is meaningless. What should happen if someone searches for "I", or for "a"? To make search more efficient (and the results look more professional), the keyword we actually want to keep is "handsome guy", not "I", "a", or the trailing "ah". So the result of tokenization needs to be filtered, and that is where the token filter comes in.

  • Token filters: these filter the useless tokens (stop words, modal particles and the like) out of the tokenizer's output; a minimal sketch of the whole pipeline follows below.
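
To see the three pieces working together, here is a minimal sketch against the _analyze API (assuming a local node on localhost:9200; the html_strip character filter, standard tokenizer, and lowercase/stop token filters are just one possible combination):

curl -X POST "localhost:9200/_analyze?pretty" -H 'Content-Type: application/json' -d'
{
  "char_filter": ["html_strip"],
  "tokenizer": "standard",
  "filter": ["lowercase", "stop"],
  "text": "<p>I am a handsome guy</p>"
}'
# the character filter strips the <p> tags, the tokenizer splits the words,
# and the stop filter drops common words such as "a",
# leaving roughly: i, am, handsome, guy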

Elasticsearch supports nine different analyzers by default (a few of them are compared in the sketch after this list)

  • standard (the default; splits on word boundaries and filters out punctuation)

  • simple (filters out punctuation and digits, leaving only letters, lowercased)

  • whitespace (splits on whitespace only; nothing is filtered)

  • stop (like simple, but also removes stop words: modal particles, pause words and so on)

  • keyword (treats the whole input as a single token without any processing, which is why keyword is used for exact matching)

  • pattern (splits with a regular expression, following Java's regex rules)
    In the example here, only the digits are filtered out

  • fingerprint (lowercases, removes duplicates, optionally filters stop words, sorts, and joins the tokens into one)

  • language analyzers supporting more than 30 common languages (Chinese is not among them)

  • custom analyzers that you assemble yourself
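
A quick way to compare a few of these is to push the same text through the _analyze API and look at the tokens that come back (a minimal sketch; the sample text is arbitrary):

curl -X POST "localhost:9200/_analyze?pretty" -H 'Content-Type: application/json' -d'
{ "analyzer": "standard", "text": "Hello, World 42!" }'
# roughly: hello, world, 42

curl -X POST "localhost:9200/_analyze?pretty" -H 'Content-Type: application/json' -d'
{ "analyzer": "simple", "text": "Hello, World 42!" }'
# roughly: hello, world  (the digits are dropped as well)

curl -X POST "localhost:9200/_analyze?pretty" -H 'Content-Type: application/json' -d'
{ "analyzer": "whitespace", "text": "Hello, World 42!" }'
# roughly: Hello,  World  42!  (split on spaces only, nothing filtered)

curl -X POST "localhost:9200/_analyze?pretty" -H 'Content-Type: application/json' -d'
{ "analyzer": "keyword", "text": "Hello, World 42!" }'
# the whole string comes back as a single token, which is why keyword fields match exactly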

The IK analyzer

  • Here, once again, is why we need word segmentation. Strictly speaking we can search without it, but the results are wrong: when I search for 华为 (Huawei), the document 我爱中华 ("I love China") is returned as well, which is obviously not what I want. Why does this happen? Let's look at how the default analyzer segments the Chinese text.


We can see that by default es splits Chinese text into single characters. So when you search for 华为, what you intended as a search for Huawei-related items becomes, after the single-character split, a search for the terms 华 and 为, and any document containing 华 (such as 我爱中华) matches as well. This is obviously not what we want, so we need proper word segmentation to get more accurate matching.
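
You can verify this with the _analyze API (a small sketch, assuming a local node):

curl -X POST "localhost:9200/_analyze?pretty" -H 'Content-Type: application/json' -d'
{ "analyzer": "standard", "text": "我爱华为" }'
# the standard analyzer emits one token per Chinese character: 我, 爱, 华, 为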

  • ElasticSearch handles English word segmentation out of the box, which is natural enough since English is its authors' native language, but what about Chinese? That is where the IK analyzer comes in; its 1.0 version was released back in 2006.
    IK Analyzer is an open-source, lightweight Chinese word segmentation toolkit written in Java. It was integrated into Elasticsearch by medcl (Zeng Yong, Elastic development engineer and evangelist, head of the Elasticsearch open source community, who joined Elastic in 2015), and it supports user-defined dictionaries.
    To install the IK plugin, find it on GitHub at https://github.com/medcl/elasticsearch-analysis-ik. Just make sure the plugin version matches your Elasticsearch version, and be sure to download the package from GitHub or the official site; this blogger lost two or three hours to a third-party package. The install and re-indexing steps are sketched after this list.

  • After installing the IK plugin, let's run the analysis check again

  • However, you still cannot find anything by searching 华为 at this point, because the document 我爱华为 ("I love Huawei") was indexed before IK was in place, so no 华为 term was ever written into the inverted index for it. So we delete the index and recreate it

This time, searching for 华为 (Huawei) no longer brings back the "I love China" entry; only the single "I love Huawei" document is found. Remember, the inverted index is generated when the document is created, so before creating documents it is best to sketch out the data model you need, then choose the field types and the appropriate analyzer for each field according to that model. The right structure gives you the right index, and the right index gives you the right search service. I once saw someone insert data without defining a mapping; the system does generate a dynamic mapping in that case, but it will not necessarily meet your needs. Just as in the segmentation example above, if the index was built character by character, then searching for 华为 as a whole word will not match anything. A minimal end-to-end sketch of these steps follows.
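
Putting the steps above together, here is a minimal end-to-end sketch (the index name test_index, the field name content, and the 7.14.0 version number are assumptions; match the plugin version to your own Elasticsearch version):

# install the IK plugin from the official GitHub releases, then restart the node
bin/elasticsearch-plugin install https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v7.14.0/elasticsearch-analysis-ik-7.14.0.zip

# delete the old index and recreate it with IK as the analyzer for the text field
curl -X DELETE "localhost:9200/test_index?pretty"
curl -X PUT "localhost:9200/test_index?pretty" -H 'Content-Type: application/json' -d'
{
  "mappings": {
    "properties": {
      "content": { "type": "text", "analyzer": "ik_max_word", "search_analyzer": "ik_smart" }
    }
  }
}'

# re-insert the documents so they are segmented with IK this time
curl -X POST "localhost:9200/test_index/_doc?refresh=true&pretty" -H 'Content-Type: application/json' -d'
{ "content": "我爱华为" }'
curl -X POST "localhost:9200/test_index/_doc?refresh=true&pretty" -H 'Content-Type: application/json' -d'
{ "content": "我爱中华" }'

# a match query for 华为 now hits only the 我爱华为 document
curl -X POST "localhost:9200/test_index/_search?pretty" -H 'Content-Type: application/json' -d'
{ "query": { "match": { "content": "华为" } } }'

Using ik_max_word at index time and ik_smart at search time is a common combination: the finer-grained index-time segmentation keeps recall up, while the coarser search-time segmentation keeps the query precise.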

Configuration of the IK analyzer

  • Let's look at the plugin's config directory
[elastic@localhost config]$ ls
extra_main.dic  extra_single_word.dic  extra_single_word_full.dic  extra_single_word_low_freq.dic  extra_stopword.dic  IKAnalyzer.cfg.xml  main.dic  preposition.dic  quantifier.dic  stopword.dic  suffix.dic  surname.dic
[elastic@localhost config]$ pwd
/home/elastic/elasticsearch2/plugins/ik/config

IKAnalyzer.cfg.xml: used to configure custom dictionaries

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
        <comment>IK Analyzer Extended configuration</comment>
        <!-- Users can configure their own extension dictionaries here; relative and absolute paths both work.
             These are words the built-in dictionary does not have, such as internet slang (蓝瘦香菇, 鬼畜 and so on) -->
        <entry key="ext_dict">dic/hehe.dic;dic/haha.dic</entry>
        <!-- Users can configure their own extended stop-word dictionary here, for meaningless filler words such as "ah" -->
        <entry key="ext_stopwords">dic/stop.dic</entry>
        <!-- Users can configure a remote extension dictionary here -->
        <entry key="remote_ext_dict">http://m.dic.cn/ext.txt</entry>
        <!-- Users can configure a remote extended stop-word dictionary here -->
        <!-- <entry key="remote_ext_stopwords">http://m.dic.cn/stop.txt</entry> -->
</properties>
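
For reference, an extension dictionary such as dic/hehe.dic is just a plain UTF-8 text file with one word or phrase per line (the entries below are made-up examples of slang the built-in dictionary would not know):

蓝瘦香菇
鬼畜

A stop-word dictionary such as dic/stop.dic uses the same one-word-per-line format.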
  • Let's first try a quick example with some slang words (蓝瘦香菇 and the like) that the built-in dictionary does not know, and see how they get analyzed
  • Now let me configure the file to use a remote extension dictionary (note that if the remote dictionary's address is modified or a new one is added, Elasticsearch needs to be restarted; if the address stays the same and you only append words to the file, no restart is needed)
  • Take a look at my remote address (note that, to keep the txt file from showing up garbled in the browser, charset utf-8 was added to the nginx server; a sketch of that config follows the file below)

    My configuration file is as follows. A remote extension dictionary is configured
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
        <comment>IK Analyzer Extended configuration</comment>
        <!-- Users can configure their own extension dictionary here -->
        <entry key="ext_dict"></entry>
        <!-- Users can configure their own extended stop-word dictionary here -->
        <entry key="ext_stopwords"></entry>
        <!-- Users can configure a remote extension dictionary here -->
        <entry key="remote_ext_dict">https://0e2d-222-129-5-131.ngrok.io/ext.txt</entry>
        <!-- Users can configure a remote extended stop-word dictionary here -->
        <!-- <entry key="remote_ext_stopwords">words_location</entry> -->
</properties>
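
The remote file itself uses the same one-word-per-line UTF-8 format. When it is served by nginx, as mentioned above, a location block along these lines keeps the encoding right (a sketch only; the server name and paths are made up, and IK detects changes via the Last-Modified/ETag response headers, which nginx already sends for static files):

server {
    listen 80;
    server_name dic.example.com;       # wherever ext.txt is published

    location /ext.txt {
        root /usr/share/nginx/html;    # ext.txt lives here, saved as UTF-8
        charset utf-8;                 # prevents the browser from showing the txt as garbled text
    }
}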

  • I added a new word, "technology bug", to the remote dictionary
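
Once IK has pulled the updated file (it polls the remote address periodically), the new entry can be checked with _analyze. A small sketch, assuming the word added was literally the string 技术bug; substitute whatever you actually put in ext.txt:

curl -X POST "localhost:9200/_analyze?pretty" -H 'Content-Type: application/json' -d'
{ "analyzer": "ik_max_word", "text": "技术bug" }'
# with the remote dictionary loaded, 技术bug now appears as a token of its own
# (ik_max_word may also emit the finer-grained pieces 技术 and bug alongside it)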

main.dic

IK's built-in Chinese dictionary, with more than 270,000 entries. Text that matches these entries is kept together as whole words during segmentation.

quantifier.dic

Measure words and other unit-related terms

suffix.dic

Common suffixes

surname.dic

Chinese surnames

stopword.dic

English stop words
