Introduction to Elasticsearch


What is Elasticsearch?

Elasticsearch is a distributed, high-performance, highly available, and scalable search and analytics engine.

Applicable scenarios of Elasticsearch

  1. E-commerce website search
  2. Data analysis
  3. BI system
  4. Log analysis (the ELK stack), etc.

Lucene and Elasticsearch

  1. Lucene

Lucene is a subproject of the Apache Software Foundation's Jakarta project. It is an open-source full-text search toolkit; it is not a complete full-text search engine, but a full-text search engine library that provides a complete query engine, an index engine, and some text analyzers (English and German).

  2. Elasticsearch

Elasticsearch wraps Lucene and provides users with a much simpler API. Each Elasticsearch shard is a Lucene instance.

Basic concepts

  1. Cluster

A cluster contains multiple nodes. The cluster each node belongs to is determined by a setting (cluster.name, elasticsearch by default) in the elasticsearch.yml file in the config directory.

  2. Node

A node is a single server in the cluster; it also has a name (randomly assigned by default). If you simply start a group of nodes with the same cluster name, they automatically form an Elasticsearch cluster, and a single node can also form a cluster on its own.
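
For reference, a minimal elasticsearch.yml sketch covering the two settings mentioned above (the values are illustrative):

# config/elasticsearch.yml
cluster.name: my.elk     # nodes with the same cluster.name join the same cluster
node.name: node-1        # optional; a default name is generated if omitted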

  3. Document & field

A document is the smallest unit of data in ES. A document can be a customer record, a commodity-category record, or an order record, and is usually represented as a JSON structure. A type under an index can store many documents. A document contains multiple fields, and each field is a data column.

  4. Index

An index contains a collection of documents with similar structure. For example, there could be a customer index, a commodity-category index, and an order index. An index has a name, contains many documents, and represents one class of similar or identical documents. For example, a commodity index would store all commodity data, i.e. all commodity documents.

  5. Type

Each index can have one or more types. A type is a logical classification of data within an index; documents under the same type have the same fields. For example, a blog system could have one index with a user type, a blog type, and a comment type. Since 7.0 the concept of type has been gradually removed: https://blog.csdn.net/zhanghongzheng3213/article/details/106281436/

Example:

A commodity index stores all commodity data, i.e. all commodity documents.

But there are many kinds of goods, and the fields of each kind of document may differ. For example, electrical goods may include special fields such as an after-sales service period, while fresh goods may include special fields such as a shelf life.

Types: daily chemical goods type, electrical goods type, fresh goods type.

Daily chemical goods type: product_id, product_name, product_desc, category_id, category_name
Electrical goods type: product_id, product_name, product_desc, category_id, category_name, service_period
Fresh goods type: product_id, product_name, product_desc, category_id, category_name, eat_period

Each type contains many documents.
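
As noted above, types are deprecated since 7.0; in 7.x the typeless document endpoints can be used instead (a minimal sketch, the index name and fields are illustrative):

PUT /commodity/_doc/1
{
  "product_id": 1,
  "product_name": "soap",
  "category_name": "daily chemical"
}

GET /commodity/_doc/1
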
  6. Alias

Alias. An index alias can point to one or more indexes and can be used in any API that accepts an index name. This is a very powerful feature: it merges multiple indexes into one logical view.
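
For example, an alias can be pointed at several indexes with the _aliases API (a sketch; the index and alias names are illustrative):

POST /_aliases
{
  "actions": [
    { "add": { "index": "order_2020", "alias": "orders" } },
    { "add": { "index": "order_2021", "alias": "orders" } }
  ]
}

// Searching the alias queries both indexes
GET /orders/_search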

  7. Shard

A single machine cannot store a huge amount of data, so ES can split the data of one index into multiple shards stored on multiple servers. With shards you can scale horizontally, store more data, distribute search and analysis operations across multiple servers, and improve throughput and performance. Each shard is a Lucene index.

  8. Replica

Any server may fail or go down at any time and a shard on it could be lost, so you can create one or more replicas for each shard. Replicas provide backup in case a shard fails, ensuring no data is lost, and multiple replicas also increase the throughput and performance of search operations. The number of primary shards is set once when the index is created and cannot be modified afterwards (5 by default before 7.0, 1 by default since 7.0); the number of replica shards can be changed at any time (1 by default). With the old defaults an index had 10 shards: 5 primary shards and 5 replica shards. The minimum highly available setup is 2 servers.
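
The shard and replica counts are configured per index when it is created; a minimal sketch with explicit settings (the values are illustrative):

PUT /test_index
{
  "settings": {
    "number_of_shards": 3,      // fixed once the index is created
    "number_of_replicas": 1     // can be changed at any time
  }
}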

Basic API

Cluster / index base API

  1. Quickly check the health of the cluster
GET /_cat/health?v

epoch      timestamp cluster status node.total node.data shards pri relo init unassign pending_tasks max_task_wait_time active_shards_percent
1637731073 05:17:53  my.elk  yellow          1         1     12  12    0    0        3             0                  -                 80.0%

// The status field indicates the health of the cluster
green: every primary shard and replica shard of every index is active
yellow: every primary shard of every index is active, but some replica shards are not active (unavailable)
red: not every primary shard of every index is active; some indexes have lost data
  2. View all indexes
GET /_cat/indices?v

health status index                           uuid                   pri rep docs.count docs.deleted store.size pri.store.size
yellow open   test2                           MrmwSSNZQOyElfEZVAm0Bw   1   1          0            0       208b           208b
  3. Delete an index
DELETE /test_index?pretty
  4. Create an index. Note that only the most basic index is created here and some parameters use defaults; this is fine for a demo but not for real development
PUT /test_index?pretty

Simple CRUD
Note: the meaning of each field is explained in the comments in the code

  1. Create a document
// The id can be specified or omitted; if omitted, ES generates a distributed id by default
// ES creates the index and type automatically without creating them in advance, and by default ES builds an inverted index for every field of the document so that it can be searched
PUT /index/type/id
{
    "attribute": "value"
}


// Example:
PUT /test/test/1
{
  "name":"xia",
  "age":26
}
// Return value 
{
  "_index" : "test",            // Indexes
  "_type" : "test",             // type
  "_id" : "1",                  // Unique document identifier 
  "_version" : 1,               // The concurrency control version number of the old version. Optimistic lock is used in es to realize concurrency control
  "result" : "created",         // created indicates that this document operation is new
  "_shards" : {                 // Slice information
    "total" : 2,                // Requests are sent to different partitions. Most partitions respond normally. These are successful. If a partition does not respond, it is failed
    "successful" : 1,
    "failed" : 0
  },
  "_seq_no" : 0,                // Concurrency control version number of the new version     
  "_primary_term" : 1           // Concurrency control version number of the new version     
}
  2. Retrieve a document
GET /index/type/id

// Return value
{
  "_index" : "test",
  "_type" : "test",
  "_id" : "1",
  "_version" : 1,
  "_seq_no" : 4,
  "_primary_term" : 1,
  "found" : true,
  "_source" : {     // Document entity
    "name" : "xia",
    "age" : 26
  }
}


  3. Replace a document (full replacement)
// The id already exists, so the entire document is replaced. Under the hood ES marks the old document as deleted (a logical delete) and then writes a new one
PUT /index/type/id

PUT /test/test/1
{
  "name":"xia1",
  "age":28
}

// Return value
{
  "_index" : "test",
  "_type" : "test",
  "_id" : "1",
  "_version" : 2,           // Version number change
  "result" : "updated",     // This is a modification
  "_shards" : {
    "total" : 2,
    "successful" : 1,
    "failed" : 0
  },
  "_seq_no" : 1,            // Version number
  "_primary_term" : 1       // This is a piecemeal change
}

  4. Update specified fields of a document (partial update)
POST /test/test/1/_update
{
  "doc": {
    "name": "xia2"
  }
}

// Return value
{
  "_index" : "test",
  "_type" : "test",
  "_id" : "1",
  "_version" : 3,
  "result" : "updated",
  "_shards" : {
    "total" : 2,
    "successful" : 1,
    "failed" : 0
  },
  "_seq_no" : 2,
  "_primary_term" : 1
}
  5. Delete a document
DELETE /test/test/1

// Return value
{
  "_index" : "test",
  "_type" : "test",
  "_id" : "1",
  "_version" : 4,
  "result" : "deleted",   // Identify as delete
  "_shards" : {
    "total" : 2,
    "successful" : 1,
    "failed" : 0
  },
  "_seq_no" : 3,
  "_primary_term" : 1
}

Query DSL

  1. Query all
GET /test/test/_search

or 

GET /test/test/_search
{
  "query": { "match_all": {} }
}

// Return value
{
  "took" : 1,                       // It took a few milliseconds
  "timed_out" : false,              // Timeout false no
  "_shards" : {                     // If there are several pieces of data, the search request will hit several primary shard s (or one of its replica shard s)
    "total" : 1,                    // total
    "successful" : 1,               // Successful response
    "skipped" : 0,                  // Skipped
    "failed" : 0                    // Unresponsive
  },
  "hits" : {
    "total" : {
      "value" : 1,                  // Number of query results
      "relation" : "eq"
    },
    "max_score" : 1.0,              // Score means the matching score of the document for the correlation degree of a search. The more relevant, the more matched, and the higher the score
    "hits" : [                      // Contains detailed data of document s matching the search
      {
        "_index" : "test",
        "_type" : "test",
        "_id" : "1",
        "_score" : 1.0,             // Matching values for the current document
        "_source" : {
          "name" : "xia",
          "age" : 26
        }
      }
    ]
  }
}
  2. Conditional query, sorting, and pagination
GET /test/test/_search
{
    "query" : {
        "match" : {
            "name" : "xia"   // Match an attribute 
        }
    },
    "sort": [
        { "age": "desc" }   // sort
    ],
    "from":2,               // Paging initial offset
    "size":1                // Current offset of paging
}

// The return value is the same as above
  3. Query with a filter
GET /test/test/_search
{
    "query" : {
        "bool" : {
            "must" : {
                "match" : {
                    "name" : "xia" 
                }
            },
            "filter" : {
                "range" : {
                    "age" : { "gt" : 25 } 
                }
            }
        }
    }
}

// and 
GET /test/test/_search
{
    "query":{
        "bool":{
            "must":[
                {
                    "term":{
                        "name":"xia"
                    }
                },
                {
                    "term":{
                        "age":"2"
                    }
                }
            ]
        }
    },
    "from":0,
    "size":10
}
  4. Full-text search
GET /test/test/_search
{
    "query" : {
        "match" : {
            // 'xia' is analyzed (tokenized) first, then the resulting terms are matched against the index
            "name": "xia"
        }
    }
}
  5. Phrase search
GET /test/test/_search
{
    "query" : {
        "match_phrase" : {
            // The search string must appear in the specified field exactly, as a contiguous phrase, for the document to match
            "name" : "xia"
        }
    }
}

  6. Custom result fields (_source filtering)
GET /test/test/_search
{
    "query" : {
        "bool" : {
            "must" : {
                "match" : {
                    "name" : "xia" 
                }
            },
            "filter" : {
                "range" : {
                    "age" : { "gt" : 25 } 
                }
            }
        }
    },
    "_source":["age"]  // Write the attribute you want to query here
}
  7. Highlight search: left as homework; the sketch below is a starting point
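
A minimal highlight query looks roughly like this (reusing the field from the earlier examples; matched terms are wrapped in <em> tags by default):

GET /test/test/_search
{
  "query": { "match": { "name": "xia" } },
  "highlight": {
    "fields": { "name": {} }
  }
}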

Aggregation is a very powerful feature of Elasticsearch

  1. Simple aggregation. Note: aggregating on a text field requires setting that field's fielddata property to true (see the mapping sketch after this example)
GET /test/test/_search
{
  "aggs": {                                         // Simple barrel separation
    "group_by_tags": {                              // Return value property of bucket Division
      "terms": { "field": "age" }                   
    }
  },
  
   "query" : {                                      // query criteria
        "match" : {
            "name": "xia"
        }
    },
  
  "from":"0","size":2                               // Here is the offset of the control return value hits
}

// Return value
{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {                                        // Query results
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 0.6931471,
    "hits" : [
      {
        "_index" : "test",
        "_type" : "test",
        "_id" : "1",
        "_score" : 0.6931471,
        "_source" : {
          "name" : "xia",
          "age" : 26
        }
      }
    ]
  },
  "aggregations" : {                                // Aggregation results
    "group_by_tags" : {                             // The return value property of bucket division is defined in the request
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [                                 // Polymerized barrel
        {
          "key" : 26,                               // The key has been divided into buckets, which is the same as Mysql groupBy age
          "doc_count" : 1                           // Number of documents in the bucket
        }
      ]
    }
  }
}
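
The fielddata note in item 1 can be addressed with a mapping change like this sketch (only needed when aggregating on a text field such as name; fielddata consumes heap memory, so use it with care):

PUT /test/_mapping
{
  "properties": {
    "name": {
      "type": "text",
      "fielddata": true
    }
  }
}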

  2. Group first, then calculate the average value of each group
GET /test/test/_search
{
    "size": 0,
    "aggs" : {
        "group_by_tags" : {
            "terms" : { "field" : "name" },
            "aggs" : {
                "avg_price" : {
                    "avg" : { "field" : "age" }
                }
            }
        }
    }
}

// Return value
{
  "took" : 27,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "group_by_tags" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        {
          "key" : "xia",
          "doc_count" : 1,
          "avg_price" : {
            "value" : 26.0
          }
        }
      ]
    }
  }
}

3. Sorting

GET /test/test/_search
{
    "size": 0,
    "aggs" : {
        "group_by_tags" : {
             "terms" : { "field" : "name", "order": { "avg_price": "desc" } },
            "aggs" : {
                "avg_price" : {
                    "avg" : { "field" : "age" }
                }
            }
        }
    }
}

  4. Nested buckets, date histograms, and more are supported in Elasticsearch. If you are interested, see the official documentation: https://www.elastic.co/guide/en/elasticsearch/reference/7.15/search-aggregations-bucket.html

Elasticsearch-SQL

SQL is supported in official Elasticsearch releases. Although the SQL support is fairly complete, there are still many restrictions imposed by the nature of the underlying NoSQL store, such as no JOIN support, and the JDBC driver requires a paid license, etc.

Simple use

  1. Simple query
GET /_sql?format=txt
{
  "query": """ SELECT * FROM "test" limit 10 """
}
// Return value 
age            |name      
---------------+---------------
26             |xia            


GET /_sql
{
  "query": """ SELECT * FROM "test" limit 10 """    // "sql statement" is OK, but only '' 
}
// Return value
{
  "columns" : [     ,              // Header
    {
      "name" : "age",               // attribute
      "type" : "long"               // type
    },
    {
      "name" : "name",
      "type" : "text"
    }
  ],
  "rows" : [
    [
      26,
      "xia"
    ]
  ]
}
  2. Convert SQL to DSL
GET /_sql/translate 
{
  "query": """
  SELECT * FROM "test" limit 10
  """
}
// Return value
{
  "size" : 10,
  "_source" : false,
  "fields" : [                          // Fields to query
    {
      "field" : "age"
    },
    {
      "field" : "name"
    }
  ],
  "sort" : [
    {
      "_doc" : {
        "order" : "asc"
      }
    }
  ]
}
  3. Mixing SQL with query DSL
GET /_sql 
{
  "query": """
  SELECT * FROM "test" limit 10
  """,
  "filter":{
        "range": {
            "age": {
                "gte" : 20,
                "lte" : 35
            }
        }
    },
    "fetch_size": 10

}
// The return value is the same as above
  4. Nested objects
PUT /test_index/sql_type/1
{
  "name":"xia",
  "info":{
    "iphone":"1352468487"
  }
}

GET /_sql 
{
  "query": """
  SELECT name, info.iphone FROM "test_index" limit 10           // If you directly query *, an error will be reported
  """
}

// Array support is poor, and no workaround has been found so far

  5. Syntax

SELECT select_expr [, ...]
[ FROM table_name ]
[ WHERE condition ]
[ GROUP BY grouping_element [, ...] ]
[ HAVING condition]
[ ORDER BY expression [ ASC | DESC ] [, ...] ]
[ LIMIT [ count ] ]
[ PIVOT ( aggregation_expr FOR column IN ( value [ [ AS ] alias ] [, ...] ) )]
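
For example, grouping and aggregating through SQL works roughly like this sketch (grouping on age here, since the grouped field must be aggregatable, i.e. numeric or keyword):

GET /_sql?format=txt
{
  "query": """ SELECT age, COUNT(*) AS cnt FROM "test" GROUP BY age """
}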

Script

Language   | Sandbox   | Required plug-ins  | Purpose
-----------+-----------+--------------------+--------------------------------------
painless   | supported | built-in           | Built for Elasticsearch
expression | supported | built-in           | Quickly customize ranking and sorting
mustache   | supported | built-in           | Templates
java       |           | write it yourself  | API

Introduction

By writing scripts, users can evaluate custom expressions in Elasticsearch, so scripting remains one of Elasticsearch's most powerful tools for solving complex problems (custom scoring, custom text relevance, custom filtering, custom aggregation analysis). The following uses Painless as the example.

Painless is a simple, secure scripting language designed for use with Elasticsearch. It is Elasticsearch's default scripting language and can be safely used for inline and stored scripts.

  1. Efficient: Painless is compiled directly to JVM bytecode to take advantage of all possible optimizations the JVM provides. In addition, Painless typically avoids features that require extra slow checks at runtime.
  2. Secure: a whitelist restricts which functions and fields can be accessed, avoiding possible security risks.
  3. Optionally typed: variables and parameters can use explicit types or the dynamic def type.
  4. Simple: Painless implements a syntax naturally familiar to anyone with basic coding experience. It uses a subset of Java syntax with some additional improvements to enhance readability and remove boilerplate.

Usage

GET test/_search
{
  "script_fields": {
    "my_doubled_field": {                                       // Returned property value
      "script": { 
        "source": "doc['age'].value * params['multiplier']",   // Enter the age field * into the parameter 'multiplier'
        "params": {
          "multiplier": 2  // Input parameters can be multiple input parameters
        }
      }
    }
  }
}

// Return value
{
  "took" : 30,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "test",
        "_type" : "test",
        "_id" : "1",
        "_score" : 1.0,
        "fields" : {
          "my_doubled_field" : [    // result
            52
          ]
        }
      }
    ]
  }
}

// Aggregation using a script
GET test/_search
{
    "aggs" : {
        "groups" : {
            "terms" : {
                "script" : {
                    "source": "doc['age'].value",
                    "lang": "painless"
                }
            }
        }
    }
}

Stored scripts

POST _scripts/calculate-score                   // 'calculate-score' is the stored script id, similar to a method name
{
  "script": {
    "lang": "painless",
    "source": "Math.log(_score * 2) + params['my_modifier']"
  }
}

// Retrieve the stored script
GET _scripts/calculate-score

Custom scoring

GET test/_search
{
  "query": {
    "script_score": {                       // Script rating
      "query": {
        "match": {
            "name": "xia"
        }
      },
      "script": {                           // Use script
        "id": "calculate-score",            // Stored script id
        "params": {                         // input parameter 
          "my_modifier": 2      
        }
      }
    }
  }
}
// Return value
{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 2.3266342,    // here
    "hits" : [
      {
        "_index" : "test",
        "_type" : "test",
        "_id" : "1",
        "_score" : 2.3266342,  // result
        "_source" : {
          "name" : "xia",
          "age" : 26
        }
      }
    ]
  }
}

Why is Elasticsearch fast at search? Where does the speed come from?

Elasticsearch uses a structure called an inverted index, which is well suited to fast full-text search. An inverted index consists of a list of all the unique words that appear in any document; for each word there is a list of the documents that contain it.

Inverted index

  1. What is an inverted index?

My understanding: the term (attribute value) is used as the index key, which points to the documents that contain it. An ordinary index such as MySQL's works by finding the key first and then the row records; a key can correspond to multiple row records, but only exact values are matched.

  2. The inverted index in detail (from the official documentation)

For example, suppose we have two documents, and the content field of each document contains the following contents:

The quick brown fox jumped over the lazy dog
Quick brown foxes leap over lazy dogs in summer

To create an inverted index, ES first splits the content field of each document into separate words (commonly called terms or tokens), creates a sorted list of all the unique terms, and then lists which documents each term appears in. The result is as follows:

Term      Doc_1  Doc_2
-------------------------
Quick   |       |  X
The     |   X   |
brown   |   X   |  X
dog     |   X   |
dogs    |       |  X
fox     |   X   |
foxes   |       |  X
in      |       |  X
jumped  |   X   |
lazy    |   X   |  X
leap    |       |  X
over    |   X   |  X
quick   |   X   |
summer  |       |  X
the     |   X   |
------------------------

Now, if we want to search quick brown, we just need to find the document containing each entry:

Term      Doc_1  Doc_2
-------------------------
brown   |   X   |  X
quick   |   X   |
------------------------
Total   |   2   |  1

Both documents match, but the first document matches more terms than the second. If we use a simple similarity algorithm that only counts the number of matching terms, we can say that the first document is more relevant to our query than the second.

Is that perfect? No, there are the following problems

  1. Quick and quick appear as separate terms, but users may think they are the same words.
  2. fox and foxes are very similar, like dog and dogs; They have the same root.
  3. Jump and leap, although they don't have the same root, they mean very similar. They are synonyms.

Using the index above, a search for +Quick +fox would not match any documents. (Remember, the + prefix means the word must be present.) Only documents that contain both Quick and fox satisfy this query, but the first document contains quick fox and the second contains Quick foxes.

What should we do if we want to match two corresponding documents?

For example:

  1. Quick can be lowercased to quick.
  2. foxes can be stemmed to its root form fox; similarly, dogs can be stemmed to dog.
  3. jumped and leap are synonyms and can be indexed as the same word jump.

So now the index is like this

Term      Doc_1  Doc_2
-------------------------
brown   |   X   |  X
dog     |   X   |  X
fox     |   X   |  X
in      |       |  X
jump    |   X   |  X
lazy    |   X   |  X
over    |   X   |  X
quick   |   X   |  X
summer  |       |  X
the     |   X   |  X
------------------------

Even so, a search for +Quick +fox would still fail, because Quick no longer exists in our index (it was lowercased to quick). Therefore, the same analysis must be applied to the search input as was applied when building the inverted index. Analyzers and tokenizers are explained next.

  3. Segmentation granularity in practice

When building the inverted index, segment the text with fine granularity; when segmenting the search input, use coarse granularity. This way queries can effectively match the data that should be matched.

Tokenizer & analyzer & character filter & token filter

  1. What is a tokenizer

A tokenizer is the component that breaks a string into individual terms (tokens). The standard tokenizer used by the standard analyzer splits a string into terms on word boundaries and removes most punctuation. For example, "I'm Chinese" would be split into terms such as I'm and Chinese according to the tokenizer's rules.

An analyzer contains exactly one tokenizer.

2. Character filter

A character filter is used to tidy up a string before tokenization.
For example, if our text is in HTML format, it will contain HTML tags such as <p> or <div> that we do not want indexed.
We can use the html_strip character filter to remove all HTML tags,
and convert HTML entities such as &Aacute; into the corresponding Unicode character Á.

An analyzer may have zero or more character filters.

3. Token filter

After tokenization, the resulting token stream passes through the specified token filters in the specified order.
Token filters can modify, add, or remove tokens.

There are many token filters to choose from in Elasticsearch. Stemming filters reduce words to their stems.
The ascii_folding filter removes diacritics, converting a word like "très" into "tres".
The ngram and edge_ngram token filters can produce tokens suitable for partial matching or autocomplete.

  4. What is an analyzer

An analyzer normalizes raw, unprocessed text into the tokens required for the inverted index; it combines character filters, a tokenizer, and token filters.

  5. Built-in analyzers, tokenizers, character filters, and token filters

analyzer

  • Standard analyzer (default): divides text on word boundaries, as defined by the Unicode text segmentation algorithm. It removes most punctuation, lowercases terms, and supports removing stop words.
  • Simple analyzer: splits text wherever a character is not a letter, and lowercases terms.
  • Whitespace analyzer: splits text whenever it encounters a whitespace character; it does not lowercase terms.
  • Stop analyzer: like the simple analyzer, but adds support for removing stop words. By default it uses the _english_ stop words list.
  • Keyword analyzer: very simple; it outputs the entire input as a single term. You can also map the field as not_analyzed to achieve the same goal.
  • Pattern analyzer: allows flexible text splitting using regular expressions.
  • Language analyzers: analyzers designed for specific languages, such as english or french.

Tokenizer

  • Standard tokenizer: divides text on word boundaries, as defined by the Unicode text segmentation algorithm, and removes most punctuation. This is the default tokenizer.
  • Lowercase tokenizer: splits text whenever it encounters a character that is not a letter, and also lowercases all terms.
  • Whitespace tokenizer: splits text whenever a whitespace character is encountered.
  • N-gram tokenizer: breaks text into words when it encounters any character from a specified list (e.g. spaces or punctuation), then emits a sliding window of consecutive letters for each word, e.g. quick → [qu, ui, ic, ck].

Elasticsearch has many built-in tokenizers. Those interested can check the official website:
https://www.elastic.co/guide/en/elasticsearch/reference/7.15/analysis-tokenizers.html

Character filter

  • HTML strip character filter: preprocesses text by stripping HTML markup
  • Pattern replace character filter: preprocesses text using regular expressions
  • Mapping character filter: replaces configured characters with configured replacements

Token filtering

  • Lowercase filter: lowercases the tokens, e.g. "The Lazy DoG" becomes "the lazy dog".
  • Remove duplicates filter: removes duplicate tokens that occur at the same position.

There are many built-in filters in Elasticsearch. Those interested can check it on the official website
https://www.elastic.co/guide/en/elasticsearch/reference/7.15/analysis-tokenfilters.html

  6. The IK analyzer (a custom analysis plugin)

When querying Chinese data in Elasticsearch with the default analyzer, the segmentation results are not ideal: a field is split into individual Chinese characters, and the search input is segmented the same way. This is clearly unwise, hence the replacement, the IK analyzer.
GitHub address: https://github.com/medcl/elasticsearch-analysis-ik/releases
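
Once the plugin is installed, its two analyzers (ik_max_word, fine-grained; ik_smart, coarse-grained) can be tried with the _analyze API, and a field can be indexed with fine-grained segmentation while being searched with coarse-grained segmentation, matching the granularity advice given earlier (a sketch; the index name and text are illustrative):

GET /_analyze
{
  "analyzer": "ik_max_word",    // fine-grained: produces as many terms as possible
  "text": "中华人民共和国"
}

PUT /ik_demo
{
  "mappings": {
    "properties": {
      "content": {
        "type": "text",
        "analyzer": "ik_max_word",        // used when indexing
        "search_analyzer": "ik_smart"     // used when analyzing search input
      }
    }
  }
}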

  7. The core data structure for segmentation: the dictionary tree (Trie)

A Trie, or prefix tree (word lookup tree), is a tree structure and a variant of the hash tree. Its typical application is counting, sorting, and storing large numbers of strings (though not limited to strings), so it is often used by search engine systems for word-frequency statistics over text. Its advantages: it uses the common prefixes of strings to reduce query time and minimize unnecessary string comparisons, so lookup is more efficient than with a hash tree.

Three properties of a Trie:
  1. The root node contains no character; every other node contains exactly one character.
  2. The string corresponding to a node is the concatenation of the characters along the path from the root to that node.
  3. The children of any node all contain different characters.
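
To make the structure concrete, here is a minimal Trie sketch in Java (illustrative only; the IK analyzer's real dictionary implementation is more involved):

import java.util.HashMap;
import java.util.Map;

class Trie {
    private static class Node {
        Map<Character, Node> children = new HashMap<>();
        boolean isWord;                    // true if the path from the root spells a complete word
    }

    private final Node root = new Node(); // the root node holds no character

    public void insert(String word) {
        Node cur = root;
        for (char c : word.toCharArray()) {
            cur = cur.children.computeIfAbsent(c, k -> new Node());
        }
        cur.isWord = true;
    }

    public boolean contains(String word) {
        Node cur = root;
        for (char c : word.toCharArray()) {
            cur = cur.children.get(c);
            if (cur == null) return false; // shared prefixes cut the search short
        }
        return cur.isWord;
    }

    public static void main(String[] args) {
        Trie dict = new Trie();
        dict.insert("quick");
        dict.insert("quiet");
        System.out.println(dict.contains("quick")); // true
        System.out.println(dict.contains("qui"));   // false, only a prefix
    }
}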

Analyzer workflow

  • First, the character filters preprocess the text to be analyzed, for example removing HTML tags from the raw text or replacing characters according to a mapping.

  • The filtered text is then passed to the tokenizer, which splits it into a token stream, i.e. one token after another.

  • Next, the token filters process the token stream, for example removing stop words, reducing words to their stems, or converting words into their synonyms.

  • Finally, the filtered token stream is stored in the inverted index.

  • When Elasticsearch receives a search request, it analyzes the query conditions with the same analyzer, rebuilds the query from the analyzed result, searches the inverted index, and completes the full-text search.
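
The whole pipeline can be observed with the _analyze API (a sketch using only built-in components):

GET /_analyze
{
  "char_filter": ["html_strip"],          // 1. character filter: strip HTML tags
  "tokenizer": "standard",                // 2. tokenizer: split the text into tokens
  "filter": ["lowercase"],                // 3. token filter: lowercase each token
  "text": "<p>The QUICK Brown Foxes</p>"
}
// Returns the tokens: the, quick, brown, foxes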

Distributed execution search

In Elasticsearch, distributed storage and execution is also a way to improve efficiency; after all, the bottleneck of a single-node system objectively exists. So how does Elasticsearch execute queries over distributed data?

Search execution in Elasticsearch is divided into two phases, known in Elasticsearch as query then fetch.

1. Query phase

The query phase consists of the following three steps:

  1. The client sends a search request to Node 3, which creates an empty priority queue of size from + size.
  2. Node 3 forwards the search request to a primary or replica shard of every shard of the index. Each shard executes the query locally and adds its results to a local sorted priority queue of size from + size.
  3. Each shard returns the document IDs and sort values of all documents in its priority queue to the coordinating node, Node 3, which merges them into its own priority queue to produce a globally sorted result list.

When a search request is sent to a node, that node becomes the coordinating node. Its task is to broadcast the query to all relevant shards and merge their responses into a globally sorted result set, which is returned to the client.

The first step is to broadcast the request to a shard copy of every shard in the index. The query can be handled by either a primary shard or a replica shard, which is why more replicas (combined with more hardware) increase search throughput. The coordinating node round-robins through all shard copies on subsequent requests to spread the load.

Each shard executes the query locally and builds a priority queue of length from + size, i.e. each shard produces a result set large enough to satisfy the global request. The shard returns a lightweight result list to the coordinating node, containing only the document IDs and any values needed for sorting, such as _score.

The coordinating node merges these shard-level results into its own sorted priority queue, which represents the globally sorted result set. This completes the query phase.

2. Fetch phase

The fetch phase consists of the following steps:

  1. The coordinating node identifies which documents need to be fetched and issues multi-get requests to the relevant shards.
  2. Each shard loads the documents, enriches them if necessary, and returns them to the coordinating node.
  3. Once all documents have been fetched, the coordinating node returns the results to the client.

The coordinating node first determines which documents actually need to be fetched. For example, if the query specifies {"from": 90, "size": 10}, the first 90 results are discarded and only the 10 results starting from the 91st need to be fetched. These documents may come from one, several, or even all of the shards involved in the original search request.

The coordinating node builds a multi-get request for each shard that holds relevant documents and sends it to the same shard copies that handled the query phase.

Each shard loads the document bodies (the _source field) and, if required, enriches the result documents with metadata and search-snippet highlighting. Once the coordinating node has received all the result documents, it assembles them into a single response and returns it to the client.

Of course, there is far more to Elasticsearch than this; I hope you will keep studying and consult the official website.

Official website address: https://www.elastic.co/cn/

Follow my official account to collect plenty of learning materials and interview resources, and to discuss technical solutions.
