21. Deep dive into search technology: using rescoring to optimize the performance of proximity match search

The difference between match and phrase match (proximity match)

Match --> if a doc contains any one of the query terms, it is returned as a result. Elasticsearch only has to scan the inverted index for each term, and that is all.

Phrase match --> first scan the doc list of every term; find the docs that contain all the terms; then, for each of those docs, compute the position of each term and check whether the positions form the phrase. Proximity match (phrase match with slop) needs a further, more complex computation: for the given slop, it must determine whether the terms can be brought into phrase order within that many moves.
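The position check described above can be sketched in a few lines of Python. This is a simplified model, not Lucene's actual implementation: it assumes whitespace-tokenized text, brute-forces every combination of term positions, and measures the distance as the spread of (doc position minus query position) across the terms. The names `min_slop` and `phrase_matches` are made up for illustration.

```python
from itertools import product


def min_slop(doc_tokens, phrase_tokens):
    """Smallest slop at which the phrase would match, or None if a term is missing."""
    positions = []
    for term in phrase_tokens:
        hits = [i for i, tok in enumerate(doc_tokens) if tok == term]
        if not hits:
            return None  # the doc does not contain every term
        positions.append(hits)

    best = None
    # Try every combination of one position per term (fine for a sketch,
    # far too slow for a real index).
    for combo in product(*positions):
        # Shift each position by the term's offset in the phrase; an exact
        # phrase then collapses to a single value, i.e. distance 0.
        shifted = [pos - offset for offset, pos in enumerate(combo)]
        distance = max(shifted) - min(shifted)
        if best is None or distance < best:
            best = distance
    return best


def phrase_matches(doc_tokens, phrase_tokens, slop=0):
    needed = min_slop(doc_tokens, phrase_tokens)
    return needed is not None and needed <= slop


phrase = ["java", "spark"]
print(min_slop("java spark".split(), phrase))
print(min_slop("java is like spark".split(), phrase))
print(min_slop("spark java".split(), phrase))
```

Under this model, "java spark" matches exactly (distance 0), two intervening words need slop 2, and the reversed order "spark java" also needs slop 2. Every candidate doc has to go through this kind of position arithmetic, which is why phrase and proximity match cost more than a plain match.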

The performance of a match query is much higher than that of phrase match and proximity match (with slop), because the latter two have to compute position distances.
As a rough rule of thumb, a match query is about 10 times faster than phrase match and about 20 times faster than proximity match.

But don't worry too much, because Elasticsearch performance is generally at the millisecond level. A match query usually takes a few milliseconds to a few tens of milliseconds, while phrase match and proximity match take between tens and hundreds of milliseconds, so this is acceptable.

Optimizing proximity match generally means reducing the number of documents that proximity match has to examine. The main idea is to use a match query to select the candidate docs, and then use proximity match to boost doc scores according to term distance. Proximity match is applied only to the top N docs of each shard, to readjust their scores. This process is called rescoring. Because users generally page through results and only look at the first few pages, there is no need to run proximity match on every result.

As we said earlier, match + proximity match improves recall and precision at the same time.
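For reference, the plain match + proximity match combination (without rescoring) is usually written as a `bool` query: `match` in `must` for recall, `match_phrase` with `slop` in `should` for the proximity boost. The following is a minimal sketch that only builds the request body as a Python dict; the helper name `build_match_plus_proximity` is hypothetical, and the field and query text are taken from the example in this post.

```python
import json


def build_match_plus_proximity(field, text, slop):
    """Build a bool query: match selects docs, match_phrase boosts close terms."""
    return {
        "query": {
            "bool": {
                # Any doc containing at least one term is a candidate.
                "must": {"match": {field: text}},
                # Docs whose terms are close together get an extra score.
                "should": {
                    "match_phrase": {field: {"query": text, "slop": slop}}
                },
            }
        }
    }


body = build_match_plus_proximity("content", "java spark", slop=50)
print(json.dumps(body, indent=2))
```

In this form the `match_phrase` clause runs against every doc that the `match` clause selects, which is the cost that rescoring is meant to cut.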

For example, match may hit 1000 docs. Without rescoring, proximity match has to run on every one of them to decide whether it matches within the slop, and then contribute its score.
However, users page through results and in most cases look at only the first few pages. With 10 docs per page and at most 5 pages viewed, that is 50 docs.
So proximity match only needs to check the first 50 docs against the slop and contribute its score to them. There is no need to compute and contribute scores for all 1000 docs.

rescore: recompute the score

match: 1000 docs. At this point every doc already has a score;

proximity match: rescore only the first 50 docs. Among those 50 docs, the closer together the terms are, the higher the doc ranks
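The two phases can be simulated for a single shard in Python. This is a sketch under assumptions, not Elasticsearch's code: the hits, their scores, and the `gap` field are invented, `proximity_bonus` is a toy stand-in for the `match_phrase` score, and the combination rule assumed here is the weighted sum of the original score and the rescore score. Docs outside the window are left untouched in this sketch.

```python
def rescore_top_n(hits, rescore_fn, window_size,
                  query_weight=1.0, rescore_query_weight=1.0):
    """Re-rank only the top `window_size` hits; return (hits, evaluated count)."""
    # Phase 1 result: every hit already has a score from the cheap match query.
    ranked = sorted(hits, key=lambda h: h["score"], reverse=True)
    window, rest = ranked[:window_size], ranked[window_size:]

    # Phase 2: run the expensive scorer on the window only.
    rescored = [
        {**h, "score": query_weight * h["score"]
                       + rescore_query_weight * rescore_fn(h)}
        for h in window
    ]
    rescored.sort(key=lambda h: h["score"], reverse=True)

    # Hits outside the window keep their first-phase score and order.
    return rescored + rest, len(window)


def proximity_bonus(hit):
    # Toy scorer: the smaller the gap between the terms, the larger the bonus.
    return 1.0 / (1 + hit["gap"])


# Invented first-phase results: "gap" is the distance between the query terms.
hits = [
    {"id": "a", "score": 1.0, "gap": 9},
    {"id": "b", "score": 0.9, "gap": 0},
    {"id": "c", "score": 0.5, "gap": 0},
]

result, evaluated = rescore_top_n(hits, proximity_bonus, window_size=2)
print([h["id"] for h in result], evaluated)
```

With a window of 2, doc `b` overtakes doc `a` because its terms are adjacent, while doc `c` stays last with its original score: it was never rescored, even though its terms are adjacent too. That is the trade-off of rescoring: less work, at the price of ignoring proximity outside the window.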

GET /forum/article/_search 
{
  "query": {
    "match": {
      "content": "java spark"
    }
  },
  "rescore": {
    "window_size": 50,
    "query": {
      "rescore_query": {
        "match_phrase": {
          "content": {
            "query": "java spark",
            "slop": 50
          }
        }
      }
    }
  }
}
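The same request body can be assembled programmatically. The sketch below only builds and prints the JSON and does not contact a cluster; the helper name `build_rescore_request` is hypothetical. It also exposes `query_weight` and `rescore_query_weight`, two optional rescore parameters that control how the original score and the rescore score are combined (both are understood to default to 1 when omitted).

```python
import json


def build_rescore_request(field, text, window_size, slop,
                          query_weight=None, rescore_query_weight=None):
    """Build a match query plus a match_phrase rescore over the top hits per shard."""
    rescore_query = {
        "rescore_query": {
            "match_phrase": {field: {"query": text, "slop": slop}}
        }
    }
    # Only emit the weights when the caller sets them explicitly.
    if query_weight is not None:
        rescore_query["query_weight"] = query_weight
    if rescore_query_weight is not None:
        rescore_query["rescore_query_weight"] = rescore_query_weight

    return {
        "query": {"match": {field: text}},
        "rescore": {"window_size": window_size, "query": rescore_query},
    }


body = build_rescore_request("content", "java spark", window_size=50, slop=50)
print(json.dumps(body, indent=2))
```

Called with the values above and no weights, the helper produces the request shown in this post. Passing, say, `rescore_query_weight=2.0` would make term proximity count for more in the final ranking of the window.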

Response results

{
  "took": 3,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 1.258609,
    "hits": [
      {
        "_index": "forum",
        "_type": "article",
        "_id": "5",
        "_score": 1.258609,
        "_source": {
          "articleID": "DHJK-B-1395-#Ky5",
          "userID": 3,
          "hidden": false,
          "postDate": "2021-11-11",
          "tag": [
            "elasticsearch"
          ],
          "tag_cnt": 1,
          "view_cnt": 10,
          "title": "this is spark blog",
          "content": "spark is best big data solution based on scala ,an programming language similar to java spark",
          "sub_title": "haha, hello world",
          "author_first_name": "Tonny",
          "author_last_name": "Peter Smith",
          "new_author_last_name": "Peter Smith",
          "new_author_first_name": "Tonny"
        }
      },
      {
        "_index": "forum",
        "_type": "article",
        "_id": "2",
        "_score": 0.68640786,
        "_source": {
          "articleID": "KDKE-B-9947-#kL5",
          "userID": 1,
          "hidden": false,
          "postDate": "2017-01-02",
          "tag": [
            "java"
          ],
          "tag_cnt": 1,
          "view_cnt": 50,
          "title": "this is java blog",
          "content": "i think java is the best programming language",
          "sub_title": "learned a lot of course",
          "author_first_name": "Smith",
          "author_last_name": "Williams",
          "new_author_last_name": "Williams",
          "new_author_first_name": "Smith"
        }
      }
    ]
  }
}

Tags: Java ElasticSearch Cache

Posted on Sun, 21 Nov 2021 23:32:15 -0500 by punky79