28. An in-depth look at paging search and the deep paging performance problem

1. Paging search syntax in Elasticsearch

from: the offset of the first result to return (which position in the result set to start from)

size: how many documents to return

GET /_search?size=10
GET /_search?size=10&from=0
GET /_search?size=10&from=20
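The relationship between page number, from, and size is simple arithmetic. A minimal sketch (the function name `page_to_from` is illustrative, not part of any ES API):

```python
# Compute the "from" offset Elasticsearch expects for a given page.
# Pages are 1-based here; "from" is a 0-based offset into the result set.

def page_to_from(page: int, size: int) -> int:
    """Offset of the first document on a 1-based page."""
    return (page - 1) * size

# With size=3: page 1 starts at offset 0, page 2 at 3, page 3 at 6.
print(page_to_from(1, 3))  # 0
print(page_to_from(2, 3))  # 3
print(page_to_from(3, 3))  # 6
```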

First, query without any paging parameters (the default size is 10):

GET /test_index/test_type/_search

Response results

{
  "took": 27,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 9,
    "max_score": 1,
    "hits": [
      {
        "_index": "test_index",
        "_type": "test_type",
        "_id": "8",
        "_score": 1,
        "_source": {
          "test_field": "test client 2"
        }
      },
      {
        "_index": "test_index",
        "_type": "test_type",
        "_id": "10",
        "_score": 1,
        "_source": {
          "test_field1": "test1",
          "test_field2": "updated test2"
        }
      },
      {
        "_index": "test_index",
        "_type": "test_type",
        "_id": "12",
        "_score": 1,
        "_source": {
          "test_field": "test12"
        }
      },
      {
        "_index": "test_index",
        "_type": "test_type",
        "_id": "4",
        "_score": 1,
        "_source": {
          "test_field1": "test field111111"
        }
      },
      {
        "_index": "test_index",
        "_type": "test_type",
        "_id": "6",
        "_score": 1,
        "_source": {
          "test_field": "test test"
        }
      },
      {
        "_index": "test_index",
        "_type": "test_type",
        "_id": "2",
        "_score": 1,
        "_source": {
          "test_field": "replaced test2"
        }
      },
      {
        "_index": "test_index",
        "_type": "test_type",
        "_id": "7",
        "_score": 1,
        "_source": {
          "test_field": "test client 2"
        }
      },
      {
        "_index": "test_index",
        "_type": "test_type",
        "_id": "1",
        "_score": 1,
        "_source": {
          "test_field1": "test field1",
          "test_field2": "bulk test1"
        }
      },
      {
        "_index": "test_index",
        "_type": "test_type",
        "_id": "11",
        "_score": 1,
        "_source": {
          "num": 1,
          "tags": []
        }
      }
    ]
  }
}

Let's split these 9 documents into 3 pages of 3 documents each and observe the effect of paging:

GET /test_index/test_type/_search?from=0&size=3

Response results

{
  "took": 4,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 9,
    "max_score": 1,
    "hits": [
      {
        "_index": "test_index",
        "_type": "test_type",
        "_id": "8",
        "_score": 1,
        "_source": {
          "test_field": "test client 2"
        }
      },
      {
        "_index": "test_index",
        "_type": "test_type",
        "_id": "10",
        "_score": 1,
        "_source": {
          "test_field1": "test1",
          "test_field2": "updated test2"
        }
      },
      {
        "_index": "test_index",
        "_type": "test_type",
        "_id": "12",
        "_score": 1,
        "_source": {
          "test_field": "test12"
        }
      }
    ]
  }
}

Page 1: id=8,10,12

GET /test_index/test_type/_search?from=3&size=3

Response results

{
  "took": 0,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 9,
    "max_score": 1,
    "hits": [
      {
        "_index": "test_index",
        "_type": "test_type",
        "_id": "4",
        "_score": 1,
        "_source": {
          "test_field1": "test field111111"
        }
      },
      {
        "_index": "test_index",
        "_type": "test_type",
        "_id": "6",
        "_score": 1,
        "_source": {
          "test_field": "test test"
        }
      },
      {
        "_index": "test_index",
        "_type": "test_type",
        "_id": "2",
        "_score": 1,
        "_source": {
          "test_field": "replaced test2"
        }
      }
    ]
  }
}

Page 2: id=4,6,2

GET /test_index/test_type/_search?from=6&size=3

Response results

{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 9,
    "max_score": 1,
    "hits": [
      {
        "_index": "test_index",
        "_type": "test_type",
        "_id": "7",
        "_score": 1,
        "_source": {
          "test_field": "test client 2"
        }
      },
      {
        "_index": "test_index",
        "_type": "test_type",
        "_id": "1",
        "_score": 1,
        "_source": {
          "test_field1": "test field1",
          "test_field2": "bulk test1"
        }
      },
      {
        "_index": "test_index",
        "_type": "test_type",
        "_id": "11",
        "_score": 1,
        "_source": {
          "num": 1,
          "tags": []
        }
      }
    ]
  }
}

Page 3: id=7,1,11
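The three requests above are equivalent to slicing the full ordered hit list. A quick client-side sketch (the id order matches the responses shown; in a real cluster the order of identically scored documents is not guaranteed to be stable):

```python
# Reproduce the three pages above by slicing the full ordered hit list.
hit_ids = ["8", "10", "12", "4", "6", "2", "7", "1", "11"]

def page(ids, frm, size):
    """Return the slice of ids that from/size paging would select."""
    return ids[frm:frm + size]

print(page(hit_ids, 0, 3))  # ['8', '10', '12'] -> page 1
print(page(hit_ids, 3, 3))  # ['4', '6', '2']   -> page 2
print(page(hit_ids, 6, 3))  # ['7', '1', '11']  -> page 3
```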

2. What is the deep paging problem? Why does it arise, and what is the underlying principle?

The deep paging performance problem, and an illustrated walkthrough of the principle behind it.

What is deep paging

In short, the search reaches deep into the result set. For example, suppose there are 60,000 documents in total, spread evenly as 20,000 per shard across 3 shards; at 10 documents per page, that is 6,000 pages.

Now suppose you want to search page 1000 (from=10000, size=10). You actually need documents 10001 ~ 10010 of the overall sorted result.

Does each shard just return its own documents 10001 ~ 10010? Not at all!

Your request may arrive at a node that holds none of the index's shards. That node then acts as the coordinating node and forwards the search request to the nodes holding the index's three shards.

To fetch page 1000 of the 60,000 documents, each shard must actually take the top 10,010 of its own 20,000 documents, not just 10. The three shards therefore return a total of 30,030 documents to the coordinating node, which sorts them all by relevance (_score) and then takes only the 10 documents of page 1000.
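The volume involved can be sketched with a small simulation. The scores below are synthetic random values; the point is the arithmetic: each shard must return from+size = 10,010 documents, so the coordinating node receives and sorts 3 × 10,010 = 30,030 before keeping only 10.

```python
# Simulate the coordinating-node merge for from=10000, size=10 over 3 shards.
import random

random.seed(42)
FROM, SIZE, SHARDS, DOCS_PER_SHARD = 10000, 10, 3, 20000

# Each shard holds 20,000 scored docs and returns its top (FROM + SIZE).
shard_results = []
for s in range(SHARDS):
    docs = [((s, i), random.random()) for i in range(DOCS_PER_SHARD)]
    docs.sort(key=lambda d: d[1], reverse=True)
    shard_results.append(docs[:FROM + SIZE])

received = sum(len(r) for r in shard_results)  # documents at the coordinator
merged = sorted((d for r in shard_results for d in r),
                key=lambda d: d[1], reverse=True)
page_docs = merged[FROM:FROM + SIZE]           # the 10 docs actually returned

print(received)       # 30030
print(len(page_docs)) # 10
```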

When the search is this deep, the coordinating node has to hold a large amount of data in memory and sort it, only to take the single requested page and discard the rest. This process consumes network bandwidth, memory, and CPU, so deep paging should be avoided whenever possible.

Tags: ElasticSearch

Posted on Sat, 06 Nov 2021 13:38:04 -0400 by mikeT