Contents
1. Paging by command
1.1 Common paging method: from + size
1.2 scroll mode
1.3 search_after mode
2. Elasticsearch paging using the Java API
2.1 Shallow paging: from and size
2.2 Deep paging using scroll
1. Paging by command
1.1 Common paging method: from + size
The default paging mode of Elasticsearch is from + size. In deep paging scenarios, however, this method is very inefficient. For example, with from=5000 and size=10, every shard has to match and sort its documents to produce its top 5000 + 10 candidates, and the coordinating node then merges all of them and returns only the final 10. Besides the efficiency problem, there is a hard limit: the maximum window ES supports is max_result_window (10000 by default), so when from + size > max_result_window, ES returns an error.
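For illustration, a plain from + size request looks like the following (a minimal sketch; the host and the chuyun index are the same ones used in the examples later in this article):
curl -H "Content-Type: application/json" -XGET '192.168.200.100:9200/chuyun/_search?pretty' -d' {"from": 5000, "size": 10, "query": {"match_all": {}}}'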
Problem Description:
For example, a customer's online ES deployment stopped returning data once users paged several hundred pages deep. To restore normal use quickly, an emergency workaround is to raise max_result_window to 50000.
Solution:
curl -H "Content-Type: application/json" -XPUT "127.0.0.1:9200/custm/_settings" -d '{ "index" : { "max_result_window" : 50000 } }'
The above is only a temporary workaround. As ES usage grows, data volumes increase, and deep-paging scenarios become more complex, another paging method, scroll, can be used instead.
1.2 scroll mode
To support deep paging scenarios, ES provides the scroll mode for paged reads. The principle: the initial query generates a cursor, the scroll_id, and subsequent queries simply fetch data with this cursor until the hits field in the returned result is an empty list, which means the traversal is finished. Scroll is not meant for real-time queries, since it issues multiple requests to ES; its main purpose is to read a large amount of data, or all of it.
With scroll you can only fetch one page at a time; each response returns a scroll_id, and with that scroll_id you keep fetching the next page. Scroll is therefore not suitable for scenarios that require jumping to an arbitrary page.
The process of deep paging with curl is as follows:
1. Get the first scroll_id. The URL includes the /index/type path and the scroll parameter. The scroll parameter specifies the lifetime of the scroll_id; after it expires, ES clears it automatically.
curl -H "Content-Type: application/json" -XGET '192.168.200.100:9200/chuyun/_search?pretty&scroll=2m' -d' {"query":{"match_all":{}}, "sort": ["_doc"]}'
2. During traversal, use the scroll_id from the previous response, pass the scroll parameter again, and repeat this step until the returned data is an empty list, which indicates the traversal is complete.
Note: pass the scroll parameter on every request to refresh the expiry time of the search context, and do not set it too long, since the context occupies memory. Subsequent requests do not need to specify the index and type.
curl -H "Content-Type: application/json" -XGET '192.168.200.100:9200/_search/scroll?pretty' -d' { "scroll" : "2m", "scroll_id" : "DnF1ZXJ5VGhlbkZldGNoBQAAAAAAAABWFm43cDd3eERJVHNHMHJzSlNkajdPUHcAAAAAAAAAVxZuN3A3d3hESVRzRzByc0pTZGo3T1B3AAAAAAAAAFsWazlvUFptQnNTdXlmNmZRTl80cVdCdwAAAAAAAABVFm43cDd3eERJVHNHMHJzSlNkajdPUHcAAAAAAAAAWhZrOW9QWm1Cc1N1eWY2ZlFOXzRxV0J3" }'
3. Deleting scrolls
Delete all scroll_ids:
curl -XDELETE 192.168.200.100:9200/_search/scroll/_all
Delete a specific scroll_id:
curl -H "Content-Type: application/json" -XDELETE 192.168.200.100:9200/_search/scroll -d '{"scroll_id" : ["cXVlcnlBbmRGZXRjaDsxOzg3OTA4NDpTQzRmWWkwQ1Q1bUlwMjc0WmdIX2ZnOzA7"]}'
1.3 search_after mode
Note: when using search_after, from must be set to 0. Take the sort values of the last document returned by the previous query and pass them as the search_after parameter. In the examples below, _id is used as the unique sort value.
The scroll method is officially not recommended for real-time requests (it is generally used for data export), because every scroll_id not only holds considerable resources but also represents a historical snapshot: changes to the data are not reflected in that snapshot.
The search_after paging method determines the position of the next page from the last document of the previous page. During paging, any additions, deletions, or updates to the index are reflected in the cursor in real time. Note, however, that since each page depends on the last document of the previous page, it is impossible to jump to an arbitrary page.
To locate the last document of each page, every document must carry a globally unique sort value; the official recommendation is to use _uid as that value, although a business id can also be used.
For example, in the following query I first sort by _id in descending order:
curl -H "Content-Type: application/json" -XGET '192.168.200.100:9200/chuyun/_search?pretty' -d' { "size": 2, "from": 0, "sort": [ { "_id": { "order": "desc" } } ] }'
The query result is:
{ "took" : 7, "timed_out" : false, "_shards" : { "total" : 5, "successful" : 5, "skipped" : 0, "failed" : 0 }, "hits" : { "total" : 3, "max_score" : null, "hits" : [ { "_index" : "chuyun", "_type" : "article", "_id" : "3", "_score" : null, "_source" : { "id" : 3, "title" : "<Qing Yu An·New year's Eve", "content" : "The east wind puts flowers and thousands of trees at night, and the stars blow down like rain. BMW carved cars all over the road. The sound of the Phoenix flute moves, the light of the jade pot turns, and the fish and dragon dance all night. Moths, snow willows and gold wisps, smile and faint fragrance. The crowd looked for him thousands of times. Suddenly looking back, the man was in the dim light.", "viewCount" : 786, "createTime" : 1557471088252, "updateTime" : 1557471088252 }, "sort" : [ "3" ] }, { "_index" : "chuyun", "_type" : "article", "_id" : "2", "_score" : null, "_source" : { "id" : 2, "title" : "<Butterflies love flowers", "content" : "Standing on the dangerous building, the wind is thin, looking at the extreme spring sorrow, and the sky is dark. Grass color, smoke and light, speechless, who will stop meaning. It is planned to get drunk and sing when drinking. Strong music is tasteless. my clothes grow daily more loose, yet care I not. For you am I thus wasting away in sorrow and pain.", "viewCount" : null, "createTime" : 1557471087998, "updateTime" : 1557471087998 }, "sort" : [ "2" ] } ] } }
Assign search_after the sort values returned by the last query, and search the next page:
curl -H "Content-Type: application/json" -XGET '192.168.200.100:9200/chuyun/_search?pretty' -d' { "size": 2, "from": 0, "search_after": [ 2 ], "sort": [ { "_id": { "order": "desc" } } ] }'
The query result is:
{ "took" : 12, "timed_out" : false, "_shards" : { "total" : 5, "successful" : 5, "skipped" : 0, "failed" : 0 }, "hits" : { "total" : 3, "max_score" : null, "hits" : [ { "_index" : "chuyun", "_type" : "article", "_id" : "1", "_score" : null, "_source" : { "id" : 1, "title" : "<Butterflies love flowers", "content" : "The threshold chrysanthemum is worried about the smoke, the orchid weeps and dew, the Luo curtain is light and cold, and the swallows fly away. The bright moon does not know the bitterness of leaving, and the oblique light wears Zhu Hu at dawn. Last night, the west wind withered the green trees and went up the tall buildings alone, looking at the end of the world. If you want to send color paper and ruler, where do you know?", "viewCount" : 678, "createTime" : 1557471087754, "updateTime" : 1557471087754 }, "sort" : [ "1" ] } ] } }
2. Elasticsearch paging using the Java API
Consider the general query flow when, for example, I want the first 10 documents:
1. The client sends a request to a node.
2. The node forwards the request to each shard, and each shard finds its own top 10 documents.
3. The results are returned to the coordinating node, merged, and the overall top 10 are extracted.
4. The result is returned to the requesting client.
When I want to query items 10 to 20, I need a paged query.
Tools:
/**
 * Builds an Elasticsearch client.
 */
public class LowClientUtil {

    public TransportClient createClient() throws Exception {
        // Build the settings first
        Settings settings = Settings.builder()
                .put("cluster.name", "elasticsearch1")
                .put("client.transport.ignore_cluster_name", true) // still connect if the cluster name does not match
                .build();
        // Create the client
        TransportClient client = new PreBuiltTransportClient(settings)
                .addTransportAddress(new TransportAddress(InetAddress.getByName("192.168.200.100"), 9300));
        return client;
    }
}
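It can then be used like this (a one-line usage sketch):
TransportClient client = new LowClientUtil().createClient();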
Prepare data:
/**
 * Prepare test data: index 100 documents.
 */
public static void createDocument100() throws Exception {
    for (int i = 1; i <= 100; i++) {
        try {
            HashMap<String, Object> map = new HashMap<>();
            map.put("title", "Book " + i);
            map.put("author", "author " + i);
            map.put("id", i);
            map.put("message", "Book " + i + " is a science book by the British physicist Stephen Hawking, first published in 1988.");
            IndexResponse response = client.prepareIndex("blog2", "article")
                    .setSource(map)
                    .get();
            String _index = response.getIndex();   // index name
            String _type = response.getType();     // type
            String _id = response.getId();         // document id
            long _version = response.getVersion(); // version
            RestStatus status = response.status(); // returned operation status
            System.out.println("index: " + _index + "  type: " + _type + "  doc id: " + _id
                    + "  version: " + _version + "  status: " + status);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
2.1 Shallow paging: from and size
Query a batch of data page by page, starting from item 10 and returning 10 items per page.
/**
 * from-size paging: implemented with SearchRequestBuilder's setFrom() (0-based start index)
 * and setSize() (number of records per query).
 */
public static void sortPages() {
    // Build the search; by default ES returns 10 records per page
    SearchRequestBuilder searchRequestBuilder = client.prepareSearch("blog2")
            .setTypes("article")
            .setQuery(QueryBuilders.matchAllQuery());
    final long totalHits = searchRequestBuilder.get().getHits().getTotalHits(); // total number of hits
    final int pageDocument = 10; // documents per page
    final long totalPage = totalHits / pageDocument; // total pages (integer division; a partial last page is ignored)
    for (int i = 1; i <= totalPage; i++) {
        System.out.println("===================== current page: " + i + " ==============");
        // setFrom(): index of the first document to retrieve, default 0.
        // setSize(): number of documents to fetch.
        // i starts at 1, so the first page returned begins at item 10, as described above.
        searchRequestBuilder.setFrom(i * pageDocument).setSize(pageDocument);
        SearchResponse searchResponse = searchRequestBuilder.get();
        SearchHits hits = searchResponse.getHits();
        Iterator<SearchHit> iterator = hits.iterator();
        while (iterator.hasNext()) {
            SearchHit searchHit = iterator.next(); // each hit
            System.out.println(searchHit.getSourceAsString()); // print the source as a string
        }
    }
}
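Note that the max_result_window limit from section 1.1 applies to the Java API as well: once from + size exceeds 10000 (by default), the request fails, so this method is only suitable for shallow pages.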
2.2 Deep paging using scroll
With the from + size mode, when Elasticsearch responds to a request it must determine the global order of the docs to assemble the response.
If the requested page is shallow (say 20 docs per page), Elasticsearch has no performance problem. But for a deep page, such as page 20, Elasticsearch has to collect all docs from page 1 through page 20 and then discard pages 1 through 19 to obtain page 20, which is very expensive.
The solution is scroll. Scroll maintains a snapshot of the current index segments, taken when the scroll query is first executed; data indexed after that point will not appear in this snapshot. Compared with from and size, it does not "query all the data and discard the unwanted part"; instead it "records a read position so that the next read is fast".
Scroll can be divided into two steps: initialization and traversal:
1. During initialization, all search results matching the query are cached as a snapshot.
2. During traversal, data is taken from this snapshot. In other words, inserts, deletions, and updates made after initialization do not affect the traversal results.
public static void scrollPages() {
    // Initialize the scroll: set the index, batch size, and scroll keep-alive, then send the request.
    // (SearchType.SCAN was removed in 5.x; with the default search type the first response already contains hits.)
    SearchResponse searchResponse = client.prepareSearch("blog2")
            .setSearchType(SearchType.DEFAULT)
            .setQuery(QueryBuilders.matchAllQuery())
            .setSize(10)
            .setScroll(new TimeValue(60000)) // keep the scroll context alive for 60 s between requests
            .execute().actionGet();
    long totalCount = searchResponse.getHits().getTotalHits(); // total number of hits
    System.out.println("Total hits: =================" + totalCount + "=============");
    int page = 0;
    // Keep traversing until a batch comes back empty.
    while (searchResponse.getHits().getHits().length > 0) {
        page++;
        System.out.println("========================= page: " + page + " ==================");
        for (SearchHit searchHit : searchResponse.getHits()) {
            System.out.println(searchHit.getSourceAsString()); // print the source as a string
        }
        // Fetch the next batch with the scroll_id from the previous response.
        searchResponse = client.prepareSearchScroll(searchResponse.getScrollId())
                .setScroll(new TimeValue(60000))
                .execute().actionGet();
    }
}
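As with the curl examples in section 1.2, the scroll context should be released once the traversal is done. A minimal sketch using the transport client's clear-scroll API (assuming the searchResponse variable from the loop above is still in scope):
// Release the scroll context once traversal is finished
ClearScrollResponse clearResponse = client.prepareClearScroll()
        .addScrollId(searchResponse.getScrollId())
        .execute().actionGet();
System.out.println("scroll cleared: " + clearResponse.isSucceeded());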