Elasticsearch data batch query and backup schemes

1. mget and size

ES returns ten documents per query by default; you can ask for more by setting size:

{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "entname": "Huawei Technology Co., Ltd"
          }
        }
      ]
    }
  },
  "size":20
}

When size is set greater than 10000, the query returns an error:

Error message: "reason": "Result window is too large, from + size must be less than or equal to: [10000]". This limit (index.max_result_window) can be raised in the index settings, but changing it is not recommended.
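Not recommended, but for reference, a minimal sketch of raising the limit on a single index (the index name test_index is only an assumption for illustration):

PUT /test_index/_settings
{
  "index": {
    "max_result_window": 20000
  }
}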

mget can be used when multiple query conditions each return one (or a few) documents, for example querying the documents with id 1 and 2:

GET /_mget
{
   "docs" : [
      {
         "_index" : "test_index",
         "_type" :  "test_type",
         "_id" :    1
      },
      {
         "_index" : "test_index",
         "_type" :  "test_type",
         "_id" :    2
      }
   ]
}

In Java, MultiGetRequest is used to set the query parameters (a sketch follows the query below). Incidentally, an ES query can also match against a set of values, for example built with QueryBuilders.termsQuery("showTemp", "1", "2", "3"...). The corresponding query statement:

{
  "query": {
    "bool": {
      "must": [
        {
          "terms": {
            "showTemp": [
              "1",
              "2"
            ]
          }
        }
      ]
    }
  }
}
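As referenced above, a minimal Java sketch of the mget call with MultiGetRequest (it assumes an already-initialised RestHighLevelClient named client, like the one used in the scroll example further down; index, type and ids follow the REST example):

// Assumes an existing RestHighLevelClient named client; exception handling omitted
MultiGetRequest request = new MultiGetRequest();
request.add(new MultiGetRequest.Item("test_index", "test_type", "1"));
request.add(new MultiGetRequest.Item("test_index", "test_type", "2"));
MultiGetResponse response = client.multiGet(request);
for (MultiGetItemResponse item : response.getResponses()) {
    // Skip ids that were not found or failed
    if (item.getResponse() != null && item.getResponse().isExists()) {
        System.out.println(item.getResponse().getSourceAsString());
    }
}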

bulk API

In Java, BulkRequestBuilder (or BulkRequest with the REST high level client) is used to construct the request; its usage is easy to look up. It batches multiple operations into a single request, and of course it can also be used for batch insertion.
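A minimal Java sketch of a batch insert with BulkRequest (the client variable, index name and field values are assumptions for illustration):

// Assumes an already-initialised RestHighLevelClient named client
BulkRequest bulkRequest = new BulkRequest();
bulkRequest.add(new IndexRequest("test_index", "test_type", "1")
        .source(XContentType.JSON, "entname", "Huawei Technology Co., Ltd"));
bulkRequest.add(new IndexRequest("test_index", "test_type", "2")
        .source(XContentType.JSON, "entname", "Another company"));
BulkResponse bulkResponse = client.bulk(bulkRequest);
if (bulkResponse.hasFailures()) {
    // Log and retry the failed items as needed
    System.out.println(bulkResponse.buildFailureMessage());
}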

scrollId

The bulk and mget APIs above only batch the query conditions, and size only enlarges a single page of results. If you really want to pull back the entire result set, use the scroll (scrollId) cursor query. Java example:

    /**
     * Batch cursor query by time range
     *
     * @param index     index name
     * @param type      type name
     * @param dateKey   name of the update-time field stored in ES
     * @param startDate start time
     * @param endDate   end time
     * @return all hits in the time range, each as a source map
     */
    public List<Map<String, Object>> searchByDate(String index, String type, String dateKey, String startDate, String endDate) throws IOException {
        List<Map<String, Object>> list = new ArrayList<>();
        RestHighLevelClient client = EsService.getClient();
        final Scroll scroll = new Scroll(TimeValue.timeValueMinutes(1L));
        SearchRequest searchRequest = new SearchRequest(index);
        searchRequest.scroll(scroll);
        SearchSourceBuilder searchSourceBuilder = new SearchSourceBuilder();
        searchSourceBuilder.size(2000);
        searchSourceBuilder.query(QueryBuilders.rangeQuery(dateKey).gte(startDate).lt(endDate));
        searchRequest.source(searchSourceBuilder);
        SearchResponse searchResponse = client.search(searchRequest);
        String scrollId = searchResponse.getScrollId();
        SearchHit[] searchHits = searchResponse.getHits().getHits();
        while (searchHits != null && searchHits.length > 0) {
            for (int i = 0; i < searchHits.length; i++) {
                list.add(searchHits[i].getSourceAsMap());
            }
            SearchScrollRequest scrollRequest = new SearchScrollRequest(scrollId);
            scrollRequest.scroll(scroll);
            searchResponse = client.searchScroll(scrollRequest);
            scrollId = searchResponse.getScrollId();
            searchHits = searchResponse.getHits().getHits();
        }
        //Once scrolling is complete, clear the scrolling context
        ClearScrollRequest clearScrollRequest = new ClearScrollRequest();
        clearScrollRequest.addScrollId(scrollId);
        ClearScrollResponse clearScrollResponse = client.clearScroll(clearScrollRequest);
        boolean succeeded = clearScrollResponse.isSucceeded();
        return list;
    }
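A hypothetical call (index, type and field names are illustrative):

List<Map<String, Object>> rows = searchByDate("test_index", "test_type", "updateTime", "2021-01-01", "2021-02-01");

Note that the method accumulates every hit in memory; for very large exports it is better to process each batch of hits inside the while loop instead of collecting them all into one list.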

snapshot

If you need to back up ES data, you can use snapshots. Note that if you configure snapshots for certain indices in the ES configuration, ES must be restarted for the setting to take effect.
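For reference, a minimal sketch of a filesystem snapshot (the repository name my_backup and the location /mnt/es_backups are assumptions; the location must be whitelisted under path.repo in elasticsearch.yml, which is the setting that requires the restart):

PUT /_snapshot/my_backup
{
  "type": "fs",
  "settings": {
    "location": "/mnt/es_backups"
  }
}

PUT /_snapshot/my_backup/snapshot_1?wait_for_completion=true
{
  "indices": "test_index"
}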

reindex

Similarly, reindex can also be used for ES data backup and migration, but only from ES to ES. If you need to back up to files, choose one of the schemes below.
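A minimal sketch of an in-cluster backup by reindexing (the index names are illustrative; reindexing from a remote cluster additionally requires the source host to be whitelisted in reindex.remote.whitelist):

POST /_reindex
{
  "source": { "index": "my_index" },
  "dest":   { "index": "my_index_backup" }
}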

esdump

install

npm install elasticdump

There are many ways to install it, including running it via Docker; pick whichever suits your situation. Usage is as follows:

# Copy the analyzer (word segmentation) settings
elasticdump \
  --input=http://ip1:9200/my_index \
  --output=http://ip2:9200/my_index \
  --type=analyzer
# Copy the mapping
elasticdump \
  --input=http://ip1:9200/my_index \
  --output=http://ip2:9200/my_index \
  --type=mapping
# Copy the data
elasticdump \
  --input=http://ip1:9200/my_index \
  --output=http://ip2:9200/my_index \
  --type=data

The above copies from ES to ES; --output can also point to a file.
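For example, dumping the data to a local JSON file (the file path is illustrative):

elasticdump \
  --input=http://ip1:9200/my_index \
  --output=/data/my_index.json \
  --type=data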

logstash

Anyone who knows ELK knows what Logstash is for: it is usually used to ship log files into ES for statistics and viewing. Can it be turned around? Of course. You can also export ES data to files, or even to other databases.

Installing Logstash is fairly simple. Here is an example of using Logstash to export ES data to HDFS:

input {
    elasticsearch {
        hosts => "es:9200"
        index => "food_business_license"
        size => 10000
        query => '
{

  "query": {
    "bool": {
      "must": [
        {
          "term": {
            "PROVINCE": {
              "value": "Tibet"
            }
          }
        }
      ]
    }
  }
}
'
        scroll => "5m"
        docinfo => true
    }
}
filter {
    if ![hh] {
        mutate {
            add_field => {
                "hh" => "ha-ha"
            }
        }
    }
}

output {
    webhdfs {
        host => "ip"
        port => "port"
        user => "spark"
        flush_size => 5000
        idle_flush_time => 10
        path => "/tmp/%{+YYYY}-%{+MM}-%{+dd}/food-%{+YYYY}%{+MM}%{+dd}.csv"
        codec => line {
            format => "%{LEGAL_PERSON}\u0001%{TAXPAYER_NAME}\u0001%{hh}\u0001%{LEGAL_PERSON}"
        }
    }
}

Input is the data source, read here from ES, so it holds the ES connection and the query; filter handles data formats and field conversion; output configures the output location and format.

Both the esdump and logstash batch exports are based on scroll queries under the hood. In my tests, logstash felt faster.

 
