ES data write, delete, and update


Data writing process:

[Figure: write process]

Note: a given document is written to only one primary shard, never to multiple primary shards.

[Figure: underlying write logic]

Note: data is first written to the in-memory buffer and, at the same time, to the translog (used for recovery in extreme cases). When the buffered data reaches a threshold, it is flushed to disk in batches (passing through the filesystem cache in between). ES writes are therefore near-real-time (the refresh interval is 1 second by default).
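As a rough illustration of the near-real-time behavior, the refresh interval can be inspected and tuned per index. A minimal sketch (the index name is just an example):

GET company-staff-001/_settings?include_defaults=true&filter_path=**.refresh_interval   # unset means the 1s default

PUT company-staff-001/_settings
{
  "index": {
    "refresh_interval": "5s"    # raise to reduce refresh cost; "-1" disables periodic refresh
  }
}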

Single write (PUT/POST):

1. PUT: you must specify the document ID (the first insert of a given document creates it; a second insert with the same ID updates it)
2. POST: specifying the ID is optional (with no ID, every insert creates a new document and the _id changes each time; with a fixed ID the behavior depends on the endpoint, as shown below and in the POST sketch after the block)
_doc: the first insert with a given ID creates the document; a second insert updates it
_create: the first insert with a given ID creates the document; inserting again reports an error

PUT company-staff-001/_doc/1   # create, or fully overwrite if ID 1 exists
{
  ...
}
PUT company-staff-001/_create/1   # create only; errors if ID 1 already exists
{
  ...
}
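A minimal sketch of point 2 above, POST without an explicit ID (ES auto-generates the _id, so running this twice creates two separate documents; the field is illustrative):

POST company-staff-001/_doc
{
  "companyName": "xx company"
}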

The result field in the response is created or updated, indicating whether the document was newly created or an existing one was updated.
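For reference, a trimmed response to the first PUT above might look like this (a sketch; exact metadata varies by cluster):

{
  "_index": "company-staff-001",
  "_id": "1",
  "_version": 1,
  "result": "created"
}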

Request parameter reference:

op_type: operation type (index or create)
refresh: refresh policy

PUT company-staff-001/_doc/1?routing=1  # route to the primary shard selected by routing value 1
{
  ...
}

routing: routing policy (determines which shard the document hashes to)
wait_for_active_shards: how many shard copies must be active before the write proceeds
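A minimal sketch combining both parameters (values are illustrative; wait_for_active_shards=2 means the primary plus one replica must be active before the write proceeds):

PUT company-staff-001/_doc/1?routing=1&wait_for_active_shards=2
{
  "companyName": "xx company"
}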

GET company-staff-001/_doc/1  # a GET by ID can fetch the just-inserted data immediately, because it reads
                              # directly from the buffer (real-time)
GET company-staff-001/_search?routing=1  # _search routes to the specified shard and reads from disk (non-real-time)
# when writing, you can force near-immediate visibility (still near-real-time) with ?refresh=true
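For example, a write that should be immediately visible to _search (refresh=true forces a refresh at some indexing cost):

PUT company-staff-001/_doc/2?refresh=true
{
  "companyName": "xx company"
}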

Batch write (_bulk):

POST _bulk
{"index":{"_index":"company-staff-001", "_id": "1"}}
...
{"index":{"_index":"company-staff-001", "_id": "2"}}
...
{"index":{"_index":"company-staff-001", "_id": "3"}}
...
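A complete, runnable version of the above with hypothetical document bodies (NDJSON format: each action line is followed by its source line, and the body must end with a newline):

POST _bulk
{"index":{"_index":"company-staff-001","_id":"1"}}
{"companyName":"xx company","userId":"1234"}
{"index":{"_index":"company-staff-001","_id":"2"}}
{"companyName":"yy company","userId":"12345"}
{"index":{"_index":"company-staff-001","_id":"3"}}
{"companyName":"zz company","userId":"123456"}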

Data deletion:

The routing mechanism for deletes is the same as for writes.
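For example, a document written with a routing value must also be deleted with that same routing value (a sketch):

DELETE company-staff-001/_doc/1?routing=1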


 
[Figure: delete internal mechanism]
DELETE company-staff-001/_doc/1
# e.g. if _version was 1 when the document was inserted, _version becomes 2 after the delete. Reason: the
# version is updated when the document is marked as deleted. By default, edited and deleted versions are kept for 24 hours
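A trimmed delete response illustrating the version bump (a sketch; assumes the document was at _version 1 before the delete):

{
  "_index": "company-staff-001",
  "_id": "1",
  "_version": 2,
  "result": "deleted"
}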

Conditional deletion (_delete_by_query):

POST kibana_sample_data_logs_delete/_delete_by_query?requests_per_second=100   # throttled (value is illustrative)
{
  "slice":{    # manual slicing
    "id":1,    # run the request once per slice, changing the id each time (0, then 1)
    "max":2    # split the deletion into two slices
  },
  "query": {
    "match_all": {}
  }
}
POST kibana_sample_data_logs_delete/_delete_by_query?slices=2&scroll_size=1000&requests_per_second=100
      # automatic slicing; the number of slices should not exceed the index's number of shards
{
  "query": {
    "match_all": {}
  }
}

requests_per_second: how many documents are deleted per second (throttling)
Note: you should generally set requests_per_second to throttle the deletion; without it, a conditional delete over massive data can run for a very long time and remove a huge amount of data, making it a dangerous operation
scroll_size: how many documents ES pulls from the index into the buffer per scroll batch

GET _cat/tasks  # list all tasks
GET _tasks?detailed=true&actions=*/delete/byquery  # inspect the delete-by-query task above
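If a throttled delete runs too long, the task can be cancelled or re-throttled on the fly (a sketch; the task id is hypothetical and comes from the _tasks output above):

POST _tasks/oTUltX4IQMOUUVeiohTt8A:12345/_cancel    # cancel the delete-by-query task
POST _delete_by_query/oTUltX4IQMOUUVeiohTt8A:12345/_rethrottle?requests_per_second=-1    # -1 removes the throttle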

Batch delete:

POST _bulk?refresh=true  # force a refresh so the deletes are visible immediately
{"delete":{"_index":"company-staff-001", "_id":"1"}}
{"delete":{"_index":"company-staff-001", "_id":"2"}}
{"delete":{"_index":"company-staff-001", "_id":"3"}}

You can use Logstash to bulk-import data from a MySQL database into ES; under the hood this uses the _bulk API.
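A minimal Logstash pipeline sketch for this (the connection string, credentials, SQL statement, and id column are all placeholders):

input {
  jdbc {
    # jdbc_driver_library => "/path/to/mysql-connector.jar" may also be required
    jdbc_driver_class => "com.mysql.cj.jdbc.Driver"
    jdbc_connection_string => "jdbc:mysql://localhost:3306/mydb"
    jdbc_user => "root"
    jdbc_password => "secret"
    statement => "SELECT * FROM staff"
  }
}
output {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "company-staff-001"
    document_id => "%{id}"    # reuse the table's primary key as the ES _id
  }
}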

Food for thought: for bulk deletes, when the data volume is too large it is often better to delete the index outright and re-import the data.

Data update:

1. Full update

# a second write to the same ID fully replaces the document
POST company-staff-001/_doc/1
{
  ...
}

2. Partial update

POST company-staff-001/_update/1
{
  "doc":{
    "companyName":"xx company"
  },
  # important: when false, updating a nonexistent ID reports an error; when true,
  # a nonexistent ID is created and the record inserted (upsert)
  "doc_as_upsert":true
}

3. Script update

Update only the companyName field of the record:
POST company-staff-001/_update/1?refresh=true
{
  "script":{
    "source":"""
    ctx._source.companyName="aaa company";
    """,
    "lang":"painless",
    "params":{}
  }
}
or
POST company-staff-001/_update/1?refresh=true
{
  "script":{
    "source":"""
    ctx._source.companyName=params.cName;
    """,
    "lang":"painless",
    "params":{
      "cName":"aaa company"
    }
  }
}
or, with an upsert (upsert is used together with the script: if ID 1 does not exist, the upsert body is inserted instead of running the script)
POST company-staff-001/_update/1?refresh=true
{
  "script":{
    "source":"ctx._source.companyName=params.cName",
    "lang":"painless",
    "params":{ "cName":"aaa company" }
  },
  "upsert":{
    "companyName":"aaa company"
  }
}

Batch update (_bulk):

POST _bulk?refresh=true
{"update":{"_index":"company-staff-001","_id":"1"}}
{"doc":{"companyId":"2020","userId":"1234"}}
{"update":{"_index":"company-staff-001","_id":"2"}}
{"doc":{"companyId":"2021","userId":"12345"}}
{"update":{"_index":"company-staff-001","_id":"3"}}
{"doc":{"companyId":"2022","userId":"123456"}}

"result":"noop" means that the data before and after follow-up is the same, and ES does not do the operation, (es will check first)
"_version" when the update is unsuccessful, the version number will still increase by 1. To specify the version number, you can set it externally, but the version number must be larger than the current version number, otherwise an error will be reported

Conditional update:

"_update_by_query" is similar to conditional deletion



Author: struggling leek Wang
Link: https://www.jianshu.com/p/d9e5451456e6
Source: Jianshu
The copyright belongs to the author. For commercial reprint, please contact the author for authorization, and for non-commercial reprint, please indicate the source.
