Elasticsearch Source Code Analysis and Optimization Practice, Chapter 13: Snapshot Module Analysis

Introduction

The snapshot module is ES's main mechanism for backing up and migrating data. It supports incremental backups and several types of repository storage. In this chapter we first look at how to use snapshots and some of their finer points, then analyze how snapshot creation, deletion, and cancellation are implemented.

A repository is where snapshots are stored. Shared file systems (for example, NFS) are supported out of the box, while HDFS, Amazon S3, Microsoft Azure, and Google Cloud Storage (GCS) are supported through plugins.

In terms of cross-version support, snapshots can be restored as long as the restore does not span more than one major version:

  • Snapshots created in version 6.x can be restored to version 6.x;
  • Snapshots created in version 2.x can be restored to version 5.x;
  • Snapshots created in version 1.x can be restored to version 2.x.

Conversely, snapshots created in 1.x cannot be restored to 5.x or 6.x, and snapshots created in 2.x cannot be restored to 6.x. Before upgrading a cluster it is recommended to back up the data with a snapshot. For data migration across major versions, consider using the reindex API.
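For reference, a cross-version migration with the reindex API typically uses reindex-from-remote. The sketch below assumes the old cluster is reachable at old-cluster:9200 and that this address has been added to reindex.remote.whitelist on the new cluster:

curl -X POST "localhost:9200/_reindex" -H 'Content-Type: application/json' -d'
{
    "source": {
        "remote": {
            "host": "http://old-cluster:9200"
        },
        "index": "index_1"
    },
    "dest": {
        "index": "index_1"
    }
}'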

When you need to migrate data, you can restore a snapshot into a different cluster. Snapshots back up not only indices but also templates. The target cluster does not need the same number of nodes, as long as its storage space is large enough to hold the data.

To use snapshots, first register a repository; snapshots are stored in repositories.

Repository

A repository stores the snapshots you create. It is recommended to create a separate snapshot repository for each major version. If several clusters register the same snapshot repository, only one of them should write to it; every other cluster connected to the repository should set it to read-only mode.
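For example, a cluster that should only read from a shared repository could register it with the readonly setting (a minimal sketch; registration details are explained below):

curl -X PUT "localhost:9200/_snapshot/my_backup" -H 'Content-Type: application/json' -d'
{
    "type": "fs",
    "settings": {
        "location": "/mnt/my_backup",
        "readonly": true
    }
}'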

Register a repository with the following command:

curl -X PUT " localhost: 9200/_ snapshot/my_ backup" -H ' Content-Type: application/json' -d'
{
    "type": "fs",
    "settings": {
        "location": "/mnt/my_backup"
    }
}

In this example, the registered repository is named my_backup and its type is fs, which means a shared file system repository. The settings supported by the shared file system repository are listed in the following table.

| Parameter | Description |
| --- | --- |
| location | The mount point (path) where snapshot data is stored |
| compress | Whether to enable compression. Only metadata (mappings and settings) is compressed, not data files. Defaults to true |
| chunk_size | Files are broken into chunks during transfer; this sets the chunk size in bytes. Defaults to null (unlimited chunk size) |
| max_snapshot_bytes_per_sec | Per-node throttle applied while taking a snapshot, 40mb by default |
| max_restore_bytes_per_sec | Per-node throttle applied while restoring a snapshot, 40mb by default |
| readonly | Makes the repository read-only; defaults to false |
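
As an illustration, a shared file system repository with compression, chunking, and throttling tuned might be registered like this (a sketch; the values shown are illustrative):

curl -X PUT "localhost:9200/_snapshot/my_backup" -H 'Content-Type: application/json' -d'
{
    "type": "fs",
    "settings": {
        "location": "/mnt/my_backup",
        "compress": true,
        "chunk_size": "1g",
        "max_snapshot_bytes_per_sec": "80mb",
        "max_restore_bytes_per_sec": "80mb"
    }
}'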

To obtain the configuration of a repository, use the following API:

curl -X GET "localhost:9200/_snapshot/my_backup"

The returned information is as follows:

{
    "my_backup": {
        "type": "fs",
        "settings": {
            "location": "/mnt/my_backup"
        }
    }
}

To get information about multiple repositories, specify a comma-separated list of repository names; the "*" wildcard is also supported in repository names. For example:

curl -X GET "localhost:9200/_snapshot/repo*,*backup*"

To get information about all registered repositories, omit the repository name or use _all:

curl -X GET "localhost:9200/_snapshot"
or
curl -X GET "localhost:9200/_snapshot/_all"

Delete a snapshot from a repository with the following command:

curl -X DELETE "localhost:9200/_snapshot/my_backup/snapshot_1"

Delete (unregister) an entire repository with the following command:

curl -X DELETE "localhost:9200/_snapshot/my_backup"

When a repository is deleted, ES only removes the reference to the location where the snapshots are stored; the snapshots themselves are not deleted.

Shared file system

When using a shared file system, the same shared storage must be mounted at the same mount point (path) on every node of the cluster, including all data nodes and master nodes. The mount point is then configured in the path.repo setting of elasticsearch.yml. For example, if the mount point is /mnt/my_backup, the following should be added to elasticsearch.yml:

path.repo: ["/mnt/my_backups"]

path.repo accepts multiple values in the form of an array. Configuring multiple values does not mean they are used together the way path.data is; instead, you should register a separate repository for each mount point. For example, if one mount point does not have enough space to hold all of the cluster's data, you can use several mount points, register a repository on each, and snapshot different data to different repositories.
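A minimal sketch of that setup (the mount points and repository names below are illustrative):

path.repo: ["/mnt/my_backup_1", "/mnt/my_backup_2"]

curl -X PUT "localhost:9200/_snapshot/backup_1" -H 'Content-Type: application/json' -d'
{ "type": "fs", "settings": { "location": "/mnt/my_backup_1" } }'

curl -X PUT "localhost:9200/_snapshot/backup_2" -H 'Content-Type: application/json' -d'
{ "type": "fs", "settings": { "location": "/mnt/my_backup_2" } }'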

path.repo also supports Microsoft UNC paths, configured in the following format:

path.repo: ["MY_SERVERSnapshots"]

After the configuration is complete, all nodes must be restarted for it to take effect. You can then register the repository through the repository API and run snapshots.

The advantage of shared storage is good cross-version compatibility, which makes it suitable for data migration. The disadvantage is that storage space tends to be limited. If HDFS is used, you are restricted to the HDFS versions supported by the plugin: the plugin version must match the ES version, and each plugin bundles a fixed version of the HDFS client, which can only write to compatible versions of HDFS clusters.
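For reference, registering an HDFS repository with the repository-hdfs plugin looks roughly like this (a sketch; the namenode address and path are assumptions):

curl -X PUT "localhost:9200/_snapshot/my_hdfs_backup" -H 'Content-Type: application/json' -d'
{
    "type": "hdfs",
    "settings": {
        "uri": "hdfs://namenode:8020/",
        "path": "elasticsearch/repositories/my_hdfs_backup"
    }
}'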

Snapshot

Create Snapshot

A repository can contain multiple snapshots of the same cluster, each identified by a unique name. Use the following command to create a snapshot named snapshot_1 of all indices in the my_backup repository:

curl -X PUT "localhost:9200/_snapshot/my_backup/snapshot_1?wait_for_completion=true"

The **wait_for_completion** parameter is optional. By default the snapshot command returns immediately and the task runs in the background. If you want the API to return only after the task has completed, set wait_for_completion to true; the default is false.

The command above creates a snapshot of all indices that are in the open state. To snapshot only some indices, specify them in the indices parameter of the request:

curl -X PUT "localhost:9200/_snapshot/my_backup/snapshot_2?wait_for_completion=true" -H 'Content-Type: application/json' -d'
{
    "indices": "index_1, index_2",
    "ignore_unavailable": true,
    "include_global_state": true
}

The indices field supports multi-index syntax, such as index_*. For the complete syntax, see: https://www.elastic.co/guide/en/elasticsearch/reference/current/multi-index.html

Two other parameters:

  • **ignore_unavailable**: skip indices that do not exist. The default is false, so by default the snapshot fails if a requested index does not exist.
  • **include_global_state**: whether to snapshot the cluster state; the default here is false. Note that cluster settings and templates are saved in the cluster state, so they are not snapshotted by default, but generally we want to save this information together with the data.

The snapshot operation is performed on the primary shards. While a snapshot is running, normal reads and writes to the cluster are not affected. Before the snapshot starts, a flush is executed to persist the data buffered in operating system memory to disk, so the snapshot captures the Lucene data that is on disk at the moment the snapshot successfully starts, excluding anything written afterwards. Each snapshot is incremental: the next snapshot copies only the new content.

You can start a snapshot at any time, regardless of whether the cluster health is Green, Yellow, or Red. While a snapshot is running, the shards being snapshotted cannot be moved to another node, which may interfere with rebalancing and allocation filtering; such shard migration can only happen once the snapshot completes.

After the snapshot starts, you can use the snapshot information API and status API to monitor the progress.

Get snapshot information

After the snapshot starts, use the following API to obtain the snapshot information:

curl -X GET "localhost:9200/_snapshot/my_backup/snapshot_1"

The summary of the returned information is as follows:

{
    "snapshots": [
        {
            "snapshot": "snapshot_1",
            "version": "6.1.2",
            "indices": [
                "website"
            ],
            "state": "SUCCESS",
            "start_time": "2018-05-15T03:40:06.571Z",
            "end_time": "2018-05-15T07:53:40.977Z",
            "duration_in_millis": 15214406,
            "failures": [],
            "shards": {
                "total": 6,
                "failed": 0,
                "successful": 6
            }
        }
    ]
}

It mainly contains basic information such as the start and end time, the cluster version, the current stage, and success or failure counts. A snapshot goes through the stages shown in the following table.

| Stage | Description |
| --- | --- |
| IN_PROGRESS | The snapshot is running |
| SUCCESS | The snapshot completed and all shards were stored successfully |
| FAILED | The snapshot failed and no data was stored |
| PARTIAL | The global cluster state was saved, but the data of at least one shard was not stored successfully. The failures field of the response contains details about the shards that were not handled correctly |
| INCOMPATIBLE | The snapshot is incompatible with the current cluster version |

Use the following command to get information about multiple snapshots:

curl -X GET "localhost: 9200/_snapshot/my_backup/snapshot_*, other_snapshot"

And to get information about all snapshots in a given repository:

curl -X GET "localhost:9200/_snapshot/my_backup/_all"

If the command fails because some snapshots are unavailable, you can set the boolean parameter ignore_unavailable so that all currently available snapshots are returned.
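For example (a sketch using the repository from the earlier examples):

curl -X GET "localhost:9200/_snapshot/my_backup/_all?ignore_unavailable=true"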

You can query running snapshots with the following command:

curl -X GET "localhost:9200/_snapshot/my_backup/_current"

Snapshot status

The _status API returns detailed information about snapshots.

You can use the following command to query the detailed status information of all currently running snapshots:

curl -X GET "localhost:9200/_snapshot/_status"

The information returned is summarized as follows:

"stats": {
    "number_of_files": 31,
    "processed_files": 31,
    "total_size_in_bytes": 33802,
    "processed_size_in_bytes": 33802,
    "start_time_in_millis": 1526355676967,
    "time_in_millis": 15144003
}

It mainly shows progress information such as the number of files and bytes processed; progress is not reported as a percentage.

Use the following command to return information about all snapshots running in a specific repository:

curl -X GET "localhost:9200/_snapshot/my_backup/_status"

If both a repository name and snapshot IDs are specified, the command returns the status of the specified snapshots even if they have already completed:

curl -X GET "localhost:9200/_snapshot/my_backup/snapshot_1, snapshot_2/status"

Canceling and deleting snapshot and restore operations

By design, only one snapshot or restore operation can run at any point in time. To terminate a snapshot that is in progress, use the delete snapshot command. The delete operation checks whether the snapshot is currently running: if it is, the snapshot is stopped first and its data is then deleted from the repository; if the snapshot has already completed, its data is deleted from the repository directly.

curl -x DELETE "localhost:9200/_snapshot/my_backup/snapshot_1"

The restore operation uses the standard shard recovery mechanism, so a running restore can be cancelled by deleting the indices that are being restored. Note that all data of those indices will be deleted.
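For example, if the index being restored is restored_index_1 (an illustrative name taken from the rename example below), the restore can be cancelled with:

curl -X DELETE "localhost:9200/restored_index_1"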

Restore from snapshot

To restore a snapshot, the target indices must be closed.
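For example, an existing index with the same name can be closed before the restore (index_1 follows the earlier snapshot examples):

curl -X POST "localhost:9200/index_1/_close"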

Restore a snapshot using the following command:

curl -X POST "localhost:9200/snapshot/my_backup/snapshot_1/_restore'

**By default, all indices in the snapshot are restored, but the cluster state is not. Both can be adjusted with parameters: you can selectively restore some of the indices and restore the global cluster state. The index list supports multi-index syntax.** For example:

curl -X POST " localhost:9200/snapshot/mybackup/snapshot_1/_restore" -H 'Content-Type: application/json' -d'
{
    "indices": "index_1, index_2" ,
    "ignore_unavailable": true,
    "include_global_state": true,
    "rename_pattern": "index_(.+)",
    "rename_replacement": "restored_index_$1"
}

The specific parameters are shown in the table below.

| Parameter | Description |
| --- | --- |
| indices | List of indices to restore; supports multi-index syntax |
| ignore_unavailable | Same meaning as when creating a snapshot |
| include_global_state | Whether to restore the cluster state. Defaults to false. When set to true, templates in the snapshot are restored, and a template with the same name in the current cluster is overwritten. Persistent settings are restored as well |
| rename_pattern | See rename_replacement |
| rename_replacement | Used together with rename_pattern to rename restored indices via a regular expression |
| partial | Whether to allow a partial restore when errors are encountered. Defaults to false |

After the restore completes, indices and templates in the current cluster that have the same names as those in the snapshot are overwritten; indices, aliases, and templates that exist in the cluster but not in the snapshot are not deleted. In other words, a restore does not synchronize the cluster to be exactly consistent with the snapshot.

Partial recovery

By default, the whole restore fails if one or more of the indices being restored do not have all of their shards available in the snapshot, which can happen when some shards failed to back up while the snapshot was being created. You can restore as much as possible by setting the partial parameter to true. However, only the shards that were backed up successfully are restored; an empty shard is created for each missing one.

Changing index settings during recovery

**Most index settings can be overridden during the restore.** For example, the following request restores index_1 without creating replicas and uses the default index refresh interval:

curl -X POST "localhost:9200/_snapshot/my_backup/snapshot_1/_restore" -H 'Content-Type: application/json' -d'
    "indices": "index_ 1",
    "index settings": {
        " index.number_of_replicas": 0
    }
    "ignore_index_settings": [
        "index.refresh_interval"
    ]
}

Some settings cannot be modified during recovery, such as index.number_of_shards.

Monitor recovery progress

The restore process is based on ES's standard recovery mechanism, so the standard recovery monitoring services can be used to watch its progress. While a restore is executing, the cluster usually goes into the Red state, because the restore starts with the primary shards of the indices, and during that period the primary shards are unavailable. Once the primary shards are recovered, the cluster turns Yellow and the required number of replica shards is created; once all necessary replicas exist, the cluster turns Green.

Checking the cluster health only gives a high-level view of the restore. You can also use the indices recovery API and the cat recovery API to get more detailed information about the recovery process and the current state of the indices.
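For example (a sketch; restored_index_1 is an illustrative index name):

curl -X GET "localhost:9200/restored_index_1/_recovery"
curl -X GET "localhost:9200/_cat/recovery?v"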

How to create a snapshot

For the implementation of snapshots we focus on several questions: how is a snapshot implemented? How is the incremental process realized? Why does deleting an old snapshot not affect other snapshots?

ES snapshots are built on Lucene snapshots, but the snapshot concept in Lucene differs from that in ES. A Lucene snapshot is a snapshot of the last commit point: it contains the information of that commit point and the complete list of segment files, so it is effectively a full snapshot of the data that has been flushed to disk. Note that Lucene has no notion of an incremental snapshot; each Lucene snapshot covers the entire Lucene index and represents its latest state. It is called a snapshot because the physical files it references are guaranteed not to be deleted from the moment it is created. In Lucene, snapshots are implemented with SnapshotDeletionPolicy, which has been supported since Lucene 2.3.

You may remember that during the recovery of a replica shard, a Lucene snapshot of the primary shard is also created before the data files are copied.

So to sum up:

  • The Lucene snapshot produces the list of segment files that have been flushed to disk and guarantees that these files are not deleted. This file list is what ES copies.
  • ES is responsible for copying the data, managing repositories, implementing incremental backups, and deleting snapshots.

The overall process of creating a snapshot is shown in the following figure.

Creating a snapshot in ES involves three types of nodes:

  • Coordinating node: receives the client request and forwards it to the master node.
  • Master node: broadcasts the snapshot-creation request in the cluster state; the data nodes copy the data after receiving it. The master node is also responsible for writing cluster state data into the repository.
  • Data node: responsible for copying Lucene files to the repository and, after copying, cleaning up files in the repository that are no longer referenced by any snapshot. Since the data is distributed across the data nodes, the copy must be performed by them; each data node copies the locally stored primary shards listed in the snapshot request to the repository.

The snapshot process copies Lucene physical files. A Lucene index consists of files of many different types; for a complete description, refer to Lucene's official manual. The documentation for the current version is at: http://lucene.apache.org/core/7_3_1/index.html.

If a data node terminates abnormally while a snapshot is executing, for example because of an I/O error, the process being killed, or a server power failure, the snapshot on that node has not succeeded, and when the node restarts the previous copy process is not resumed. For the snapshot as a whole, the final result is partial success and partial failure; the failed nodes and shards and the related errors are recorded in the snapshot information.

Introduction to Lucene file format

1. Definitions

The basic concepts in Lucene are index, document, field, and term. An index contains a sequence of documents:

  • A document is a sequence of fields.
  • A field is a named sequence of terms.
  • A term is a sequence of bytes.

2. Segments

A Lucene index may consist of multiple segments. Each segment is a fully independent index that can be searched on its own. New segments are produced in two situations:

  • A refresh operation produces a Lucene segment: a new segment is created for newly added documents.
  • Existing segments are merged into a new segment.

A search on a Lucene index has to search all of its segments.

3. File naming rules

All files belonging to a segment share the same name with different extensions. When the compound file format is used (the default), files other than .si, write.lock, and .del are merged into a single .cfs file.

File names are never reused: any file saved to the directory gets a unique name. This is achieved with a simple generation counter; for example, the first segments file is named segments_1, the next segments_2, and so on.

4. File extension summary

The following table summarizes the names and extensions of files in Lucene.

Let's now analyze the execution process on each of the three types of nodes.

Coordinating node process

The coordinating node is responsible for parsing the request and forwarding it to the master node.

Processing thread: http_server_worker.

The REST action registered on the coordinating node is create_snapshot_action, and the corresponding handler is the RestCreateSnapshotAction class. When the coordinating node receives a client request, it processes it in BaseRestHandler#handleRequest, calls RestCreateSnapshotAction#prepareRequest to parse the REST request and wrap it in a CreateSnapshotRequest structure, and then sends the request to the master node.

public RestChannelConsumer prepareRequest(final RestRequest request, final NodeClient client) throws IOException {
    // Wrap the request in a CreateSnapshotRequest structure
    CreateSnapshotRequest createSnapshotRequest = createSnapshotRequest(request.param("repository"), request.param("snapshot"));
    request.applyContentParser(p -> createSnapshotRequest.source(p.mapOrdered()));
    // Set the master timeout and the wait_for_completion parameter
    createSnapshotRequest.masterNodeTimeout(request.paramAsTime("master_timeout", createSnapshotRequest.masterNodeTimeout()));
    createSnapshotRequest.waitForCompletion(request.paramAsBoolean("wait_for_completion", false));
    return channel -> client.admin().cluster().createSnapshot(createSnapshotRequest, new RestToXContentListener<>(channel));
}

In the TransportMasterNodeAction.AsyncSingleAction#doStart method, the node checks whether it is the elected master. If it is, processing is handed to the snapshot thread pool; otherwise an RPC request with the action cluster:admin/snapshot/create is sent to the master node, carrying the assembled CreateSnapshotRequest structure.

The code summary is as follows:

protected void doStart(ClusterState clusterState) {
    if (nodes.isLocalNodeElectedMaster() || localExecute(request)) {
        // The local node is the master; process the request in the new thread pool
        threadPool.executor(executor).execute(new ActionRunnable(delegate) {
            protected void doRun() throws Exception {
                masterOperation(task, request, clusterState, delegate);
            }
        });
    } else { // Forward to the master node; request is the assembled CreateSnapshotRequest structure
        transportService.sendRequest(masterNode, actionName, request,
            new ActionListenerResponseHandler<Response>(listener, TransportMasterNodeAction.this::newResponse));
    }
}

From the implementation point of view, both the coordinating node and the master node execute the TransportMasterNodeAction.AsyncSingleAction#doStart method, but through different call chains.

Master node process

The master node's main job is to convert the request into the internal data structures it needs and submit a cluster task. The cluster state produced by that task contains the information of the requested snapshot. The master node broadcasts the newly generated cluster state, and after receiving it the data nodes snapshot the corresponding actual data.

Thread pools involved in this process: http_server_worker -> snapshot -> masterService#updateTask.

As described in the previous section, the request the master node receives from the coordinating node is also processed in the TransportMasterNodeAction.AsyncSingleAction#doStart method. TransportCreateSnapshotAction#masterOperation executes in the snapshot thread pool; it converts the received CreateSnapshotRequest into a SnapshotsService.SnapshotRequest structure and calls SnapshotsService#createSnapshot to submit a cluster task.

protected void masterOperation(...) {
    snapshotsService.createSnapshot(snapshotRequest, new SnapshotsService.CreateSnapshotListener() {
        public void onResponse() {
            // Register a listener if needed, and return the response to the client after the snapshot finishes
            if (request.waitForCompletion()) {
                snapshotsService.addListener(new SnapshotsService.SnapshotCompletionListener() {
                    public void onSnapshotCompletion(Snapshot snapshot, SnapshotInfo snapshotInfo) {
                        ......
                    }
                    public void onSnapshotFailure(Snapshot snapshot, Exception e) {
                        ......
                    }
                });
            } else { // Return the result to the client; the snapshot task runs in the background
                listener.onResponse(new CreateSnapshotResponse());
            }
        }
        // Handling of execution failure
        public void onFailure(Exception e) {
            listener.onFailure(e);
        }
    });
}

Ignoring less important details such as request validation, timeout handling, and failure handling, the main implementation of SnapshotsService#createSnapshot is extracted as follows:

public void createSnapshot(final SnapshotRequest request, final CreateSnapshotListener listener) {
    clusterService.submitStateUpdateTask(request.cause(), new ClusterStateUpdateTask() {
        // Define the task to execute
        public ClusterState execute(ClusterState currentState) {
            // Snapshot tasks cannot run in parallel; only one snapshot may execute at a time
            if (snapshots == null || snapshots.entries().isEmpty()) {
                // List of indices to snapshot
                List<IndexId> snapshotIndices = repositoryData.resolveNewIndices(indices);
                newSnapshot = new SnapshotsInProgress.Entry(new Snapshot(repositoryName, snapshotId),
                    request.includeGlobalState(),
                    request.partial(),
                    State.INIT, // The initial state is INIT
                    snapshotIndices,
                    System.currentTimeMillis(),
                    repositoryData.getGenId(),
                    null);
                snapshots = new SnapshotsInProgress(newSnapshot);
            } else {
                throw new ConcurrentSnapshotExecutionException(repositoryName, snapshotName, "a snapshot is already running");
            }
            // Build a new cluster state containing the snapshots information
            return ClusterState.builder(currentState).putCustom(SnapshotsInProgress.TYPE, snapshots).build();
        }
        // After the INIT cluster state has been processed successfully, publish the cluster state with state STARTED
        public void clusterStateProcessed(String source, ClusterState oldState, final ClusterState newState) {
            if (newSnapshot != null) {
                threadPool.executor(ThreadPool.Names.SNAPSHOT).execute(() ->
                    beginSnapshot(newState, newSnapshot, request.partial(), listener));
            }
        }
    });
}

The submitted task executes in the masterService#updateTask thread. The task verifies that the request is valid and that no other snapshot is in progress, then puts the snapshot request information into the cluster state and broadcasts it to all nodes of the cluster, which triggers the data nodes to process the actual data. The snapshot information lives in the custom field of the cluster state; its structure is shown in the figure below.

The master node controls how the data nodes execute the snapshot by broadcasting the snapshot command in the cluster state. During snapshot execution the master node proceeds in two steps and publishes the cluster state twice. In the first publication, the State in the snapshot entry is set to INIT, and the data nodes perform some initialization work. After the data nodes have processed that cluster state, the master node prepares to publish the second cluster state, which is built in the SnapshotsService#beginSnapshot method.

Before publishing the second cluster state, the master node first writes the global metadata and the per-index metadata into the repository.

public void initializeSnapshot(SnapshotId snapshotId, List<IndexId> indices, MetaData clusterMetaData) {
    try {
        // Write the global metadata
        globalMetaDataFormat.write(clusterMetaData, snapshotsBlobContainer, snapshotId.getUUID());
        // Write index-level metadata for each index in the snapshot
        for (IndexId index : indices) {
            final IndexMetaData indexMetaData = clusterMetaData.index(index.getName());
            final BlobPath indexPath = basePath().add("indices").add(index.getId());
            final BlobContainer indexMetaDataBlobContainer = blobStore().blobContainer(indexPath);
            indexMetaDataFormat.write(indexMetaData, indexMetaDataBlobContainer, snapshotId.getUUID());
        }
    } catch (IOException ex) {
        throw new SnapshotCreationException(metadata.name(), snapshotId, ex);
    }
}

In the new cluster state the State is set to STARTED, and the shard list is computed from the indices to be snapshotted (note that only primary shards are included). After receiving this cluster state, the data nodes really begin to execute the snapshot.

Data node process

The data node performs the actual snapshot: from the full list of shards to be snapshotted it picks the shards stored locally, creates Lucene snapshots for them, and copies the files.

1. Processing of ClusterState

The received cluster state is processed in the clusterApplierService#updateTask thread pool; once a snapshot is started, processing continues in the snapshot thread pool.

Data nodes handle cluster states published by the master node in a uniform way. In the ClusterApplierService#callClusterStateListeners method, all modules that need to process cluster state changes are registered in clusterStateListeners; when a cluster state is received, the list is traversed and each module's handler is invoked. The snapshot module handles this in the SnapshotShardsService#clusterChanged method. After some simple validation, SnapshotShardsService#processIndexShardSnapshots is called, which contains the main processing logic. In fact, the data node does not do anything meaningful when processing the first cluster state; processing the second cluster state is the core of the real snapshot. The second cluster state published by the master contains the list of shards to snapshot. After receiving it, the data node filters out the shards stored locally and builds a new list; the shards snapshotted afterwards come from this list.

for (ObjectObjectCursor<ShardId, ShardSnapshotStatus> shard : entry.shards()) {
    // Select the shards that this node should process
    if (localNodeId.equals(shard.value.nodeId())) {
        if (shard.value.state() == State.INIT && (snapshotShards == null || !snapshotShards.shards.containsKey(shard.key))) {
            logger.trace("[{}] - Adding shard to the queue", shard.key);
            startedShards.put(shard.key, new IndexShardSnapshotStatus());
        }
    }
}
// List of shards to be processed on this node
newSnapshots.put(entry.snapshot(), startedShards);

The list of local shards to process is then traversed, and the shards are snapshotted in parallel in the snapshot thread pool. The degree of parallelism depends on the number of threads in the snapshot thread pool, whose default maximum is min(5, (number of processors) / 2).
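If more or less concurrency is needed, the snapshot thread pool (a scaling pool) can be tuned in elasticsearch.yml; a sketch with illustrative values:

thread_pool.snapshot.core: 1
thread_pool.snapshot.max: 8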

After a shard has been processed, an RPC request is sent to the master node to update the snapshot status of that shard.

if (newSnapshots.isEmpty() == false) {
    Executor executor = threadPool.executor(ThreadPool.Names.SNAPSHOT);
    for (final Map.Entry<Snapshot, Map<ShardId, IndexShardSnapshotStatus>> entry : newSnapshots.entrySet()) {
        // Execute the shard-level snapshots concurrently
        for (final Map.Entry<ShardId, IndexShardSnapshotStatus> shardEntry : entry.getValue().entrySet()) {
            executor.execute(new AbstractRunnable() {
                public void doRun() {
                    // Take a snapshot of a specific shard
                    snapshot(indexShard, snapshot, indexId, shardEntry.getValue());
                }
                public void onAfter() {
                    // Send a request to the master node to update the snapshot status
                    final Exception exception = failure.get();
                    if (exception != null) {
                        final String failure = ExceptionsHelper.detailedMessage(exception);
                        notifyFailedSnapshotShard(snapshot, shardId, localNodeId, failure, masterNode);
                    } else {
                        notifySuccessfulSnapshotShard(snapshot, shardId, localNodeId, masterNode);
                    }
                }
            });
        }
    }
}

2. Snapshot implementation for a single shard

Now let's look at how a data node snapshots a single shard. All shards follow the same process, so this section describes the handling of one shard.

  • Lucene snapshot

Since ES snapshots are implemented on top of Lucene snapshots, we first introduce how Lucene snapshots work. A Lucene snapshot is created in the SnapshotDeletionPolicy#snapshot method, which returns a commit point. Through the commit point you can obtain the latest state of the shard, including the list of all Lucene segment files. From the moment the list is obtained until the commit point is released, the files in the list will not be deleted.

Let's take a look at the implementation of Lucene snapshot:

IndexCommitRef(SnapshotDeletionPolicy deletionPolicy) throws IOException {
    // Call the Lucene interface to create a snapshot and return the commit point
    indexCommit = deletionPolicy.snapshot();
    // Release the snapshot resources after processing is finished
    onClose = () -> deletionPolicy.release(indexCommit);
}

The deletionPolicy is initialized with a KeepOnlyLastCommitDeletionPolicy, so the snapshot captures the last commit, including all Lucene segments, and represents the latest on-disk state of the Lucene index.

this.deletionPolicy = new CombinedDeletionPolicy(
    new SnapshotDeletionPolicy(new KeepOnlyLastCommitDeletionPolicy()), ...);
  • ES snapshot overall process

The data node executes the SnapshotShardsService#snapshot method in the snapshot thread pool to create a snapshot of a specific shard.

After some validation, it calls the Lucene interface to create a snapshot and gets back an Engine.IndexCommitRef, which contains the all-important commit point. An ES snapshot is then created based on the Lucene commit point:

// Create a Lucene snapshot. A flush is executed before the snapshot, but no explicit refresh
// (a refresh happens once as part of the flush)
try (Engine.IndexCommitRef snapshotRef = indexShard.acquireIndexCommit(true)) {
    // Create the ES snapshot based on the Lucene commit point
    repository.snapshotShard(indexShard, snapshot.getSnapshotId(), indexId, snapshotRef.getIndexCommit(), snapshotStatus);
}

We analyze the whole process with an example. To keep the scenario simple, we create an index named website with a single primary shard on a single-node cluster, write some data, perform two commits, and create an ES snapshot after each commit. When the second snapshot, snapshot_2, is executed, the data files on the ES data disk are as follows:


You can see that the Lucene index has two segments, _0 and _1. Segment 0 consists of three files: _0.cfe, _0.cfs, and _0.si. After the Lucene snapshot is obtained, the returned snapshotRef contains the commit point (indexCommit); its list of Lucene files is shown in the following figure.

The files list contains the files of both Lucene segments. Next, the Lucene commit point is processed.

The core of the ES snapshot process is to create a snapshot based on this commit point; it is encapsulated in the BlobStoreRepository.SnapshotContext#snapshot method. The main steps are described below.

  • Two lists are computed from the Lucene commit point

The current Lucene commit point represents the latest state of the shard and references all Lucene segments. If incremental backup were not a concern, the whole file list could simply be copied to the repository. But what we want is for each snapshot to be incremental.

The implementation method is to calculate two lists:

(1) The list of new files, i.e., the files that need to be copied to the repository. The file list of the Lucene commit point is traversed, files that already exist in the repository are filtered out, and what remains is the list of new files.

(2) The list of all files used by the current snapshot. It is later used to find all files related to a given snapshot. Its content is the complete file list of the Lucene commit point.

The process of calculating the two file lists is as follows:

fileNames = snapshotIndexCommit.getFileNames();
// Traverse the file list of the Lucene commit point
for (String fileName : fileNames) {
    // Check whether the snapshot needs to be aborted
    if (snapshotStatus.aborted()) {
        throw new IndexShardSnapshotFailedException(shardId, "Aborted");
    }
    BlobStoreIndexShardSnapshot.FileInfo existingFileInfo = null;
    // Check whether the file already exists in the repository. Because Lucene files are immutable,
    // a file that already exists under the same name is the same file; the checksum is not verified here
    List<BlobStoreIndexShardSnapshot.FileInfo> filesInfo = snapshots.findPhysicalIndexFiles(fileName);
    if (filesInfo != null) {
        for (BlobStoreIndexShardSnapshot.FileInfo fileInfo : filesInfo) {
            if (fileInfo.isSame(md) && snapshotFileExistsInBlobs(fileInfo, blobs)) {
                // Already exists in the repository
                existingFileInfo = fileInfo;
                break;
            }
        }
    }
    if (existingFileInfo == null) {
        BlobStoreIndexShardSnapshot.FileInfo snapshotFileInfo = new BlobStoreIndexShardSnapshot.FileInfo(...);
        indexCommitPointFiles.add(snapshotFileInfo); // Add to the list of all files of this snapshot
        filesToSnapshot.add(snapshotFileInfo);       // Add to the list of new files
    } else {
        indexCommitPointFiles.add(existingFileInfo); // Add to the list of all files of this snapshot
    }
}

Here we need to mention the "immutability" of Lucene files. Except for write.lock and segments.gen, files are written once and never updated. The lock file write.lock does not need to be copied, and segments.gen only exists in older Lucene versions, so we will not discuss it. Therefore all files that need to be copied are immutable and there is no need to worry about them being updated, which means that in an incremental backup a file can be uniquely identified by its name. However, when storing files in the repository, ES renames them using an increasing sequence number and maintains the mapping between the two names.

In this example, the calculated results of the two lists are shown in the figure below.

**The list of new files, filesToSnapshot, contains only the files added between the previous snapshot and the current one.** Taking the first entry as an example, __4 is the target file name used in the repository and _1.cfs is the source file name.

The complete file list, indexCommitPointFiles, contains all Lucene segments of the shard plus the last commit point file (segments_3), as shown in the following figure.

  • Copy Lucene physical files

Before starting replication, the snapshot task is set to the STARTED phase.

Now the new files are copied: the list of new files is traversed and each file is copied to the repository:

for (BlobStoreIndexShardSnapshot.FileInfo snapshotFileInfo : filesToSnapshot) {
    snapshotFile(snapshotFileInfo);
}

Rate limiting is applied during replication, and a checksum is computed. The metadata of every Lucene file contains a precomputed checksum; while the data is being copied, the checksum is computed on the fly, and after copying it is compared with the stored one to verify that the copy is correct. The checksum matters: when backing up data by hand we usually have no way of knowing whether the copied data is correct. The checksum is an integer addition over the data and does not consume much CPU.

private void snapshotFile(final BlobStoreIndexShardSnapshot.FileInfo fileInfo) throws IOException {
    final String file = fileInfo.physicalName();
    try (IndexInput indexInput = store.openVerifyingInput(file, IOContext.READONCE, fileInfo.metadata())) {
        for (int i = 0; i < fileInfo.numberOfParts(); i++) {
            InputStream inputStream = inputStreamIndexInput;
            // Initialize the rate-limiting module
            if (snapshotRateLimiter != null) {
                inputStream = new RateLimitingInputStream(inputStreamIndexInput, snapshotRateLimiter, snapshotRateLimitingTimeInNanos::inc);
            }
            inputStream = new AbortableInputStream(inputStream, fileInfo.physicalName());
            // Copy the file
            blobContainer.writeBlob(fileInfo.partName(i), inputStream, partBytes);
        }
        // Verify the checksum: compare the checksum stored in the Lucene metadata with the one
        // computed while copying the data to the destination
        Store.verify(indexInput);
        snapshotStatus.addProcessedFile(fileInfo.length());
    }
}

blobContainer.writeBlob, which copies the file, is an abstract method with different implementations for different repository file systems. For the shared file system (fs) repository, the copy is implemented with Streams.copy, and IOUtils.fsync is executed after the copy completes.

  • Generate snapshot file

This file is the snapshot's description: it describes the snapshot itself and the Lucene files it references.

BlobStoreIndexShardSnapshot snapshot = new BlobStoreIndexShardSnapshot(snapshotId.getName(), snapshotIndexCommit.getGeneration(),
    indexCommitPointFiles, snapshotStatus.startTime(), System.currentTimeMillis() - snapshotStatus.startTime(),
    indexNumberOfFiles, indexTotalFilesSize);
// Write the snapshot file, which lists all Lucene segment files used by this snapshot
indexShardSnapshotFormat.write(snapshot, blobContainer, snapshotId.getUUID());

In this example, the contents of the snapshot file are shown in the following figure.

Its file list represents the latest state of the shard and includes all Lucene segments of the shard.

  • Delete files not referenced by any snapshot

All snapshots in the repository, including the one just taken, are traversed, and files in the repository that are not associated with any snapshot are deleted. The BlobStoreRepository.Context#finalize method is responsible for finding and deleting these files.

protected void finalize(List<SnapshotFiles> snapshots, int fileListGeneration, Map<String, BlobMetaData> blobs) {
    BlobStoreIndexShardSnapshots newSnapshots = new BlobStoreIndexShardSnapshots(snapshots);
    // Delete the old index-* file of this shard; the index-* file is the index of the snapshot list.
    // blobName is a concrete file in the repository, for example __1, index-*, snap-*
    for (String blobName : blobs.keySet()) {
        if (indexShardSnapshotsFormat.isTempBlobName(blobName) || blobName.startsWith(SNAPSHOT_INDEX_PREFIX)) {
            blobContainer.deleteBlob(blobName);
        }
    }
    // Traverse the data files of this shard in the repository and delete those not used by any snapshot
    for (String blobName : blobs.keySet()) {
        // Delete unused data files; DATA_BLOB_PREFIX is the data blob prefix ("__")
        if (blobName.startsWith(DATA_BLOB_PREFIX)) {
            if (newSnapshots.findNameFile(BlobStoreIndexShardSnapshot.FileInfo.canonicalName(blobName)) == null) {
                blobContainer.deleteBlob(blobName);
            }
        }
    }
    // If all snapshots have been deleted, no index-* file is created
    if (snapshots.size() > 0) {
        // Write the new index-* file
        indexShardSnapshotsFormat.writeAtomic(newSnapshots, blobContainer, Integer.toString(fileListGeneration));
    }
}

The first parameter passed to finalize here is the list of all snapshots of the shard. The method then does three things:

  1. Delete the old index-* file;
  2. Delete files in the repository that are not used by any snapshot in the list passed as the first parameter, which here is all snapshots of the shard;
  3. Create a new index-* file.

After the finalize method is executed, the snapshot task is set to the DONE phase.

Think about it: what happens if the snapshot list passed to finalize is not the full set of snapshots but only part of them?

  • Review

Let's briefly review the two snapshots. When the first snapshot is created, the file list of snapshot_1 is shown in the following table.

When the second snapshot is created, the file list of snapshot_2 is shown in the left column of the following table. segments_1 has been deleted and no longer belongs to snapshot_2. Only the files shown in bold are new content that snapshot_2 has to copy.

After the second snapshot has finished, the file structure in the repository is as follows.

The repository contains the following types of files:

  • Data files whose names start with an underscore prefix are the renamed Lucene segment files.
  • snap-*.dat files are snapshot files; they describe the snapshot name, the data files related to the snapshot (the list of files used), and so on.
  • index-* files describe all snapshot information of the current shard; they are the index of the snapshot list.
  • meta-*.dat files hold the metadata of an index. There are also meta files in the root directory of the repository, where the global cluster metadata is stored.

-rw-rw-r--  1 avatar avatar   29 Aug  6 11:15 incompatible-snapshots
-rw-rw-r--  1 avatar avatar 3743 Aug 13 01:01 index-4
-rw-rw-r--  1 avatar avatar 2920 Aug 13 01:02 index-5
-rw-rw-r--  1 avatar avatar    8 Aug 13 01:02 index.latest
drwxrwxr-x 25 avatar avatar 4096 Aug 13 01:02 indices
-rw-rw-r--  1 avatar avatar  103 Aug  9 01:00 meta-GgFN1wAqQm65dDXZyEaJgg.dat
-rw-rw-r--  1 avatar avatar  103 Aug 13 01:00 meta-rPBlWtpkQgSp7eEAObC2UA.dat
-rw-rw-r--  1 avatar avatar  488 Aug  9 01:13 snap-GgFN1wAqQm65dDXZyEaJgg.dat
-rw-rw-r--  1 avatar avatar  493 Aug 13 01:01 snap-rPBlWtpkQgSp7eEAObC2UA.dat

How to delete a snapshot

The core idea of snapshot deletion in ES is: among the physical files referenced by the snapshot being deleted, delete those that are not used by any other snapshot. Each snapshot describes the list of files it uses in its own metadata file (snap-*), so no reference counting is needed: as long as a file to be deleted is not used by any other snapshot, it can be removed safely.

The snapshot deletion / cancellation process involves three types of nodes:

  • Coordinating node: receives the client request and forwards it to the master node.
  • Master node: broadcasts the request information related to deleting or canceling a snapshot in the cluster state. Deleting a snapshot and canceling a running one use the same request. The data nodes are responsible for canceling a snapshot task that is still running, while the master node is responsible for deleting snapshots that have already been created; in either case the cluster state is broadcast, and once it has been published the master node starts the deletion. This also explains why the master node needs access to the repository: there is no need for every data node to perform the deletion, since any node can see all the data in the repository and a single node is enough, so the deletion is performed by the master node.
  • Data node: responsible for canceling a running snapshot task.

Coordinating node process

The coordinating node's job is the same as for snapshot creation: it parses the request and forwards it to the master node.

Processing thread: http_server_worker.

The REST action for deleting a snapshot is delete_snapshot_action, and the registered handler is RestDeleteSnapshotAction. After the REST request to delete a snapshot is received, it is likewise processed in BaseRestHandler#handleRequest, which calls RestDeleteSnapshotAction#prepareRequest to parse the REST request, wraps it in a DeleteSnapshotRequest structure, and then sends the request to the master node.

Since the process is similar to that when creating a snapshot, we omit the reference to the relevant code.

Similarly, in the TransportMasterNodeAction.AsyncSingleAction#doStart method the node checks whether it is the elected master. If it is, the request is processed locally; otherwise it is forwarded. The request's action is cluster:admin/snapshot/delete.

Master node process

After receiving the request from the coordinating node, the master node submits a cluster task that puts the request information into a new cluster state and broadcasts it. When a data node receives it, it checks whether there is a running snapshot task that needs to be cancelled; if not, it does nothing. After the master node has published the cluster state successfully, it executes the snapshot deletion.

Thread pools involved in this process: http_server_worker -> generic -> masterService#updateTask -> snapshot.

The request the master node receives from the coordinating node is also processed in the TransportMasterNodeAction.AsyncSingleAction#doStart method. TransportDeleteSnapshotAction#masterOperation executes in the generic thread pool and then calls SnapshotsService#deleteSnapshot to submit a cluster task.

1. Submit cluster tasks

The delete-snapshot request information is put into the cluster state; once the cluster state has been published successfully, the snapshot deletion logic is executed.

private void deleteSnapshot(...) {
    // Submit the cluster task
    clusterService.submitStateUpdateTask("delete snapshot", new ClusterStateUpdateTask(priority) {
        public ClusterState execute(ClusterState currentState) throws Exception {
            SnapshotsInProgress snapshots = currentState.custom(SnapshotsInProgress.TYPE);
            SnapshotsInProgress.Entry snapshotEntry = snapshots != null ? snapshots.snapshot(snapshot) : null;
            if (snapshotEntry == null) {
                // The snapshot is not running: put the deletion request into the
                // SnapshotDeletionsInProgress structure of the custom field
                clusterStateBuilder.putCustom(SnapshotDeletionsInProgress.TYPE, deletionsInProgress);
            } else {
                // The snapshot is running: put the deletion request into the SnapshotsInProgress
                // structure of the custom field and set its state to ABORTED
                SnapshotsInProgress.Entry newSnapshot = new SnapshotsInProgress.Entry(snapshotEntry, State.ABORTED, shards);
                snapshots = new SnapshotsInProgress(newSnapshot);
                clusterStateBuilder.putCustom(SnapshotsInProgress.TYPE, snapshots);
            }
            return clusterStateBuilder.build();
        }
        // Handle a failure to publish the cluster state
        public void onFailure(String source, Exception e) {}
        // The cluster state has been published
        public void clusterStateProcessed(String source, ClusterState oldState, ClusterState newState) {
            if (waitForSnapshot) {
                // Handle the client waiting for completion
            } else {
                // Delete the snapshot files
                deleteSnapshotFromRepository(snapshot, listener, repositoryStateId);
            }
        }
    });
}

The master node determines whether the snapshot to be deleted is still in progress or already completed: a snapshot in progress is cancelled, while a completed snapshot is deleted, and the published cluster state differs between the two cases.

For the deletion case, the published cluster state content is shown in the following figure; the deletion request is placed in the SnapshotDeletionsInProgress field of custom.

For the cancellation case, the published cluster state content is shown in the following figure; the request is placed in the SnapshotsInProgress field of custom, with State set to ABORTED. When creating a snapshot the entry is also placed in SnapshotsInProgress; the difference is that its State is then STARTED.

Since the deletion is performed on the master node, we next look at the master node's snapshot deletion process.

2. Snapshot deletion

After the master node has published the cluster state, the clusterStateProcessed method handles the logic that follows a successful publication. It runs in the masterService#updateTask thread and calls the SnapshotsService#deleteSnapshotFromRepository method, which switches to the snapshot thread pool to perform the actual deletion.

private void deleteSnapshotFromRepository(...) {
    threadPool.executor(ThreadPool.Names.SNAPSHOT).execute(() -> {
        // Execute the deletion
        repository.deleteSnapshot(snapshot.getSnapshotId(), repositoryStateId);
        removeSnapshotDeletionFromClusterState(snapshot, null, listener);
    });
}

The content to be deleted includes metadata files and index shard data, possibly entire index directories, and the index file of the snapshot list (index-*) must be updated. The main deletion logic is as follows:

public void deleteSnapshot(SnapshotId snapshotId, long repositoryStateId) {
    MetaData metaData = readSnapshotMetaData(snapshotId, snapshot.version(), repositoryData.resolveIndices(indices), true);
    try {
        // Update the index file index-*: remove this snapshot from the snapshot list
        final RepositoryData updatedRepositoryData = repositoryData.removeSnapshot(snapshotId);
        writeIndexGen(updatedRepositoryData, repositoryStateId);
        // Delete the global snapshot file snap-* (in the repository root directory)
        deleteSnapshotBlobIgnoringErrors(snapshot, snapshotId.getUUID());
        // Delete the global metadata file meta-*.dat
        deleteGlobalMetaDataBlobIgnoringErrors(snapshot, snapshotId.getUUID());
        // Handle each index of the snapshot
        for (String index : indices) {
            // Delete the index-level metadata file meta-*.dat
            indexMetaDataFormat.delete(indexMetaDataBlobContainer, snapshotId.getUUID());
            if (metaData != null) {
                IndexMetaData indexMetaData = metaData.index(index);
                if (indexMetaData != null) {
                    for (int shardId = 0; shardId < indexMetaData.getNumberOfShards(); shardId++) {
                        // Delete the shard snapshot; this is where the actual Lucene files are deleted
                        delete(snapshotId, snapshot.version(), indexId, new ShardId(indexMetaData.getIndex(), shardId));
                    }
                }
            }
        }
        // Delete index directories that no longer exist in the repository
        for (final IndexId indexId : indicesToCleanUp) {
            indicesBlobContainer.deleteBlob(indexId.getId());
        }
    } catch (IOException ex) {
        // exception handling omitted
    }
}

Snapshots are created per shard, so how shard snapshots are deleted is the core of the process. Let's see how a shard snapshot is deleted:

public void delete() {
    Tuple<BlobStoreIndexShardSnapshots, Integer> tuple = buildBlobStoreIndexShardSnapshots(blobs);
    // List of snapshots that exist in the repository
    BlobStoreIndexShardSnapshots snapshots = tuple.v1();
    int fileListGeneration = tuple.v2();
    // Delete the shard-level snapshot file
    indexShardSnapshotFormat(version).delete(blobContainer, snapshotId.getUUID());
    // Build the list of snapshots to keep
    List<SnapshotFiles> newSnapshotsList = new ArrayList<>();
    for (SnapshotFiles point : snapshots) {
        if (!point.snapshot().equals(snapshotId.getName())) {
            newSnapshotsList.add(point);
        }
    }
    // Pass in the list of snapshots to keep, not the list to delete. Files that exist in the
    // repository but are not used by any snapshot in the list are deleted
    finalize(newSnapshotsList, fileListGeneration + 1, blobs);
}

This is the same finalize method that is called when creating a snapshot. A snapshot list is passed in; internally, the files in the repository are traversed and any file not referenced by that snapshot list is deleted. When creating a snapshot, the list of all snapshots is passed in; when deleting one, the list of snapshots to keep is passed in.

Based on the example from the section on creating snapshots, we now delete snapshot_2 and see which files are removed. Before snapshot_2 is deleted, the repository contains two snapshots: snapshot_1 and snapshot_2.

The files belonging to snapshot_1 are:
_0
_1
_2
_3

The files belonging to snapshot_2 are:
_0
_1
_2
_3
_4
_5
_6
_7

When snapshot_2 is deleted, the snapshot list passed to finalize contains only snapshot_1. The files finally deleted are _4, _5, _6, and _7, as shown in the figure below: the left column lists the files existing in the shard's repository, and the right column lists the files of snapshot_1.

3. Cancellation process on the data node

The snapshot cancellation request is placed in the SnapshotsInProgress field of custom with State set to ABORTED. The data node handles it in the SnapshotShardsService#processIndexShardSnapshots method, the same method that drives snapshot creation; whether to start a snapshot or cancel a running one is decided from the State.

private void processIndexShardSnapshots(ClusterChangedEvent event) {
    if (snapshotsInProgress != null) {
        for (SnapshotsInProgress.Entry entry : snapshotsInProgress.entries()) {
            if (entry.state() == State.STARTED) {
                // Handle snapshot creation
            } else if (entry.state() == State.ABORTED) {
                // Handle snapshot cancellation
                if (snapshotShards != null) {
                    for (ObjectObjectCursor<ShardId, ShardSnapshotStatus> shard : entry.shards()) {
                        if (snapshotStatus != null) {
                            switch (snapshotStatus.stage()) {
                                case INIT:
                                case STARTED:
                                    // Set the abort flag
                                    snapshotStatus.abort();
                                case ....
                            }
                        }
                    }
                }
            }
        }
    }
}

Canceling a snapshot with abort only sets an abort flag, which the running snapshot checks:

public void abort() {
    this.aborted = true;
}

While a snapshot is running, the abort flag is checked at several points:

  • when computing the list of Lucene files to copy;
  • before starting to copy a file;
  • while reading data during the copy.

Since a running snapshot spends most of its time copying data, cancellation usually interrupts it while data is being read.

public int read(byte[] b, int off, int len) throws IOException {
    checkAborted();
    return in.read(b, off, len);
}

private void checkAborted() {
    if (snapshotStatus.aborted()) {
        throw new IndexShardSnapshotFailedException(shardId, "Aborted");
    }
}

After a running snapshot is cancelled, the master node is responsible for cleaning up the partially copied snapshot data files. This happens in the snapshot deletion logic after the master node has successfully published the cluster state: for a cancelled snapshot, the normal snapshot deletion process is executed.

Thinking and summary

  • The master node controls the data nodes by broadcasting snapshot commands in the cluster state, and the data nodes actively report their status back to the master after executing the tasks.
  • ES configuration files do not take effect dynamically after being changed, although a REST interface is provided for settings that can be updated dynamically. The path.repo field must be written into the configuration file, so before migrating data you have to change the configuration and restart the cluster, which is not very convenient. Why not allow it in the REST request instead of requiring it in the configuration file?
  • The cluster's persistent settings and templates are saved in the cluster state, which by default is neither snapshotted nor restored. Note that index aliases are not part of the cluster state; alias information is saved with snapshots by default.
  • Lucene segment merging produces new content for incremental snapshots. When segment files are small, many small files may be generated on HDFS, so manually merging segments through the force_merge API (see the example after this list) also helps reduce the number of small files on HDFS.
  • A snapshot writes metadata at two levels: the cluster level and the index level.
  • Snapshots are independent of cluster health; you can snapshot some indices even when the cluster is Red.
  • A checksum is computed during data replication to guarantee the correctness of the copied data.
  • The concurrency with which a data node copies data depends on the maximum size of the snapshot thread pool, which is min(5, (number of processors) / 2).
  • Snapshots are taken only from primary shards.
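For reference, the manual segment merge mentioned above can be triggered with the force_merge API (a sketch; the index name is illustrative):

curl -X POST "localhost:9200/index_1/_forcemerge?max_num_segments=1"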
