Getting started with documents in Elasticsearch

6.1. Default fields explained

{
  "_index" : "book",
  "_type" : "_doc",
  "_id" : "1",
  "_version" : 1,
  "_seq_no" : 10,
  "_primary_term" : 1,
  "found" : true,
  "_source" : {
    "name" : "Bootstrap Development tutorial 1",
    "description" : "Bootstrap By Twitter Development of a front page launched css Framework is a very popular development framework, which integrates a variety of page effects. This development framework contains a large number of CSS,JS Program code, can help developers (especially not good at css Page developers) easily implement a css,Beautiful interface without browser restrictions css effect.",
    "studymodel" : "201002",
    "price" : 38.6,
    "timestamp" : "2019-08-25 19:11:35",
    "pic" : "group1/M00/00/00/wKhlQFs6RCeAY0pHAAJx5ZjNDEM428.jpg",
    "tags" : [
      "bootstrap",
      "development"
    ]
  }
}

6.1.1 _index

  • Meaning: the index this document belongs to.

  • Principle: similar data goes into the same index, much like deciding which table a row belongs to in a database. For example, book information goes into the book index and employee information goes into the employee index. Each index is stored and searched independently of the others.

  • Naming rule: lowercase English letters. Avoid special characters where possible.

6.1.2 _type

  • Meaning: the category of the document, for example book, java, or node.

  • Note: types are being phased out. es7 already de-emphasizes them and later versions remove them completely, so there is no need to pay attention to this field. In the examples here, every type is _doc.

6.1.3 _id

Meaning: the unique identifier of the document, like the id primary key of a table. Together with the index, it uniquely identifies a document.

Generation: manual (PUT /index/_doc/id) or automatic.

6.1.4 when creating an index, different data is put into different indexes

6.2. Generate document id

6.2.1 generate id manually

Scenario: data imported from another system already has a unique primary key, such as book or employee records in a database.

Usage: PUT /index/_doc/id

PUT /test_index/_doc/1
{
  "test_field": "test"
}

6.2.2 automatically generate id

Usage: POST /index/_doc

POST /test_index/_doc
{
  "test_field": "test1"
}

return:

{
  "_index" : "test_index",
  "_type" : "_doc",
  "_id" : "x29LOm0BPsY0gSJFYZAl",
  "_version" : 1,
  "result" : "created",
  "_shards" : {
    "total" : 2,
    "successful" : 1,
    "failed" : 0
  },
  "_seq_no" : 0,
  "_primary_term" : 1
}

Automatic id features:

20 characters long, URL-safe, base64-encoded, GUID-based, and free of conflicts when generated in parallel across a distributed cluster.
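To get a feel for the format described above, the following Python sketch produces a 20-character, URL-safe, base64-encoded identifier from 15 random bytes. It only imitates the shape of an auto-generated id. It is not the algorithm es uses internally, and the function name make_id is made up for illustration.

```python
import base64
import os


def make_id() -> str:
    """Return a 20-character URL-safe id (illustrative only, not es's generator)."""
    # 15 random bytes encode to exactly 20 base64 characters, with no '=' padding.
    raw = os.urandom(15)
    return base64.urlsafe_b64encode(raw).decode("ascii")


print(make_id())
```

Because the id is drawn from 120 random bits, two nodes generating ids independently are extremely unlikely to collide, which is the property the list above calls "no conflict in distributed generation".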

6.3. _source field

6.3.1 _source

Meaning: all the fields and values that were supplied when the document was inserted. A GET request returns the _source field exactly as it was stored.

GET /book/_doc/1

6.3.2 custom return field

This is the same idea as in SQL, where you write select name, price from book instead of select *.

GET /book/_doc/1?_source_includes=name,price

{
  "_index" : "book",
  "_type" : "_doc",
  "_id" : "1",
  "_version" : 1,
  "_seq_no" : 10,
  "_primary_term" : 1,
  "found" : true,
  "_source" : {
    "price" : 38.6,
    "name" : "Bootstrap Development tutorial 1"
  }
}
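The effect of the filter can be pictured with a few lines of Python. This is a minimal sketch that assumes only top-level field names are listed; the helper filter_source is hypothetical and ignores features such as wildcards or nested paths.

```python
def filter_source(source, includes):
    """Keep only the top-level fields of `source` that are named in `includes`."""
    return {key: value for key, value in source.items() if key in includes}


book = {
    "name": "Bootstrap Development tutorial 1",
    "studymodel": "201002",
    "price": 38.6,
}
print(filter_source(book, ["name", "price"]))
# {'name': 'Bootstrap Development tutorial 1', 'price': 38.6}
```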

6.4. Replacement and deletion of documents

6.4.1 full replacement

Run the request below twice: the returned version number (_version) increases each time. This process is a full replacement.

PUT /test_index/_doc/1
{
  "test_field": "test"
}

In essence: the old document is not deleted immediately; it is only marked as deleted. The cluster physically removes such documents later, at a suitable time.

6.4.2 forced creation

To prevent existing data from being overwritten, a document can be added with forced creation. If a document with that id already exists, the request fails and the original document is left untouched.

Syntax: PUT /index/_doc/id/_create (es7 also accepts PUT /index/_create/id)

PUT /test_index/_doc/1/_create
{
  "test_field": "test"
}

return

{
  "error": {
    "root_cause": [
      {
        "type": "version_conflict_engine_exception",
        "reason": "[2]: version conflict, document already exists (current version [1])",
        "index_uuid": "lqzVqxZLQuCnd6LYtZsMkg",
        "shard": "0",
        "index": "test_index"
      }
    ],
    "type": "version_conflict_engine_exception",
    "reason": "[2]: version conflict, document already exists (current version [1])",
    "index_uuid": "lqzVqxZLQuCnd6LYtZsMkg",
    "shard": "0",
    "index": "test_index"
  },
  "status": 409
}
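The difference between full replacement and forced creation can be modelled in a few lines of Python. This is a toy in-memory sketch, not es code: the class TinyIndex and the exception VersionConflict are names invented here, with the exception standing in for the 409 response shown above.

```python
class VersionConflict(Exception):
    """Stands in for the HTTP 409 version_conflict_engine_exception."""


class TinyIndex:
    def __init__(self):
        self._docs = {}  # id -> (version, source)

    def index(self, doc_id, source):
        """Full replacement: always succeeds and bumps the version."""
        version = self._docs.get(doc_id, (0, None))[0] + 1
        self._docs[doc_id] = (version, source)
        return version

    def create(self, doc_id, source):
        """Forced creation: refuses to overwrite an existing document."""
        if doc_id in self._docs:
            raise VersionConflict(f"[{doc_id}]: document already exists")
        return self.index(doc_id, source)


idx = TinyIndex()
print(idx.index("1", {"test_field": "test"}))  # 1
print(idx.index("1", {"test_field": "test"}))  # 2, the same PUT replaces the document
try:
    idx.create("1", {"test_field": "test"})
except VersionConflict as error:
    print("conflict:", error)
```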

6.4.3 deletion

Syntax: DELETE /index/_doc/id

DELETE /test_index/_doc/1

In essence: this is a lazy delete. The document is not removed immediately; it is only marked as deleted, and the cluster physically removes it later, at a suitable time.
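A lazy delete can be sketched as a store that keeps a marker (a tombstone) instead of dropping the entry. The Python model below is a simplification under that assumption; the class LazyDeleteStore and its purge method are hypothetical names, and the real cleanup inside es is more involved.

```python
class LazyDeleteStore:
    def __init__(self):
        self._entries = {}  # id -> {"version": int, "source": dict or None, "deleted": bool}

    def put(self, doc_id, source):
        version = self._entries.get(doc_id, {"version": 0})["version"] + 1
        self._entries[doc_id] = {"version": version, "source": source, "deleted": False}
        return version

    def delete(self, doc_id):
        entry = self._entries[doc_id]
        # Only mark the entry; the version history survives until purge().
        entry.update(version=entry["version"] + 1, source=None, deleted=True)
        return entry["version"]

    def get(self, doc_id):
        entry = self._entries.get(doc_id)
        return None if entry is None or entry["deleted"] else entry["source"]

    def purge(self):
        """Physically drop entries that were marked as deleted; return how many."""
        before = len(self._entries)
        self._entries = {k: v for k, v in self._entries.items() if not v["deleted"]}
        return before - len(self._entries)


store = LazyDeleteStore()
print(store.put("1", {"test_field": "test"}))  # 1
print(store.delete("1"))                       # 2
print(store.get("1"))                          # None
print(store.put("1", {"test_field": "test"}))  # 3, the marker kept the version going
```

Re-adding the document before the marker is purged continues the version sequence, while a purge followed by a new put starts again at 1.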

6.5. partial update

PUT /index/_doc/id replaces the whole document, so every field of the document has to be submitted.

A partial update changes only the fields that are submitted.

Usage:

POST /index/_doc/id/_update
{
   "doc": {
      "field": "value"
   }
}

How it works internally

Internally, a partial update works the same way as a full replacement: the old document is marked as deleted and a new document is created.

Advantages:

  • Far fewer network round trips and much less traffic, which improves performance.

  • A lower probability of concurrent conflicts.

demonstration

Insert document

PUT /test_index/_doc/5
{
  "test_field1": "hello",
  "test_field2": "yfy"
}

Modify test_field2

POST /test_index/_doc/5/_update
{
  "doc": {
    "test_field2": " yfy 2"
  }
}
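What the "doc" body does to the stored document can be sketched as a recursive merge. The Python below is an illustration under the assumption that nested objects are merged field by field while scalars and arrays are replaced; merge_doc is a hypothetical helper, not an es API.

```python
import copy


def merge_doc(source, partial):
    """Return a new document with `partial` merged into `source`."""
    merged = copy.deepcopy(source)
    for key, value in partial.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_doc(merged[key], value)  # merge nested objects
        else:
            merged[key] = copy.deepcopy(value)  # scalars and arrays are replaced
    return merged


old = {"test_field1": "hello", "test_field2": "yfy"}
new = merge_doc(old, {"test_field2": " yfy 2"})
print(new)  # {'test_field1': 'hello', 'test_field2': ' yfy 2'}
```

Only test_field2 changes; test_field1 is carried over from the old document, which is why the client does not need to send it.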

6.6. Update with script

es can perform complex operations through its built-in scripting support, for example with the Painless scripting language.

Note: Groovy scripts are no longer supported from es6 onward, because of their memory consumption and their exposure to remote code injection.

6.6.1 built in script

Requirement 1: increment the num field of document 6 by 1.

insert data

PUT /test_index/_doc/6
{
  "num": 0,
  "tags": []
}

Execute script operation

POST /test_index/_doc/6/_update
{
   "script" : "ctx._source.num+=1"
}

Query data

GET /test_index/_doc/6

return

{
  "_index" : "test_index",
  "_type" : "_doc",
  "_id" : "6",
  "_version" : 2,
  "_seq_no" : 23,
  "_primary_term" : 1,
  "found" : true,
  "_source" : {
    "num" : 1,
    "tags" : [ ]
  }
}

Requirement 2: search all documents and return the num field multiplied by 2.

insert data

PUT /test_index/_doc/7
{
  "num": 5
}

query

GET /test_index/_search
{
  "script_fields": {
    "my_doubled_field": {
      "script": {
        "lang": "expression",
        "source": "doc['num'] * multiplier",
        "params": {
          "multiplier": 2
        }
      }
    }
  }
}

return

{
  "_index" : "test_index",
  "_type" : "_doc",
  "_id" : "7",
  "_score" : 1.0,
  "fields" : {
    "my_doubled_field" : [
      10.0
    ]
  }
}

6.6.2 external script

Painless is supported out of the box and is enabled by default. Script content can be passed to es in several ways, including through the REST interface or by placing it in the config/scripts directory.

Note: scripts perform poorly and are prone to injection, so they are not covered further in this tutorial.

Official documents: https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-scripting-using.html

6.7. Concurrency problems in es

As in a flash-sale (seckill) scenario, es runs into concurrency conflicts when several threads modify the same document at once.

6.8. Pessimistic and optimistic locking mechanisms

Locks are the usual way to control concurrency. There are two mechanisms: pessimistic locks and optimistic locks.

Pessimistic lock: assumes the worst and locks in every situation, so only one thread can operate on the data at a time. Row-level locks, table-level locks, read locks, and write locks are typical examples.

Features: the advantage is convenience, because the data is locked directly and the locking is transparent to the program. The disadvantage is low efficiency.

Optimistic lock: assumes conflicts are rare and does not lock the data itself. Instead, a check at submit time verifies whether a conflict occurred, such as the version number check in es.

Features: the advantage is high concurrency. The disadvantage is that it is more cumbersome to use, because a submission may have to be retried several times.
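The two mechanisms can be compared side by side in a small threaded Python sketch. It is an illustration only: the class Stock and the helper run_concurrently are invented for this example, and the short lock inside write_if_version exists solely to make the version check and the write one atomic step.

```python
import threading


class Stock:
    def __init__(self, quantity):
        self.quantity = quantity
        self.version = 1
        self._lock = threading.Lock()

    def decrement_pessimistic(self):
        # Pessimistic: hold the lock for the whole read-modify-write cycle.
        with self._lock:
            self.quantity -= 1
            self.version += 1

    def read(self):
        with self._lock:
            return self.quantity, self.version

    def write_if_version(self, quantity, expected_version):
        with self._lock:
            if self.version != expected_version:
                return False  # someone else wrote first
            self.quantity = quantity
            self.version += 1
            return True

    def decrement_optimistic(self):
        # Optimistic: read, compute without holding the lock, verify on submit, retry on conflict.
        while True:
            quantity, version = self.read()
            if self.write_if_version(quantity - 1, version):
                return


def run_concurrently(operation, threads=4, calls_per_thread=250):
    def worker():
        for _ in range(calls_per_thread):
            operation()

    pool = [threading.Thread(target=worker) for _ in range(threads)]
    for thread in pool:
        thread.start()
    for thread in pool:
        thread.join()


stock = Stock(1000)
run_concurrently(stock.decrement_optimistic)
print(stock.quantity, stock.version)  # 0 1001
```

Both variants end with the correct quantity. The optimistic one simply pays for conflicts with extra loop iterations instead of making every caller wait for a lock.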

6.9. es internal optimistic lock control based on _version

Version control based on _version

es implements optimistic locking with version numbers.

1 Add the document several times:

PUT /test_index/_doc/3
{
  "test_field": "test"
}

The returned version number increases each time.

2 Delete the document:

DELETE /test_index/_doc/3

return

{
  "_index" : "test_index",
  "_type" : "_doc",
  "_id" : "2",
  "_version" : 6,
  "result" : "deleted",
  "_shards" : {
    "total" : 2,
    "successful" : 1,
    "failed" : 0
  },
  "_seq_no" : 7,
  "_primary_term" : 1
}

3 add again

PUT /test_index/_doc/3
{
  "test_field": "test"
}

The version number keeps increasing from where it left off instead of restarting at 1, which confirms the delayed deletion policy: the deleted document was only marked, not removed.

If every delete had to take effect immediately, all shards and replicas would have to remove the data at once, which would put too much pressure on the es cluster.

Synchronization between primary and replica shards in es is multithreaded and asynchronous, so es relies on the optimistic locking mechanism.

6.10. Demo: concurrent client operations based on _version

The Java and Python clients follow this same update mechanism.

New document

PUT /test_index/_doc/5
{
  "test_field": "itcast"
}

return:

{
  "_index" : "test_index",
  "_type" : "_doc",
  "_id" : "3",
  "_version" : 1,
  "result" : "created",
  "_shards" : {
    "total" : 2,
    "successful" : 1,
    "failed" : 0
  },
  "_seq_no" : 8,
  "_primary_term" : 1
}

Client 1 modifies the document, using version number 1.

First get the current version number of the data

GET /test_index/_doc/5

Update document

PUT /test_index/_doc/5?version=1
{
  "test_field": "itcast1"
}
es7
PUT /test_index/_doc/5?if_seq_no=21&if_primary_term=1
{
  "test_field": "itcast1"
}

Client 2 modifies the document concurrently, also using version number 1.

PUT /test_index/_doc/5?version=1
{
  "test_field": "itcast2"
}
es7
PUT /test_index/_doc/5?if_seq_no=21&if_primary_term=1
{
  "test_field": "itcast2"
}

An error is reported: version conflict.

Client 2 queries again and gets the latest version, 2 (seq_no=22).

GET /test_index/_doc/5

Client 2 modifies the document again, using version number 2.

PUT /test_index/_doc/5?version=2
{
  "test_field": "itcast2"
}
es7
PUT /test_index/_doc/5?if_seq_no=22&if_primary_term=1
{
  "test_field": "itcast2"
}

Modification succeeded.
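The exchange between the two clients amounts to a compare-and-set on (_seq_no, _primary_term). The Python model below is a toy under that assumption, not es code: TinyShard and VersionConflict are invented names, and the sequence numbers start at 0 here rather than at the values shown in the requests above.

```python
class VersionConflict(Exception):
    """Stands in for the HTTP 409 version_conflict_engine_exception."""


class TinyShard:
    def __init__(self):
        self._docs = {}        # id -> {"_seq_no", "_primary_term", "_source"}
        self._next_seq_no = 0  # grows with every write on this shard
        self.primary_term = 1

    def get(self, doc_id):
        return dict(self._docs[doc_id])

    def index(self, doc_id, source, if_seq_no=None, if_primary_term=None):
        current = self._docs.get(doc_id)
        if if_seq_no is not None:
            expected = (if_seq_no, if_primary_term)
            actual = (current["_seq_no"], current["_primary_term"]) if current else None
            if actual != expected:
                raise VersionConflict(f"[{doc_id}]: expected {expected}, found {actual}")
        entry = {"_seq_no": self._next_seq_no, "_primary_term": self.primary_term, "_source": source}
        self._next_seq_no += 1
        self._docs[doc_id] = entry
        return dict(entry)


shard = TinyShard()
shard.index("5", {"test_field": "itcast"})
seen = shard.get("5")  # both clients read the same _seq_no
shard.index("5", {"test_field": "itcast1"}, seen["_seq_no"], seen["_primary_term"])  # client 1 wins
try:
    shard.index("5", {"test_field": "itcast2"}, seen["_seq_no"], seen["_primary_term"])
except VersionConflict as error:
    print("client 2 conflict:", error)
fresh = shard.get("5")  # client 2 queries again
shard.index("5", {"test_field": "itcast2"}, fresh["_seq_no"], fresh["_primary_term"])
print(shard.get("5"))
```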

6.11. Demo: controlling the version number manually with an external version

Background: when the data already lives in another database (HBase, for example) that maintains its own version number, external version control can be used.

Requirement: the external version must be greater than the document's current _version.

Comparison: with _version-based control, the version supplied with the modification must equal the document's current version.

Usage: ?version=1&version_type=external

New document

PUT /test_index/_doc/4
{
  "test_field": "itcast"
}

Update document:

Client 1 modifies the document

PUT /test_index/_doc/4?version=2&version_type=external
{
  "test_field": "itcast1"
}

Client 2 modifies it at the same time

PUT /test_index/_doc/4?version=2&version_type=external
{
  "test_field": "itcast2"
}

return:

{
  "error": {
    "root_cause": [
      {
        "type": "version_conflict_engine_exception",
        "reason": "[4]: version conflict, current version [2] is higher or equal to the one provided [2]",
        "index_uuid": "-rqYZ2EcSPqL6pu8Gi35jw",
        "shard": "1",
        "index": "test_index"
      }
    ],
    "type": "version_conflict_engine_exception",
    "reason": "[4]: version conflict, current version [2] is higher or equal to the one provided [2]",
    "index_uuid": "-rqYZ2EcSPqL6pu8Gi35jw",
    "shard": "1",
    "index": "test_index"
  },
  "status": 409
}

Client 2 queries the data again

GET /test_index/_doc/4

Client 2 modifies data again

PUT /test_index/_doc/4?version=3&version_type=external
{
  "test_field": "itcast2"
}
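The two comparison rules can be written down as one small Python function. This is a sketch of the rule as described in this section, with hypothetical names (check_version, VersionConflict); it returns the version the document would carry after an accepted write.

```python
class VersionConflict(Exception):
    """Stands in for the HTTP 409 version_conflict_engine_exception."""


def check_version(current, provided, version_type="internal"):
    """Validate a write and return the version the document has afterwards."""
    if version_type == "external":
        accepted = provided > current   # must be strictly greater
        new_version = provided          # the external number is stored as-is
    elif version_type == "internal":
        accepted = provided == current  # must match exactly
        new_version = current + 1
    else:
        raise ValueError(f"unknown version_type: {version_type}")
    if not accepted:
        raise VersionConflict(f"current version [{current}], provided [{provided}]")
    return new_version


version = 1                                      # document just created
version = check_version(version, 2, "external")  # client 1: accepted, now 2
try:
    check_version(version, 2, "external")        # client 2: 2 is not greater than 2
except VersionConflict as error:
    print("conflict:", error)
version = check_version(version, 3, "external")  # client 2 retries with 3
print(version)  # 3
```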

6.12. The retry_on_conflict parameter for updates

retry_on_conflict specifies how many times an update is retried when it hits a version conflict.

POST /test_index/_doc/5/_update?retry_on_conflict=3
{
  "doc": {
    "test_field": "itcast1"
  }
}

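In combination with _version (depending on the es version, the update API may reject an explicit version or version_type=external together with retry_on_conflict, so check this request against your cluster):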

POST /test_index/_doc/5/_update?retry_on_conflict=3&version=22&version_type=external
{
  "doc": {
    "test_field": "itcast1"
  }
}
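Conceptually, retry_on_conflict=3 repeats the read, merge, and write cycle up to three more times when the write hits a version conflict. The Python sketch below imitates that loop against an in-memory store. All names (Store, update_with_retry, the before_write hook that simulates a rival writer) are made up for illustration, and in es the retry happens on the server rather than in client code.

```python
class VersionConflict(Exception):
    """Stands in for the HTTP 409 version_conflict_engine_exception."""


class Store:
    def __init__(self, source):
        self.source, self.version = dict(source), 1

    def read(self):
        return dict(self.source), self.version

    def write(self, source, expected_version):
        if expected_version != self.version:
            raise VersionConflict(f"expected {expected_version}, found {self.version}")
        self.source, self.version = dict(source), self.version + 1


def update_with_retry(store, partial, retry_on_conflict=0, before_write=None):
    """Apply `partial`; return how many retries were needed."""
    attempts = 0
    while True:
        source, version = store.read()
        source.update(partial)
        if before_write:
            before_write(attempts)  # hook used to simulate a concurrent writer
        try:
            store.write(source, version)
            return attempts
        except VersionConflict:
            if attempts >= retry_on_conflict:
                raise  # retries exhausted, surface the conflict
            attempts += 1


store = Store({"test_field": "itcast"})


def rival(attempt):
    # A competing client sneaks in a write, but only during the first attempt.
    if attempt == 0:
        store.write({"test_field": "itcast", "other": "x"}, store.version)


print(update_with_retry(store, {"test_field": "itcast1"}, retry_on_conflict=3, before_write=rival))  # 1
print(store.source)  # {'test_field': 'itcast1', 'other': 'x'}
```

The retry re-reads the document, so the rival's change survives and only test_field is overwritten.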

6.13. Batch query mget

A single query looks like GET /test_index/_doc/1. Querying documents with several ids one request at a time costs too much network overhead.

mget batch query:

GET /_mget
{
   "docs" : [
      {
         "_index" : "test_index",
         "_type" :  "_doc",
         "_id" :    1
      },
      {
         "_index" : "test_index",
         "_type" :  "_doc",
         "_id" :    7
      }
   ]
}

return:

{
  "docs" : [
    {
      "_index" : "test_index",
      "_type" : "_doc",
      "_id" : "2",
      "_version" : 6,
      "_seq_no" : 12,
      "_primary_term" : 1,
      "found" : true,
      "_source" : {
        "test_field" : "test12333123321321"
      }
    },
    {
      "_index" : "test_index",
      "_type" : "_doc",
      "_id" : "3",
      "_version" : 6,
      "_seq_no" : 18,
      "_primary_term" : 1,
      "found" : true,
      "_source" : {
        "test_field" : "test3213"
      }
    }
  ]
}

es7 warns that the type should be removed, so _type can be dropped:

GET /_mget
{
   "docs" : [
      {
         "_index" : "test_index",
         "_id" :    2
      },
      {
         "_index" : "test_index",
         "_id" :    3
      }
   ]
}

Batch query under the same index:

GET /test_index/_mget
{
   "docs" : [
      {
         "_id" :    2
      },
      {
         "_id" :    3
      }
   ]
}

A third way to write it, as a search:

POST /test_index/_search
{
    "query": {
        "ids" : {
            "values" : ["1", "7"]
        }
    }
}
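When the ids come from application code, the mget request body is easy to assemble programmatically. The Python helper below is a minimal sketch; mget_body is a hypothetical name, and it only builds the JSON shown in the requests above without sending anything to es.

```python
import json


def mget_body(ids, index=None):
    """Build the JSON body for an _mget request from a list of ids."""
    docs = []
    for doc_id in ids:
        entry = {}
        if index is not None:
            entry["_index"] = index  # needed for GET /_mget, optional for GET /<index>/_mget
        entry["_id"] = str(doc_id)
        docs.append(entry)
    return json.dumps({"docs": docs})


print(mget_body([2, 3]))
# {"docs": [{"_id": "2"}, {"_id": "3"}]}
print(mget_body([2, 3], index="test_index"))
```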

6.14. Batch adding, deleting, and modifying bulk

A bulk request bundles a series of add, delete, and modify operations on documents into a single request, which reduces the number of network round trips.

Syntax:

POST /_bulk
{"action": {"metadata"}}
{"data"}

Example: delete document 5, create document 14, and update document 2.

POST /_bulk
{ "delete": { "_index": "test_index",  "_id": "5" }} 
{ "create": { "_index": "test_index",  "_id": "14" }}
{ "test_field": "test14" }
{ "update": { "_index": "test_index",  "_id": "2"} }
{ "doc" : {"test_field" : "bulk test"} }

Summary:

1 Actions:

  • delete: deletes a document. It needs only the action line, with no data line after it.

  • create: equivalent to forced creation, PUT /index/_doc/id/_create.

  • index: an ordinary PUT, which either creates the document or fully replaces it.

  • update: performs a partial update.

2 Format: each JSON object must sit on a single line with no line breaks inside it, and adjacent JSON objects must be separated by a line break.

3 Isolation: the operations do not affect one another. An operation that fails reports its own failure information in the response.

4 Practical usage: a single bulk request should not be too large, because an oversized request piles up in memory and degrades performance. A few thousand operations per request, with a total size of a few megabytes, is a reasonable range.
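The format rules in point 2 are easy to get wrong by hand, so a bulk body is usually generated. The Python sketch below assembles the newline-delimited body and splits a long list of operations into batches. The helpers bulk_body and chunks are hypothetical names, and the batch size is left to the caller rather than taken from any es default.

```python
import json


def bulk_body(operations):
    """Build a _bulk body from (action, metadata, data) tuples; data is None for delete."""
    lines = []
    for action, metadata, data in operations:
        lines.append(json.dumps({action: metadata}))  # json.dumps emits a single line
        if action != "delete":
            if data is None:
                raise ValueError(f"{action} needs a data line")
            lines.append(json.dumps(data))
    return "\n".join(lines) + "\n"  # the body must end with a newline


def chunks(items, size):
    """Yield successive batches of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


operations = [
    ("delete", {"_index": "test_index", "_id": "5"}, None),
    ("create", {"_index": "test_index", "_id": "14"}, {"test_field": "test14"}),
    ("update", {"_index": "test_index", "_id": "2"}, {"doc": {"test_field": "bulk test"}}),
]
print(bulk_body(operations), end="")
```

For a large import, each batch from chunks(operations, size) would be turned into its own body and sent as a separate POST /_bulk request.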

6.15. Document concept learning summary

Chapter review

1 addition, deletion, modification and query of documents

2 document field analysis

3 internal locking mechanism

4 batch query and modification

What is es

A distributed document store. es is a distributed NoSQL database, in the same broad family as Redis, MongoDB, and HBase.

Document data: es stores and operates on JSON documents, which are its core data structure. Storage system: es can store, query, create, update, and delete JSON documents.

Application scenarios

  • Big data. Because es is distributed, it scales horizontally as the data grows.

  • Flexible data structures. Fields can change at any time, whereas a relational database would need a large number of association tables and make the system more complex.

  • Simple data operations. The workload is queries rather than transactions.

Example

E-commerce pages, traditional forum pages, and the like. The objects involved are fairly complex, but the page itself needs no complex functionality such as transactions, only simple create, read, update, and delete (CRUD) operations.

In such cases es, as a NoSQL data store, fits better than a traditional, complex, and powerful relational database, and both performance and throughput may be better.


Posted on Thu, 04 Jun 2020 09:40:52 -0400 by literom