Elasticsearch Series, Part 3: Indexing in Detail (Analyzers, Document Management, and Routing in a Cluster)

Original article: https://www.cnblogs.com/leeSmall/p/9195782.html

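I. Analyzers

1. Understanding analyzers

1.1 Analyzer

In ES, an analyzer is built from the following three kinds of components:

character filter: preprocesses the raw text at the character level, for example by stripping HTML tags, and then hands the result to the tokenizer. An analyzer may contain zero or more character filters; they run in the order in which they are configured.
tokenizer: splits the text into tokens. An analyzer must contain exactly one tokenizer.
token filter: post-processes the tokens produced by the tokenizer, for example lowercasing, stop-word removal, or synonym handling. An analyzer may contain zero or more token filters; they run in the order in which they are configured.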
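The composition described above can be sketched in a few lines of Python. This is only an illustration of the order of the three stages, not how ES implements them: the function names (html_strip, whitespace_tokenizer, lowercase, stop, analyze) and the tiny stop-word list are hypothetical stand-ins.

```python
import re

def html_strip(text):
    # Character filter: crude tag removal, a stand-in for ES's html_strip.
    return re.sub(r"<[^>]+>", "", text)

def whitespace_tokenizer(text):
    # Tokenizer: an analyzer has exactly one.
    return text.split()

def lowercase(tokens):
    # Token filter: lowercase every token.
    return [t.lower() for t in tokens]

def stop(tokens, stopwords=frozenset({"the", "a", "is"})):
    # Token filter: drop stop words (this tiny list is illustrative only).
    return [t for t in tokens if t not in stopwords]

def analyze(text, char_filters, tokenizer, token_filters):
    for char_filter in char_filters:      # zero or more, in configured order
        text = char_filter(text)
    tokens = tokenizer(text)              # exactly one
    for token_filter in token_filters:    # zero or more, in configured order
        tokens = token_filter(tokens)
    return tokens

tokens = analyze("<p>The QUICK brown fox</p>",
                 [html_strip], whitespace_tokenizer, [lowercase, stop])
print(tokens)  # ['quick', 'brown', 'fox']
```

Swapping the order of lowercase and stop would let "The" slip past the stop filter, which is why the configured order of token filters matters.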

1.2 How to test an analyzer

POST _analyze
{
  "analyzer": "whitespace",
  "text":     "The quick brown fox."
}

POST _analyze
{
  "tokenizer": "standard",
  "filter":  [ "lowercase", "asciifolding" ],
  "text":      "Is this déja vu?"
}

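position: the ordinal position of the token in the token stream

offset: the character offsets of the token in the original text (start_offset and end_offset)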
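To make position and offset concrete, here is a small Python sketch that mimics the shape of an _analyze response for whitespace tokenization. The helper name whitespace_tokens is hypothetical, and the real response also carries a type field that is omitted here.

```python
import re

def whitespace_tokens(text):
    # One dict per token: offsets are character indexes into the original
    # text, position is the token's ordinal in the stream (starting at 0).
    return [
        {
            "token": match.group(),
            "start_offset": match.start(),
            "end_offset": match.end(),
            "position": index,
        }
        for index, match in enumerate(re.finditer(r"\S+", text))
    ]

for token in whitespace_tokens("The quick brown fox."):
    print(token)
```

For the text "The quick brown fox." the second token is "quick" with start_offset 4, end_offset 9, and position 1.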

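2. Built-in character filters

HTML Strip Character Filter
    html_strip: strips HTML tags and decodes HTML entities such as &amp;.
Mapping Character Filter
    mapping: replaces occurrences of given strings in the text with the specified replacements.
Pattern Replace Character Filter
    pattern_replace: performs regular-expression replacement.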

2.1 HTML Strip Character Filter

POST _analyze
{
  "tokenizer":      "keyword", 
  "char_filter":  [ "html_strip" ],
  "text": "<p>I&apos;m so <b>happy</b>!</p>"
}

Configuring it in an index:

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "keyword",
          "char_filter": ["my_char_filter"]
        }
      },
      "char_filter": {
        "my_char_filter": {
          "type": "html_strip",
          "escaped_tags": ["b"]
        }
      }
    }
  }
}

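escaped_tags lists the tags that are exempt from stripping. If no tags need to be exempted, there is no need to define a custom character filter; just reference html_strip directly in my_analyzer above.

Test: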

POST my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "<p>I&apos;m so <b>happy</b>!</p>"
}

2.2 Mapping character filter

Official documentation: https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-mapping-charfilter.html

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "keyword",
          "char_filter": [
            "my_char_filter"
          ]
        }
      },
      "char_filter": {
        "my_char_filter": {
          "type": "mapping",
          "mappings": [
            "٠ => 0",
            "١ => 1",
            "٢ => 2",
            "٣ => 3",
            "٤ => 4",
            "٥ => 5",
            "٦ => 6",
            "٧ => 7",
            "٨ => 8",
            "٩ => 9"
          ]
        }
      }
    }
  }
}

Test:

POST my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "My license plate is ٢٥٠١٥"
}
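As a rough offline check of what this mapping does, the same ten rules can be reproduced with Python's str.translate. This is a sketch of the substitution only, not of ES itself; DIGIT_MAP and map_chars are hypothetical names, and the input is written with \u escapes to avoid any ambiguity with right-to-left characters.

```python
# Same ten rules as the "mappings" array: Arabic-Indic digit -> ASCII digit.
DIGIT_MAP = {chr(0x0660 + d): str(d) for d in range(10)}
TABLE = str.maketrans(DIGIT_MAP)

def map_chars(text):
    # Apply every single-character rule in one pass over the text.
    return text.translate(TABLE)

plate = "My license plate is \u0662\u0665\u0660\u0661\u0665"
print(map_chars(plate))  # My license plate is 25015
```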

2.3 Pattern Replace Character Filter

Official documentation: https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-pattern-replace-charfilter.html

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "standard",
          "char_filter": [
            "my_char_filter"
          ]
        }
      },
      "char_filter": {
        "my_char_filter": {
          "type": "pattern_replace",
          "pattern": "(\\d+)-(?=\\d)",
          "replacement": "$1_"
        }
      }
    }
  }
}

Test:

POST my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "My credit card is 123-456-789"
}
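The effect of the pattern can be previewed outside ES with Python's re module. This is a sketch of the substitution step alone; note that Python writes the back-reference as \1 where the ES setting (Java regex syntax) uses $1, and replace_dashes is a hypothetical helper name.

```python
import re

# Same pattern as in the char filter: digits followed by a dash that is
# itself followed by a digit (the lookahead does not consume that digit).
PATTERN = re.compile(r"(\d+)-(?=\d)")

def replace_dashes(text):
    return PATTERN.sub(r"\1_", text)

print(replace_dashes("My credit card is 123-456-789"))
# My credit card is 123_456_789
```

Because the dashes are replaced before tokenization, a tokenizer that would otherwise split on "-" sees a single run of digits and underscores.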

3. Built-in tokenizers

Official documentation: https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-tokenizers.html

Standard Tokenizer
Letter Tokenizer
Lowercase Tokenizer
Whitespace Tokenizer
UAX URL Email Tokenizer
Classic Tokenizer
Thai Tokenizer
NGram Tokenizer
Edge NGram Tokenizer
Keyword Tokenizer
Pattern Tokenizer
Simple Pattern Tokenizer
Simple Pattern Split Tokenizer
Path Hierarchy Tokenizer

The IK analyzer plugin (the Chinese analyzer integrated earlier in this series) provides two tokenizers: ik_smart and ik_max_word.

Testing a tokenizer:

POST _analyze
{
  "tokenizer":      "standard", 
  "text": "張三說的確實在理"
}

POST _analyze
{
  "tokenizer":      "ik_smart", 
  "text": "張三說的確實在理"
}

4. Built-in token filters

ES ships with many built-in token filters. For details, see https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-tokenfilters.html

Lowercase Token Filter: lowercase, converts tokens to lowercase
Stop Token Filter: stop, removes stop words
Synonym Token Filter: synonym, handles synonyms

Note: the IK analyzer has stop-word filtering built in.

4.1 Synonym Token Filter

PUT /test_index
{
    "settings": {
        "index" : {
            "analysis" : {
                "analyzer" : {
                    "my_ik_synonym" : {
                        "tokenizer" : "ik_smart",
                        "filter" : ["synonym"]
                    }
                },
                "filter" : {
                    "synonym" : {
                        "type" : "synonym",
                        "synonyms_path" : "analysis/synonym.txt"
                    }
                }
            }
        }
    }
}

synonyms_path specifies the synonym file; the path is relative to the config directory.

Synonym definition format

ES supports two synonym formats: Solr and WordNet.

Define the following synonyms in analysis/synonym.txt using the Solr format:

張三,李四
電飯煲,電飯鍋 => 電飯煲
電腦 => 計算機,computer

Notes:

The file must be UTF-8 encoded.

Each line holds one synonym rule; => means "normalize to".
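To see how the three rules above read, here is a simplified Python parser for Solr-style synonym lines. It is a sketch under assumptions: comma-separated lines without => are treated as fully equivalent terms, comment lines start with #, and the function name parse_solr_synonyms is hypothetical. The real filter works on token streams and has more options than this models.

```python
def parse_solr_synonyms(lines):
    """Return {input term: [terms it is rewritten to]} for Solr-style rules."""
    mapping = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if "=>" in line:
            # Explicit mapping: every term on the left becomes the right side.
            left, right = line.split("=>", 1)
            inputs = [term.strip() for term in left.split(",")]
            outputs = [term.strip() for term in right.split(",")]
        else:
            # Equivalent synonyms: each term expands to the whole group.
            inputs = outputs = [term.strip() for term in line.split(",")]
        for term in inputs:
            targets = mapping.setdefault(term, [])
            for out in outputs:
                if out not in targets:
                    targets.append(out)
    return mapping

rules = ["張三,李四", "電飯煲,電飯鍋 => 電飯煲", "電腦 => 計算機,computer"]
for term, targets in parse_solr_synonyms(rules).items():
    print(term, "->", targets)
```

Read this way, 電飯鍋 is normalized to 電飯煲, 電腦 is replaced by both 計算機 and computer, and 張三 and 李四 are interchangeable.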

Test: the results of the following examples show how synonyms are handled.

POST test_index/_analyze
{
  "analyzer": "my_ik_synonym",
  "text": "張三說的確實在理"
}

POST test_index/_analyze
{
  "analyzer": "my_ik_synonym",
  "text": "我想買個電飯鍋和一個電腦"
}

5. Built-in analyzers

Official documentation:

https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-analyzers.html

Standard Analyzer
Simple Analyzer
Whitespace Analyzer
Stop Analyzer
Keyword Analyzer
Pattern Analyzer
Language Analyzers
Fingerprint Analyzer

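The integrated IK analyzer plugin provides two analyzers: ik_smart and ik_max_word.

Built-in and integrated analyzers can be used directly. If they do not meet your needs, you can combine character filters, a tokenizer, and token filters into a custom analyzer.

5.1 Custom analyzer

Configuration: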

PUT my_index8
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_ik_analyzer": {
          "type": "custom",
          "tokenizer": "ik_smart",
          "char_filter": [
            "html_strip"
          ],
          "filter": [
             "synonym"
          ]
        }
      },
      "filter": {
        "synonym": {
          "type": "synonym",
          "synonyms_path": "analysis/synonym.txt"
        }
      }
    }
  }
}

5.2 Specifying an analyzer for a field

PUT my_index8/_mapping/_doc
{
  "properties": {
    "title": {
        "type": "text",
        "analyzer": "my_ik_analyzer"
    }
  }
}

If queries on this field need a different analyzer:

PUT my_index8/_mapping/_doc
{
  "properties": {
    "title": {
        "type": "text",
        "analyzer": "my_ik_analyzer",
        "search_analyzer": "other_analyzer" 
    }
  }
}

Testing the result:

PUT my_index8/_doc/1
{
  "title": "張三說的確實在理"
}

GET /my_index8/_search
{
  "query": {
    "term": {
      "title": "張三"
    }
  }
}

5.3 Defining a default analyzer for an index

PUT /my_index10
{
  "settings": {
    "analysis": {
      "analyzer": {
        "default": {
          "tokenizer": "ik_smart",
          "filter": [
            "synonym"
          ]
        }
      },
      "filter": {
        "synonym": {
          "type": "synonym",
          "synonyms_path": "analysis/synonym.txt"
        }
      }
    }
  },
  "mappings": {
    "_doc": {
      "properties": {
        "title": {
          "type": "text"
        }
      }
    }
  }
}

Testing the result:

PUT my_index10/_doc/1
{
  "title": "張三說的確實在理"
}

GET /my_index10/_search
{
  "query": {
    "term": {
      "title": "張三"
    }
  }
}

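6. Order in which analyzers are chosen

An analyzer can be specified per query, per field, and per index.

At index time, ES chooses the analyzer in the following order:

The analyzer specified in the field mapping.
If the field mapping specifies none, the analyzer named default in the index settings.
If the index settings define no default analyzer, the standard analyzer.

At search time, ES chooses the analyzer in the following order: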

The analyzer defined in a full-text query.
The search_analyzer defined in the field mapping.
The analyzer defined in the field mapping.
An analyzer named default_search in the index settings.
An analyzer named default in the index settings.
The standard analyzer.
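The search-time precedence listed above is a "first one that is set wins" rule. A minimal Python sketch, assuming each level is either an analyzer name or None when not configured (the function and parameter names are hypothetical):

```python
def pick_search_analyzer(query_analyzer=None, field_search_analyzer=None,
                         field_analyzer=None, default_search=None,
                         default=None):
    # Candidates are listed from highest to lowest priority.
    candidates = (query_analyzer, field_search_analyzer, field_analyzer,
                  default_search, default)
    for candidate in candidates:
        if candidate is not None:
            return candidate
    return "standard"  # fallback when nothing is configured

print(pick_search_analyzer(field_analyzer="my_ik_analyzer",
                           field_search_analyzer="other_analyzer"))
# other_analyzer
```

The index-time rule is the same idea with a shorter list: field analyzer, then default, then standard.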

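II. Document management

1. Creating a document

With an explicit document ID (creates the document, or replaces it if the ID already exists):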

PUT twitter/_doc/1
{
    "id": 1,
    "user" : "kimchy",
    "post_date" : "2009-11-15T14:12:12",
    "message" : "trying out Elasticsearch"
}

Create with an auto-generated document ID:

POST twitter/_doc/
{
    "id": 1,
    "user" : "kimchy",
    "post_date" : "2009-11-15T14:12:12",
    "message" : "trying out Elasticsearch"
}


2. Getting a single document

HEAD twitter/_doc/11

GET twitter/_doc/1

Without returning the document's _source:

GET twitter/_doc/1?_source=false

Get only the document's _source:

GET twitter/_doc/1/_source

Response of a full GET twitter/_doc/1 (metadata plus _source):

{
  "_index": "twitter",
  "_type": "_doc",
  "_id": "1",
  "_version": 2,
  "found": true,
  "_source": {
    "id": 1,
    "user": "kimchy",
    "post_date": "2009-11-15T14:12:12",
    "message": "trying out Elasticsearch"
  }
}

Getting stored fields

PUT twitter11
{
   "mappings": {
      "_doc": {
         "properties": {
            "counter": {
               "type": "integer",
               "store": false
            },
            "tags": {
               "type": "keyword",
               "store": true
            }
         }
      }
   }
}

PUT twitter11/_doc/1
{
    "counter" : 1,
    "tags" : ["red"]
}

GET twitter11/_doc/1?stored_fields=tags,counter

3. Getting multiple documents with _mget

Option 1:

GET /_mget
{
    "docs" : [
        {
            "_index" : "twitter",
            "_type" : "_doc",
            "_id" : "1"
        },
        {
            "_index" : "twitter",
            "_type" : "_doc",
            "_id" : "2",
            "stored_fields" : ["field3", "field4"]
        }
    ]
}

Option 2:

GET /twitter/_mget
{
    "docs" : [
        {
            "_type" : "_doc",
            "_id" : "1"
        },
        {
            "_type" : "_doc",
            "_id" : "2"
        }
    ]
}

Option 3:

GET /twitter/_doc/_mget
{
    "docs" : [
        {
            "_id" : "1"
        },
        {
            "_id" : "2"
        }
    ]
}

Option 4:

GET /twitter/_doc/_mget
{
    "ids" : ["1", "2"]
}

4. Deleting documents

Delete by document ID:

DELETE twitter/_doc/1

Delete with version control:

DELETE twitter/_doc/1?version=1

Response:

{
    "_shards" : {
        "total" : 2,
        "failed" : 0,
        "successful" : 2
    },
    "_index" : "twitter",
    "_type" : "_doc",
    "_id" : "1",
    "_version" : 2,
    "_primary_term": 1,
    "_seq_no": 5,
    "result": "deleted"
}

Delete by query:

POST twitter/_delete_by_query
{
  "query": { 
    "match": {
      "message": "some message"
    }
  }
}

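When some documents hit version conflicts, do not abort the delete: the conflicting documents are counted and the remaining documents that match the query are still deleted: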

POST twitter/_doc/_delete_by_query?conflicts=proceed
{
  "query": {
    "match_all": {}
  }
}

Use the task API to view delete-by-query tasks:

GET _tasks?detailed=true&actions=*/delete/byquery

Query the status of a specific task:

GET /_tasks/taskId:1

Cancel a task:

POST _tasks/task_id:1/_cancel

5. Updating documents

Update by document ID:

PUT twitter/_doc/1
{
    "id": 1,
    "user" : "kimchy",
    "post_date" : "2009-11-15T14:12:12",
    "message" : "trying out Elasticsearch"
}

Optimistic concurrency control for updates:

PUT twitter/_doc/1?version=1
{
    "id": 1,
    "user" : "kimchy",
    "post_date" : "2009-11-15T14:12:12",
    "message" : "trying out Elasticsearch"
}

Response:

{
  "_index": "twitter",
  "_type": "_doc",
  "_id": "1",
  "_version": 3,
  "result": "updated",
  "_shards": {
    "total": 3,
    "successful": 1,
    "failed": 0
  },
  "_seq_no": 2,
  "_primary_term": 3
}
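The version check behind ?version=1 can be modeled with a small in-memory store. This is a toy model of optimistic locking with internal versions, not ES code: VersionedStore and VersionConflict are hypothetical names, and the store assumes a write succeeds only when the expected version equals the current one.

```python
class VersionConflict(Exception):
    """Raised when the expected version does not match the stored one."""

class VersionedStore:
    def __init__(self):
        self.docs = {}  # doc id -> (version, source)

    def put(self, doc_id, source, version=None):
        current_version = self.docs.get(doc_id, (0, None))[0]
        # With an expected version, refuse the write if someone else
        # has already changed the document in the meantime.
        if version is not None and version != current_version:
            raise VersionConflict(
                f"current version is {current_version}, expected {version}")
        self.docs[doc_id] = (current_version + 1, source)
        return current_version + 1

store = VersionedStore()
store.put("1", {"message": "first"})              # version becomes 1
store.put("1", {"message": "second"}, version=1)  # version becomes 2
try:
    store.put("1", {"message": "stale"}, version=1)
except VersionConflict as err:
    print("conflict:", err)
```

The third write fails because it was prepared against version 1 while the document is already at version 2; the client is expected to re-read and retry.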

6. Scripted update: updating a document with a script

6.1 Prepare a document

PUT uptest/_doc/1
{
    "counter" : 1,
    "tags" : ["red"]
}

6.2 Add 4 to the counter of document 1

POST uptest/_doc/1/_update
{
    "script" : {
        "source": "ctx._source.counter += params.count",
        "lang": "painless",
        "params" : {
            "count" : 4
        }
    }
}

6.3 Append an element to an array

POST uptest/_doc/1/_update
{
    "script" : {
        "source": "ctx._source.tags.add(params.tag)",
        "lang": "painless",
        "params" : {
            "tag" : "blue"
        }
    }
}

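About the script: painless is a scripting language built into ES; ctx is the execution context object (through it you can also access _index, _type, _id, _version, _routing, and _now, the current timestamp); params is the map of parameters.

Note: scripted updates require the index's _source field to be enabled. The update proceeds as follows:

a. Fetch the original document.
b. Run the script against the original data in _source.
c. Delete the original indexed document.
d. Index the modified document.
Compared with fetching, modifying, and re-indexing the document from the client, the update API only saves some network round trips and reduces the chance of a version conflict between the get and the index.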

6.4 Add a field

POST uptest/_doc/1/_update
{
    "script" : "ctx._source.new_field = 'value_of_new_field'"
}

6.5 Remove a field

POST uptest/_doc/1/_update
{
    "script" : "ctx._source.remove('new_field')"
}

6.6 Delete the document or do nothing, depending on a condition

POST uptest/_doc/1/_update
{
    "script" : {
        "source": "if (ctx._source.tags.contains(params.tag)) { ctx.op = 'delete' } else { ctx.op = 'none' }",
        "lang": "painless",
        "params" : {
            "tag" : "green"
        }
    }
}

6.7 Update by merging the supplied partial document into the existing one

POST uptest/_doc/1/_update
{
    "doc" : {
        "name" : "new_name"
    }
}

6.8 Run 6.7 again: the content is identical, so nothing needs to be done

{
  "_index": "uptest",
  "_type": "_doc",
  "_id": "1",
  "_version": 4,
  "result": "noop",
  "_shards": {
    "total": 0,
    "successful": 0,
    "failed": 0
  }
}

6.9 Disable noop detection

POST uptest/_doc/1/_update
{
    "doc" : {
        "name" : "new_name"
    },
    "detect_noop": false
}

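What is noop detection?

By default, when a partial-document update (doc) would leave the document unchanged, ES skips the write and returns "result": "noop". Setting detect_noop to false makes ES write the document, and increment its version, even when nothing changed.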
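As a sketch of the idea, assuming noop detection means "skip the write when merging the partial document changes nothing", the logic looks roughly like this in Python. The helper name partial_update is hypothetical, and the merge here is shallow, whereas ES merges nested objects recursively.

```python
def partial_update(source, doc, detect_noop=True):
    # Merge the partial document over the stored source (shallow merge).
    merged = {**source, **doc}
    if detect_noop and merged == source:
        return source, "noop"      # nothing changed: skip the write
    return merged, "updated"       # write the merged document

stored = {"counter": 1, "name": "new_name"}
print(partial_update(stored, {"name": "new_name"}))
# ({'counter': 1, 'name': 'new_name'}, 'noop')
print(partial_update(stored, {"name": "new_name"}, detect_noop=False))
# ({'counter': 1, 'name': 'new_name'}, 'updated')
```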

6.10 Upsert: if the document to be updated exists, the script is run to update it; if it does not exist, the content of upsert is indexed as a new document.

POST uptest/_doc/1/_update
{
    "script" : {
        "source": "ctx._source.counter += params.count",
        "lang": "painless",
        "params" : {
            "count" : 4
        }
    },
    "upsert" : {
        "counter" : 1
    }
}

7. Updating documents by query

Only documents that match the query are updated:

POST twitter/_update_by_query
{
  "script": {
    "source": "ctx._source.likes++",
    "lang": "painless"
  },
  "query": {
    "term": {
      "user": "kimchy"
    }
  }
}

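8. Bulk operations

The bulk API, /_bulk, performs multiple index, create, update, and delete operations in a single call, which can greatly increase indexing speed. The request body must be newline-delimited JSON with the following structure:

Syntax: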

action_and_meta_data\n
optional_source\n
action_and_meta_data\n
optional_source\n
....
action_and_meta_data\n
optional_source\n

Notes:

action_and_meta_data: the action can be index, create, delete, or update; the metadata consists of _index, _type, and _id. A delete action has no source line. The endpoint can be /_bulk, /{index}/_bulk, or /{index}/{type}/_bulk.

Example:

POST _bulk
{ "index" : { "_index" : "test", "_type" : "_doc", "_id" : "1" } }
{ "field1" : "value1" }
{ "delete" : { "_index" : "test", "_type" : "_doc", "_id" : "2" } }
{ "create" : { "_index" : "test", "_type" : "_doc", "_id" : "3" } }
{ "field1" : "value3" }
{ "update" : {"_id" : "1", "_type" : "_doc", "_index" : "test"} }
{ "doc" : {"field2" : "value2"} }

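8.1 Bulk-indexing multiple documents with curl and a JSON file

Note: accounts.json must be in the directory from which the curl command is run. Most of the test data in later parts of this series comes from this bank dataset.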

curl -H "Content-Type: application/json" -XPOST "localhost:9200/bank/_doc/_bulk?pretty&refresh" --data-binary "@accounts.json"


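9. Reindex

The reindex API, /_reindex, copies the documents of one index into another index. It requires _source to be enabled on the source index. The settings and mappings of the destination index are independent of the source index: they are not copied over.

When is reindexing needed?

Whenever data has to be copied into another index, for example after changing a field mapping that cannot be updated in place.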

POST _reindex
{
  "source": {
    "index": "twitter"
  },
  "dest": {
    "index": "new_twitter"
  }
}

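One issue to consider when reindexing: if the destination index already contains documents that also exist in the source index, how are their versions handled?

1. If version_type is not specified or is set to internal, versions are managed by the destination index: documents are written regardless of their source version, so the reindex simply creates missing documents and overwrites existing ones with the same ID.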

POST _reindex
{
  "source": {
    "index": "twitter"
  },
  "dest": {
    "index": "new_twitter",
    "version_type": "internal"
  }
}

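2. To use the versions from the source index for version control, set version_type to external. The reindex then creates documents that are missing from the destination and updates those whose version in the destination is older than in the source.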

POST _reindex
{
  "source": {
    "index": "twitter"
  },
  "dest": {
    "index": "new_twitter",
    "version_type": "external"
  }
}

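If you only want to copy documents that do not yet exist in the destination index, set op_type to create. Documents that already exist then cause version conflicts, which abort the operation by default; set "conflicts": "proceed" to count the conflicts and continue.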

POST _reindex
{
  "conflicts": "proceed",
  "source": {
    "index": "twitter"
  },
  "dest": {
    "index": "new_twitter",
    "op_type": "create"
  }
}
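The three destination modes described above (version_type internal, version_type external, and op_type create) differ only in what happens when a document ID already exists in the destination. The following Python sketch models that decision on plain dictionaries, assuming conflicts are counted and skipped as with "conflicts": "proceed". It is a toy model with hypothetical names, not the reindex implementation.

```python
def reindex(source, dest, version_type="internal", op_type="index"):
    """source/dest map doc id -> (version, body). Returns the conflict count."""
    conflicts = 0
    for doc_id, (version, body) in source.items():
        existing = dest.get(doc_id)
        if op_type == "create":
            if existing is not None:       # only missing documents are written
                conflicts += 1
                continue
            dest[doc_id] = (1, body)
        elif version_type == "external":
            # Keep the source version; write only if the destination is older.
            if existing is not None and existing[0] >= version:
                conflicts += 1
                continue
            dest[doc_id] = (version, body)
        else:
            # internal: overwrite blindly, the destination bumps its own version.
            dest[doc_id] = ((existing[0] + 1) if existing else 1, body)
    return conflicts

source = {"1": (3, "new"), "2": (1, "other")}
dest = {"1": (5, "old")}
print(reindex(source, dict(dest), op_type="create"))          # 1
print(reindex(source, dict(dest), version_type="external"))   # 1
print(reindex(source, dict(dest)))                            # 0
```

In this example document "1" already exists in the destination at version 5, so create and external each report one conflict, while internal overwrites it.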

You can also reindex only part of the source index, selected by type or by query:

POST _reindex
{
  "source": {
    "index": "twitter",
    "type": "_doc",
    "query": {
      "term": {
        "user": "kimchy"
      }
    }
  },
  "dest": {
    "index": "new_twitter"
  }
}

Data can be read from multiple sources:

POST _reindex
{
  "source": {
    "index": ["twitter", "blog"],
    "type": ["_doc", "post"]
  },
  "dest": {
    "index": "all_together"
  }
}

The number of documents can be limited:

POST _reindex
{
  "size": 10000,
  "source": {
    "index": "twitter",
    "sort": { "date": "desc" }
  },
  "dest": {
    "index": "new_twitter"
  }
}

You can choose which fields of the source documents to copy:

POST _reindex
{
  "source": {
    "index": "twitter",
    "_source": ["user", "_doc"]
  },
  "dest": {
    "index": "new_twitter"
  }
}

A script can modify the documents:

POST _reindex
{
  "source": {
    "index": "twitter"
  },
  "dest": {
    "index": "new_twitter",
    "version_type": "external"
  },
  "script": {
    "source": "if (ctx._source.foo == 'bar') {ctx._version++; ctx._source.remove('foo')}",
    "lang": "painless"
  }
}

A routing value can be specified to control which shard the documents are placed on:

POST _reindex
{
  "source": {
    "index": "source",
    "query": {
      "match": {
        "company": "cat"
      }
    }
  },
  "dest": {
    "index": "dest",
    "routing": "=cat"
  }
}

Reindexing from a remote cluster:

POST _reindex
{
  "source": {
    "remote": {
      "host": "http://otherhost:9200",
      "username": "user",
      "password": "pass"
    },
    "index": "source",
    "query": {
      "match": {
        "test": "data"
      }
    }
  },
  "dest": {
    "index": "dest"
  }
}

Use the task API to check the execution status:

GET _tasks?detailed=true&actions=*reindex

10. refresh

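For index, update, and delete operations, add the refresh parameter if the change must be visible to search immediately after the operation: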

PUT /test/_doc/1?refresh
{"test": "test"}
PUT /test/_doc/2?refresh=true
{"test": "test"}

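Possible values of refresh:

Empty or true: the affected shards are refreshed immediately, so the change is visible to search right away.
false: same as omitting the parameter; the change becomes visible at the next scheduled refresh.
wait_for: the request does not force a refresh; it waits until a refresh has made the change visible before responding. If index.max_refresh_listeners (default 1000) requests are already waiting for a refresh on the shard, the next such request forces a refresh as if refresh=true had been given.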

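III. Routing in detail

1. Cluster formation

The first node starts

Note: in a new cluster, the first node to start becomes the master node. The master node maintains the cluster's metadata.

Node2 starts

Notes:

Before Node2 starts, configure the cluster name (cluster.name: ess), the addresses of the master-eligible nodes (discovery.zen.ping.unicast.hosts: ["10.0.1.11", "10.0.1.12"]), and the node's own address (network.host: 10.0.1.12).

During startup, Node2 contacts the master node Node1 and asks to join the cluster. Node1 checks whether Node2 meets the conditions for joining; if so, it adds Node2's address to the cluster metadata, announces the new node to the other nodes in the cluster, and sends them the updated metadata.

Node3..NodeN join

Note: every node holds the same cluster metadata as the master, because whenever a new node joins, the master notifies the other nodes to synchronize their metadata.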

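2. Creating an index in a cluster

3. A cluster that holds an index

4. When a node in the cluster fails (for example, the master goes down), a new master is elected

5. Indexing a document in a cluster

Steps for indexing a document:
1. node2 computes the document's routing value to determine the target shard (assume routing selects shard 0).
2. node2 forwards the document to node1, the node that holds the primary of shard 0 (P0).
3. node1 indexes the document and replicates it to node3, which holds the replica (R0) and indexes it as well.
4. node1 reports the result to node2.
5. node2 responds to the client.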

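6. How documents are routed

Which shard should a document be stored on?
Document routing decides which shard a document is stored on. ES computes the shard for each document as follows: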

shard = hash(routing) % number_of_primary_shards

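Parameters:

routing: the value that is hashed; it defaults to the document ID. A different routing value can be supplied through the routing parameter when indexing a document.

number_of_primary_shards: the number of primary shards specified when the index was created.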
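The formula is easy to try out in Python. In this sketch zlib.crc32 is only a stand-in for the hash function ES uses internally (a Murmur3 hash), so the shard numbers it prints will not match a real cluster; shard_for is a hypothetical helper name.

```python
import zlib

def shard_for(routing, number_of_primary_shards):
    # Stand-in hash: any stable hash illustrates the idea. Python's built-in
    # hash() is avoided because it is randomized per process for strings.
    routing_hash = zlib.crc32(routing.encode("utf-8"))
    return routing_hash % number_of_primary_shards

for doc_id in ["1", "2", "3", "kimchy"]:
    print(doc_id, "-> shard", shard_for(doc_id, 5))
```

Because the result depends on number_of_primary_shards, changing that number after documents have been indexed would send the same routing value to a different shard, which is why the number of primary shards is fixed when the index is created.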

POST twitter/_doc?routing=kimchy
{
    "user" : "kimchy",
    "post_date" : "2009-11-15T14:12:12",
    "message" : "trying out Elasticsearch"
}

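The routing parameter can be used in index, delete, update, and search requests to target specific shards (a search accepts multiple comma-separated routing values).

To require a routing value for every document, declare it when creating the index: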

PUT my_index2
{
  "mappings": {
    "_doc": {
      "_routing": {
        "required": true 
      }
    }
  }
}

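7. Searching in a cluster

Search steps, assuming the index s0 is being searched:
1. node2 parses the query.
2. node2 sends the query to the nodes holding one copy (primary or replica) of each shard of s0, for example R1, R2, and R0.
3. Each node executes the query and returns its results to node2.
4. node2 merges the results and responds.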
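The merge in step 4 can be sketched as a scatter/gather over per-shard result lists. This Python example assumes each hit is a (score, doc_id) tuple and that every shard returns its own top hits sorted by descending score; the function names are hypothetical, and real searches involve separate query and fetch phases that are not modeled here.

```python
import heapq
from itertools import islice

def shard_top_hits(hits, size):
    # What each shard does locally: return its best `size` hits.
    return sorted(hits, key=lambda hit: hit[0], reverse=True)[:size]

def coordinate_search(shards, size):
    # What the coordinating node does: merge the sorted per-shard lists
    # and keep only the global top `size`.
    per_shard = [shard_top_hits(hits, size) for hits in shards]
    merged = heapq.merge(*per_shard, key=lambda hit: hit[0], reverse=True)
    return list(islice(merged, size))

shards = [
    [(0.9, "a"), (0.2, "b")],   # hits on shard 0
    [(0.7, "c")],               # hits on shard 1
    [(0.95, "d"), (0.1, "e")],  # hits on shard 2
]
print(coordinate_search(shards, 2))  # [(0.95, 'd'), (0.9, 'a')]
```

Each shard must return up to size hits even though most are discarded, because the coordinating node cannot know in advance which shard holds the best matches.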

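8. What does the master node do?

1. Maintains the cluster metadata, such as the cluster name and the nodes in the cluster.

2. Handles cluster-level operations such as creating an index. Document indexing requests, in contrast, are coordinated by whichever node receives them, as node2 does in section 5.

3. Communicates with the other nodes, for example to tell them that a new node has joined.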