Preface:
It is 2019-10-11. Work has been busy lately, leaving little time for self-study; now that things have calmed down I can get back to learning. Happy days!
- The difference between index and create: index is a bit more capable than create, which is why it is so widely used. If the document does not exist, a new document is indexed; if the document already exists, the existing document is deleted, the new document is indexed, and the version number is incremented. This is different from update.
- The difference between index and update: update does not delete the original document the way index does; it performs a genuine partial update. When using update, the request body must wrap the changes in a doc field, for example:
```
POST user/_update/1
{
  "doc": {
    "name": "xxx"
  }
}
```
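The difference can be sketched with a toy in-memory store (illustration only: the names `index_doc` and `update_doc` are made up, and this is not how Elasticsearch is actually implemented):

```python
# Toy in-memory sketch of index vs update semantics (not real ES internals).
store = {}  # doc_id -> {"_version": int, "_source": dict}

def index_doc(doc_id, source):
    """index: replace the whole document, bumping the version."""
    version = store[doc_id]["_version"] + 1 if doc_id in store else 1
    store[doc_id] = {"_version": version, "_source": source}
    return store[doc_id]

def update_doc(doc_id, doc):
    """update: merge the partial 'doc' into the existing source."""
    entry = store[doc_id]
    entry["_source"].update(doc)  # fields not mentioned in 'doc' are kept
    entry["_version"] += 1
    return entry

index_doc(1, {"name": "Mike", "age": 20})
index_doc(1, {"name": "xxx"})   # full replace: "age" is gone, version -> 2
update_doc(1, {"age": 30})      # partial update: "name" survives, version -> 3
```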
Below are some simple CRUD operations:
```
POST user/_doc
{
  "user": "Mike",
  "post_date": "2019-10-11 17:44:00",
  "message": "trying out kibana"
}

PUT user/_doc/1     // defaults to the index op type
{
  "user": "Mike",
  "post_date": "2019-10-11 17:44:00",
  "message": "trying out kibana"
}

GET user/_doc/1

// explicitly specify the op type: id 1 was already created above,
// so using create again returns an error
PUT user/_doc/1?op_type=create
{
  "user": "Mike",
  "post_date": "2019-10-11 17:44:00",
  "message": "trying out kibana"
}
```
2019-10-12
- Forward index: for example, the relationship between a book's chapters and its table of contents — from the page number you know which chapter you are in. In a search engine this corresponds to the mapping from document id to the document's content and its terms.
- Inverted index: for example, the pages a given word appears on — from the word, you can find the pages. In a search engine this is the mapping from each term to the ids of the documents containing it.
In the figure above, the left side is the forward index and the right side is the inverted index.
- An inverted index has two parts:
  - Term dictionary: records all the terms in the documents, and the mapping from each term to its postings list. (The term dictionary is usually large; a B+ tree or a hash table with chaining can provide fast inserts and lookups.)
  - Postings list: records the set of documents containing each term, and is made up of postings. Each posting contains:
    - Document id
    - Term frequency (TF): the number of times the term appears in the document, used for relevance scoring.
    - Position: where the term appears in the document, used for phrase queries.
    - Offset: the start and end character offsets of the term, used for highlighting.
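The postings structure above can be sketched in a few lines of Python (a simplified illustration: tokens come from plain whitespace splitting, and character offsets are omitted):

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to its postings: doc id -> term frequency and positions."""
    index = defaultdict(dict)  # term -> {doc_id: {"tf": int, "positions": [int]}}
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            posting = index[term].setdefault(doc_id, {"tf": 0, "positions": []})
            posting["tf"] += 1
            posting["positions"].append(pos)
    return index

docs = {
    1: "he likes search",
    2: "search engines use an inverted index to search",
}
idx = build_inverted_index(docs)
# idx["search"] now records that doc 2 contains "search" twice,
# at token positions 0 and 7, and doc 1 once at position 2.
```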
- Analyzer: standard
```
GET _analyze
{
  "analyzer": "standard",
  "text": "i am PHPerJiang"
}
```
standard is the default analyzer in ES; the result is:
```json
{
  "tokens" : [
    { "token" : "i", "start_offset" : 0, "end_offset" : 1, "type" : "<ALPHANUM>", "position" : 0 },
    { "token" : "am", "start_offset" : 2, "end_offset" : 4, "type" : "<ALPHANUM>", "position" : 1 },
    { "token" : "phperjiang", "start_offset" : 5, "end_offset" : 15, "type" : "<ALPHANUM>", "position" : 2 }
  ]
}
```
The standard analyzer strips symbols, converts uppercase to lowercase, and then splits the text on word boundaries (for plain English text, effectively on whitespace).
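That behavior can be approximated in a few lines (a rough sketch only; the real standard analyzer implements Unicode text segmentation, which is considerably more involved):

```python
import re

def standard_like(text):
    # Lowercase, then keep alphanumeric runs, dropping punctuation.
    return re.findall(r"[a-z0-9]+", text.lower())

print(standard_like("i am PHPerJiang"))  # ['i', 'am', 'phperjiang']
```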
- Analyzer: whitespace
```
GET _analyze
{
  "analyzer": "whitespace",
  "text": "33 i am PHPer-jiang,i am so good。"
}
```
Result:
```json
{
  "tokens" : [
    { "token" : "33", "start_offset" : 0, "end_offset" : 2, "type" : "word", "position" : 0 },
    { "token" : "i", "start_offset" : 3, "end_offset" : 4, "type" : "word", "position" : 1 },
    { "token" : "am", "start_offset" : 5, "end_offset" : 7, "type" : "word", "position" : 2 },
    { "token" : "PHPer-jiang,i", "start_offset" : 8, "end_offset" : 21, "type" : "word", "position" : 3 },
    { "token" : "am", "start_offset" : 22, "end_offset" : 24, "type" : "word", "position" : 4 },
    { "token" : "so", "start_offset" : 25, "end_offset" : 27, "type" : "word", "position" : 5 },
    { "token" : "good。", "start_offset" : 28, "end_offset" : 33, "type" : "word", "position" : 6 }
  ]
}
```
The whitespace analyzer splits only on whitespace; symbols are kept and case is preserved.
- Analyzer: stop
```
GET _analyze
{
  "analyzer": "stop",
  "text": "33 i am PHPer-jiang,i am so good。the history is new history"
}
```
Result:
```json
{
  "tokens" : [
    { "token" : "i", "start_offset" : 3, "end_offset" : 4, "type" : "word", "position" : 0 },
    { "token" : "am", "start_offset" : 5, "end_offset" : 7, "type" : "word", "position" : 1 },
    { "token" : "phper", "start_offset" : 8, "end_offset" : 13, "type" : "word", "position" : 2 },
    { "token" : "jiang", "start_offset" : 14, "end_offset" : 19, "type" : "word", "position" : 3 },
    { "token" : "i", "start_offset" : 20, "end_offset" : 21, "type" : "word", "position" : 4 },
    { "token" : "am", "start_offset" : 22, "end_offset" : 24, "type" : "word", "position" : 5 },
    { "token" : "so", "start_offset" : 25, "end_offset" : 27, "type" : "word", "position" : 6 },
    { "token" : "good", "start_offset" : 28, "end_offset" : 32, "type" : "word", "position" : 7 },
    { "token" : "history", "start_offset" : 37, "end_offset" : 44, "type" : "word", "position" : 9 },
    { "token" : "new", "start_offset" : 48, "end_offset" : 51, "type" : "word", "position" : 11 },
    { "token" : "history", "start_offset" : 52, "end_offset" : 59, "type" : "word", "position" : 12 }
  ]
}
```
Compared with standard, the stop analyzer filters out stop words such as the, is, and in, removes symbols and digits before tokenizing, and likewise lowercases.
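A rough sketch of the same behavior, assuming a small hand-picked stop-word set (the real analyzer uses Lucene's English stop-word list):

```python
import re

# Small illustrative subset of English stop words.
STOP_WORDS = {"the", "is", "in", "a", "an", "and", "of", "to"}

def stop_like(text):
    # Split on non-letters (digits and punctuation are dropped),
    # lowercase, then remove stop words.
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]
```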
- Analyzer: keyword
```
GET _analyze
{
  "analyzer": "keyword",
  "text": "33 i am PHPer-jiang,i am so good。the history is new history"
}
```
Result:
```json
{
  "tokens" : [
    { "token" : "33 i am PHPer-jiang,i am so good。the history is new history", "start_offset" : 0, "end_offset" : 59, "type" : "word", "position" : 0 }
  ]
}
```
The keyword analyzer does not actually tokenize at all; the whole text is emitted as a single token.
- Analyzer: pattern
```
GET _analyze
{
  "analyzer": "pattern",
  "text": "33 i am PHPer-jiang,i am so good。the history is new history % hahah"
}
```
The result:
```json
{
  "tokens" : [
    { "token" : "33", "start_offset" : 0, "end_offset" : 2, "type" : "word", "position" : 0 },
    { "token" : "i", "start_offset" : 3, "end_offset" : 4, "type" : "word", "position" : 1 },
    { "token" : "am", "start_offset" : 5, "end_offset" : 7, "type" : "word", "position" : 2 },
    { "token" : "phper", "start_offset" : 8, "end_offset" : 13, "type" : "word", "position" : 3 },
    { "token" : "jiang", "start_offset" : 14, "end_offset" : 19, "type" : "word", "position" : 4 },
    { "token" : "i", "start_offset" : 20, "end_offset" : 21, "type" : "word", "position" : 5 },
    { "token" : "am", "start_offset" : 22, "end_offset" : 24, "type" : "word", "position" : 6 },
    { "token" : "so", "start_offset" : 25, "end_offset" : 27, "type" : "word", "position" : 7 },
    { "token" : "good", "start_offset" : 28, "end_offset" : 32, "type" : "word", "position" : 8 },
    { "token" : "the", "start_offset" : 33, "end_offset" : 36, "type" : "word", "position" : 9 },
    { "token" : "history", "start_offset" : 37, "end_offset" : 44, "type" : "word", "position" : 10 },
    { "token" : "is", "start_offset" : 45, "end_offset" : 47, "type" : "word", "position" : 11 },
    { "token" : "new", "start_offset" : 48, "end_offset" : 51, "type" : "word", "position" : 12 },
    { "token" : "history", "start_offset" : 52, "end_offset" : 59, "type" : "word", "position" : 13 },
    { "token" : "hahah", "start_offset" : 62, "end_offset" : 67, "type" : "word", "position" : 14 }
  ]
}
```
The pattern analyzer tokenizes with a regular expression, \W+ by default: it splits on runs of non-word characters (anything other than letters, digits, and underscore) and lowercases. Above, the %, the spaces, the comma, and the full-width period are all non-word characters, so the text is split at each of them; note that the digits in 33 are word characters and survive as a token.
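The default \W+ behavior is easy to reproduce with Python's `re` module (a sketch of the default configuration only; the real analyzer's pattern, lowercasing, and stop words are all configurable):

```python
import re

def pattern_like(text):
    # Split on runs of non-word characters after lowercasing;
    # filter out the empty strings re.split leaves at the edges.
    return [t for t in re.split(r"\W+", text.lower()) if t]
```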
- Analyzer: analysis-icu
```
GET _analyze
{
  "analyzer": "icu_analyzer",
  "text": "八百標兵奔北坡"
}
```
Result:
```json
{
  "tokens" : [
    { "token" : "八百", "start_offset" : 0, "end_offset" : 2, "type" : "<IDEOGRAPHIC>", "position" : 0 },
    { "token" : "標兵", "start_offset" : 2, "end_offset" : 4, "type" : "<IDEOGRAPHIC>", "position" : 1 },
    { "token" : "奔北", "start_offset" : 4, "end_offset" : 6, "type" : "<IDEOGRAPHIC>", "position" : 2 },
    { "token" : "坡", "start_offset" : 6, "end_offset" : 7, "type" : "<IDEOGRAPHIC>", "position" : 3 }
  ]
}
```
The ICU analyzer (from the analysis-icu plugin) performs dictionary-based word segmentation for Chinese. For Chinese text, two dedicated analyzers are recommended: ik and THULAC.
- Analyzer: ik
  - Installation:
    - In the ES bin directory, run `elasticsearch-plugin list` to see which plugins are already installed.
    - If the analysis-ik plugin is missing, install a version equal to or higher than your ES version (installing a lower version fails with an error):
      `elasticsearch-plugin install https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v7.4.0/elasticsearch-analysis-ik-7.4.0.zip`
  - Usage:
    - ik_smart
```
GET _analyze
{
  "analyzer": "ik_smart",
  "text": "中華人民共和國"
}
```
Result:
```json
{
  "tokens" : [
    { "token" : "中華人民共和國", "start_offset" : 0, "end_offset" : 7, "type" : "CN_WORD", "position" : 0 }
  ]
}
```
ik_smart splits at the coarsest granularity: 中華人民共和國 stays as the single token 中華人民共和國, which suits phrase queries.
    - ik_max_word
```
GET _analyze
{
  "analyzer": "ik_max_word",
  "text": "中華人民共和國"
}
```
Result:
```json
{
  "tokens" : [
    { "token" : "中華人民共和國", "start_offset" : 0, "end_offset" : 7, "type" : "CN_WORD", "position" : 0 },
    { "token" : "中華人民", "start_offset" : 0, "end_offset" : 4, "type" : "CN_WORD", "position" : 1 },
    { "token" : "中華", "start_offset" : 0, "end_offset" : 2, "type" : "CN_WORD", "position" : 2 },
    { "token" : "華人", "start_offset" : 1, "end_offset" : 3, "type" : "CN_WORD", "position" : 3 },
    { "token" : "人民共和國", "start_offset" : 2, "end_offset" : 7, "type" : "CN_WORD", "position" : 4 },
    { "token" : "人民", "start_offset" : 2, "end_offset" : 4, "type" : "CN_WORD", "position" : 5 },
    { "token" : "共和國", "start_offset" : 4, "end_offset" : 7, "type" : "CN_WORD", "position" : 6 },
    { "token" : "共和", "start_offset" : 4, "end_offset" : 6, "type" : "CN_WORD", "position" : 7 },
    { "token" : "國", "start_offset" : 6, "end_offset" : 7, "type" : "CN_CHAR", "position" : 8 }
  ]
}
```
ik_max_word splits at the finest granularity, which suits term queries.
Updated 2019-10-23
- bulk: batch operations
```
POST _bulk
{"index":{"_index":"user","_id":1}}
{"name":"PHPer"}
{"create":{"_index":"user1","_id":1}}
{"name":"Gopher"}
{"update":{"_index":"user1","_id":1}}
{"doc":{"name":"PHPer"}}
{"delete":{"_index":"user1","_id":1}}
```
index: creates the document; if it already exists, the existing one is deleted, the new one is saved, and the version number is incremented. create, by contrast, returns an error if the id already exists. update must put the changed fields inside a doc object. The response is:
```json
{
  "took" : 21,
  "errors" : false,
  "items" : [
    { "index" : { "_index" : "user", "_type" : "_doc", "_id" : "1", "_version" : 26, "result" : "updated", "_shards" : { "total" : 2, "successful" : 1, "failed" : 0 }, "_seq_no" : 25, "_primary_term" : 1, "status" : 200 } },
    { "create" : { "_index" : "user1", "_type" : "_doc", "_id" : "1", "_version" : 1, "result" : "created", "_shards" : { "total" : 2, "successful" : 1, "failed" : 0 }, "_seq_no" : 12, "_primary_term" : 1, "status" : 201 } },
    { "update" : { "_index" : "user1", "_type" : "_doc", "_id" : "1", "_version" : 2, "result" : "updated", "_shards" : { "total" : 2, "successful" : 1, "failed" : 0 }, "_seq_no" : 13, "_primary_term" : 1, "status" : 200 } },
    { "delete" : { "_index" : "user1", "_type" : "_doc", "_id" : "1", "_version" : 3, "result" : "deleted", "_shards" : { "total" : 2, "successful" : 1, "failed" : 0 }, "_seq_no" : 14, "_primary_term" : 1, "status" : 200 } }
  ]
}
```
As you can see, every action in a bulk request produces its own result. Running the request repeatedly, you will find the create action failing, because the document it targets already exists in ES, while the index, update, and delete actions keep succeeding with their version numbers incremented.
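The bulk body is newline-delimited JSON (NDJSON): each action metadata line is followed, except for delete, by a source/doc line, and the whole body must end with a newline. A small sketch of building such a payload by hand (the `bulk_body` helper is made up for illustration; actually sending the payload requires a running cluster):

```python
import json

def bulk_body(actions):
    """Build an NDJSON _bulk payload from (action, metadata, body) triples."""
    lines = []
    for action, meta, body in actions:
        lines.append(json.dumps({action: meta}))
        if body is not None:  # delete has no body line
            lines.append(json.dumps(body))
    return "\n".join(lines) + "\n"  # _bulk requires a trailing newline

body = bulk_body([
    ("index",  {"_index": "user",  "_id": 1}, {"name": "PHPer"}),
    ("create", {"_index": "user1", "_id": 1}, {"name": "Gopher"}),
    ("update", {"_index": "user1", "_id": 1}, {"doc": {"name": "PHPer"}}),
    ("delete", {"_index": "user1", "_id": 1}, None),
])
```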