Elasticsearch 7.x notes [updated through the analyzer sections]

Preface:

It is now 2019-10-11. Work has been busy lately, so there has been little time for extra study. Now that work is done for the day, I can get back to learning. So happy!

  1. index vs. create: index is a bit more powerful than create, which is why it is so widely used. If the document does not exist, it indexes a new document; if the document already exists, the existing document is deleted, the new document is indexed, and the version number is incremented by 1. This is different from update.
  2. index vs. update: unlike index, update does not delete the original document; it performs a real partial update. When using update, the request body must wrap the fields to change in a doc object, for example:
    POST user/_update/1
    {
        "doc":{
            "name":"xxx"
        }
    }
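
    Relatedly, if you need update to also create the document when it does not exist yet, ES supports upserts via the doc_as_upsert option. A minimal sketch (reusing the hypothetical user index from above):

    POST user/_update/1
    {
        "doc":{
            "name":"xxx"
        },
        "doc_as_upsert": true
    }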

    Below are some simple CRUD operations:

    POST user/_doc
    {
      "user":"Mike",
      "post_date" : "2019-10-11 17:44:00",
      "message" : "trying out kibana"
    }
    
    PUT user/_doc/1     // this defaults to the index op_type
    {
      "user":"Mike",
      "post_date" : "2019-10-11 17:44:00",
      "message" : "trying out kibana"
    }
    
    GET user/_doc/1
    
    // explicitly request create; doc 1 was already created above, so this request fails
    PUT user/_doc/1?op_type=create   
    {
       "user":"Mike",
      "post_date" : "2019-10-11 17:44:00",
      "message" : "trying out kibana"
    }


2019-10-12

  1. Forward index: for example, the relationship between a book's table of contents and its chapters: given a page number, you know which chapter you are in. In a search engine this corresponds to the mapping from document ID to document content and its terms.
  2. Inverted index: for example, the word index at the back of a book: given a word, you know which pages it appears on. In a search engine this is the mapping from terms to document IDs.

          (Figure: forward index on the left, inverted index on the right.)

  1. An inverted index has two parts (see the _termvectors sketch after this list):
    1. Term dictionary: records every term in the documents and the mapping from each term to its postings list. (The term dictionary is usually large; a B+ tree or hash chaining can be used for high-performance inserts and lookups.)
    2. Postings list: records the set of documents containing each term, and is made up of postings. Each posting contains:
      1. Document ID
      2. Term frequency (TF): how many times the term appears in the document, used for relevance scoring.
      3. Position: where the term appears in the document, used for phrase queries.
      4. Offset: the start and end character offsets of the term, used for highlighting.
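
    You can inspect these posting fields on a live document with the _termvectors API. A minimal sketch, assuming the user/_doc/1 document indexed earlier (index name and field are simply reused from that example):

    GET user/_termvectors/1
    {
      "fields": ["message"],
      "positions": true,
      "offsets": true,
      "term_statistics": true
    }

    The response lists every term of the message field together with its term_freq, position, and start_offset/end_offset, i.e. the posting fields described above.
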
  2. Analyzer: standard
    GET _analyze
    {
      "analyzer": "standard",
      "text": "i am PHPerJiang"
    }

    The default analyzer in ES is standard; the analysis result is below:

    {
      "tokens" : [
        {
          "token" : "i",
          "start_offset" : 0,
          "end_offset" : 1,
          "type" : "<ALPHANUM>",
          "position" : 0
        },
        {
          "token" : "am",
          "start_offset" : 2,
          "end_offset" : 4,
          "type" : "<ALPHANUM>",
          "position" : 1
        },
        {
          "token" : "phperjiang",
          "start_offset" : 5,
          "end_offset" : 15,
          "type" : "<ALPHANUM>",
          "position" : 2
        }
      ]
    }

    The standard analyzer strips punctuation, converts uppercase to lowercase, and splits on word boundaries (for English text this is essentially whitespace).
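
    The standard analyzer also takes a few parameters, e.g. a stop word list. A minimal sketch of a configured variant (standard_demo and my_standard are hypothetical names):

    PUT standard_demo
    {
      "settings": {
        "analysis": {
          "analyzer": {
            "my_standard": {
              "type": "standard",
              "stopwords": "_english_"
            }
          }
        }
      }
    }

    GET standard_demo/_analyze
    {
      "analyzer": "my_standard",
      "text": "i am PHPerJiang"
    }

    With English stop words enabled, "i" and "am" are dropped and only "phperjiang" remains.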

  3. Analyzer: whitespace

    GET _analyze
    {
      "analyzer": "whitespace",
      "text": "33 i am PHPer-jiang,i am so good。"
    }

    Analysis result:

    {
      "tokens" : [
        {
          "token" : "33",
          "start_offset" : 0,
          "end_offset" : 2,
          "type" : "word",
          "position" : 0
        },
        {
          "token" : "i",
          "start_offset" : 3,
          "end_offset" : 4,
          "type" : "word",
          "position" : 1
        },
        {
          "token" : "am",
          "start_offset" : 5,
          "end_offset" : 7,
          "type" : "word",
          "position" : 2
        },
        {
          "token" : "PHPer-jiang,i",
          "start_offset" : 8,
          "end_offset" : 21,
          "type" : "word",
          "position" : 3
        },
        {
          "token" : "am",
          "start_offset" : 22,
          "end_offset" : 24,
          "type" : "word",
          "position" : 4
        },
        {
          "token" : "so",
          "start_offset" : 25,
          "end_offset" : 27,
          "type" : "word",
          "position" : 5
        },
        {
          "token" : "good。",
          "start_offset" : 28,
          "end_offset" : 33,
          "type" : "word",
          "position" : 6
        }
      ]
    }
    

    The whitespace analyzer splits only on whitespace; it keeps punctuation and, as the PHPer-jiang,i token shows, does not lowercase.

  4. Analyzer: stop

    GET _analyze
    {
      "analyzer": "stop",
      "text": "33 i am PHPer-jiang,i am so good。the history is new history"
    }

    Analysis result:

    {
      "tokens" : [
        {
          "token" : "i",
          "start_offset" : 3,
          "end_offset" : 4,
          "type" : "word",
          "position" : 0
        },
        {
          "token" : "am",
          "start_offset" : 5,
          "end_offset" : 7,
          "type" : "word",
          "position" : 1
        },
        {
          "token" : "phper",
          "start_offset" : 8,
          "end_offset" : 13,
          "type" : "word",
          "position" : 2
        },
        {
          "token" : "jiang",
          "start_offset" : 14,
          "end_offset" : 19,
          "type" : "word",
          "position" : 3
        },
        {
          "token" : "i",
          "start_offset" : 20,
          "end_offset" : 21,
          "type" : "word",
          "position" : 4
        },
        {
          "token" : "am",
          "start_offset" : 22,
          "end_offset" : 24,
          "type" : "word",
          "position" : 5
        },
        {
          "token" : "so",
          "start_offset" : 25,
          "end_offset" : 27,
          "type" : "word",
          "position" : 6
        },
        {
          "token" : "good",
          "start_offset" : 28,
          "end_offset" : 32,
          "type" : "word",
          "position" : 7
        },
        {
          "token" : "history",
          "start_offset" : 37,
          "end_offset" : 44,
          "type" : "word",
          "position" : 9
        },
        {
          "token" : "new",
          "start_offset" : 48,
          "end_offset" : 51,
          "type" : "word",
          "position" : 11
        },
        {
          "token" : "history",
          "start_offset" : 52,
          "end_offset" : 59,
          "type" : "word",
          "position" : 12
        }
      ]
    }
    

    Compared with standard, the stop analyzer additionally filters out stop words such as the, is, and in, and it drops punctuation and digits before tokenizing; like standard, it lowercases everything.
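
    The stop word list is configurable. A minimal sketch of a stop analyzer with a custom list (stop_demo and my_stop are hypothetical names):

    PUT stop_demo
    {
      "settings": {
        "analysis": {
          "analyzer": {
            "my_stop": {
              "type": "stop",
              "stopwords": ["the", "is", "so"]
            }
          }
        }
      }
    }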

  5. Analyzer: keyword

    GET _analyze
    {
      "analyzer": "keyword",
      "text": "33 i am PHPer-jiang,i am so good。the history is new history"
    }

    Analysis result:

    {
      "tokens" : [
        {
          "token" : "33 i am PHPer-jiang,i am so good。the history is new history",
          "start_offset" : 0,
          "end_offset" : 59,
          "type" : "word",
          "position" : 0
        }
      ]
    }
    

    The keyword analyzer does not actually tokenize at all; the whole text is emitted as a single token.
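
    This is useful for values that should only match as a whole, such as tags or status codes. A minimal mapping sketch (keyword_demo and tag are hypothetical names); in practice the dedicated keyword field type is usually preferred over a text field with the keyword analyzer:

    PUT keyword_demo
    {
      "mappings": {
        "properties": {
          "tag": {
            "type": "text",
            "analyzer": "keyword"
          }
        }
      }
    }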

  6. Analyzer: pattern

    GET _analyze
    {
      "analyzer": "pattern",
      "text": "33 i am PHPer-jiang,i am so good。the history is new history % hahah"
    }

    Analysis result:

    {
      "tokens" : [
        {
          "token" : "33",
          "start_offset" : 0,
          "end_offset" : 2,
          "type" : "word",
          "position" : 0
        },
        {
          "token" : "i",
          "start_offset" : 3,
          "end_offset" : 4,
          "type" : "word",
          "position" : 1
        },
        {
          "token" : "am",
          "start_offset" : 5,
          "end_offset" : 7,
          "type" : "word",
          "position" : 2
        },
        {
          "token" : "phper",
          "start_offset" : 8,
          "end_offset" : 13,
          "type" : "word",
          "position" : 3
        },
        {
          "token" : "jiang",
          "start_offset" : 14,
          "end_offset" : 19,
          "type" : "word",
          "position" : 4
        },
        {
          "token" : "i",
          "start_offset" : 20,
          "end_offset" : 21,
          "type" : "word",
          "position" : 5
        },
        {
          "token" : "am",
          "start_offset" : 22,
          "end_offset" : 24,
          "type" : "word",
          "position" : 6
        },
        {
          "token" : "so",
          "start_offset" : 25,
          "end_offset" : 27,
          "type" : "word",
          "position" : 7
        },
        {
          "token" : "good",
          "start_offset" : 28,
          "end_offset" : 32,
          "type" : "word",
          "position" : 8
        },
        {
          "token" : "the",
          "start_offset" : 33,
          "end_offset" : 36,
          "type" : "word",
          "position" : 9
        },
        {
          "token" : "history",
          "start_offset" : 37,
          "end_offset" : 44,
          "type" : "word",
          "position" : 10
        },
        {
          "token" : "is",
          "start_offset" : 45,
          "end_offset" : 47,
          "type" : "word",
          "position" : 11
        },
        {
          "token" : "new",
          "start_offset" : 48,
          "end_offset" : 51,
          "type" : "word",
          "position" : 12
        },
        {
          "token" : "history",
          "start_offset" : 52,
          "end_offset" : 59,
          "type" : "word",
          "position" : 13
        },
        {
          "token" : "hahah",
          "start_offset" : 62,
          "end_offset" : 67,
          "type" : "word",
          "position" : 14
        }
      ]
    }
    

    The pattern analyzer tokenizes with a regular expression, \W+ by default, i.e. it splits on every non-word character (anything other than letters, digits, and underscore). In the example above, %, spaces, commas, and the full stop are all non-word characters, so the text is split at each of them (and the lone % produces no token).
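
    The regular expression is configurable. A minimal sketch of a pattern analyzer that splits on commas only (pattern_demo and comma_analyzer are hypothetical names; note that the pattern analyzer lowercases by default):

    PUT pattern_demo
    {
      "settings": {
        "analysis": {
          "analyzer": {
            "comma_analyzer": {
              "type": "pattern",
              "pattern": ","
            }
          }
        }
      }
    }

    GET pattern_demo/_analyze
    {
      "analyzer": "comma_analyzer",
      "text": "a,b,c d"
    }

    This yields the tokens a, b, and "c d".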

  7. Analyzer: analysis-icu

    GET _analyze
    {
      "analyzer": "icu_analyzer",
      "text": "八百標兵奔北坡"
    }

    Analysis result:

    {
      "tokens" : [
        {
          "token" : "八百",
          "start_offset" : 0,
          "end_offset" : 2,
          "type" : "<IDEOGRAPHIC>",
          "position" : 0
        },
        {
          "token" : "標兵",
          "start_offset" : 2,
          "end_offset" : 4,
          "type" : "<IDEOGRAPHIC>",
          "position" : 1
        },
        {
          "token" : "奔北",
          "start_offset" : 4,
          "end_offset" : 6,
          "type" : "<IDEOGRAPHIC>",
          "position" : 2
        },
        {
          "token" : "坡",
          "start_offset" : 6,
          "end_offset" : 7,
          "type" : "<IDEOGRAPHIC>",
          "position" : 3
        }
      ]
    }
    

    The ICU analyzer segments Chinese text word by word. For Chinese, two dedicated analyzers are recommended: ik and thulac.
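
    Note that icu_analyzer is not bundled with ES: it comes from the analysis-icu plugin, which must be installed on every node (followed by a restart):

    elasticsearch-plugin install analysis-icu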

  8. Analyzer: ik

    1. Installation:

      1. From the ES bin directory, list the plugins that are already installed:

        elasticsearch-plugin list

      2. If the analysis-ik plugin is not installed, install the release that matches (or exceeds) your ES version; installing a lower version will fail:

        elasticsearch-plugin install https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v7.4.0/elasticsearch-analysis-ik-7.4.0.zip


    2. Usage

      1. ik_smart

        GET _analyze
        {
          "analyzer": "ik_smart",
          "text": "中華人民共和國"
        }

        Analysis result:

        {
          "tokens" : [
            {
              "token" : "中華人民共和國",
              "start_offset" : 0,
              "end_offset" : 7,
              "type" : "CN_WORD",
              "position" : 0
            }
          ]
        }
        

        ik_smart splits at the coarsest granularity: 中華人民共和國 stays as the single token 中華人民共和國, which suits phrase queries.

      2. ik_max_word

        GET _analyze
        {
          "analyzer": "ik_max_word",
          "text": "中華人民共和國"
        }

        Analysis result:

        {
          "tokens" : [
            {
              "token" : "中華人民共和國",
              "start_offset" : 0,
              "end_offset" : 7,
              "type" : "CN_WORD",
              "position" : 0
            },
            {
              "token" : "中華人民",
              "start_offset" : 0,
              "end_offset" : 4,
              "type" : "CN_WORD",
              "position" : 1
            },
            {
              "token" : "中華",
              "start_offset" : 0,
              "end_offset" : 2,
              "type" : "CN_WORD",
              "position" : 2
            },
            {
              "token" : "華人",
              "start_offset" : 1,
              "end_offset" : 3,
              "type" : "CN_WORD",
              "position" : 3
            },
            {
              "token" : "人民共和國",
              "start_offset" : 2,
              "end_offset" : 7,
              "type" : "CN_WORD",
              "position" : 4
            },
            {
              "token" : "人民",
              "start_offset" : 2,
              "end_offset" : 4,
              "type" : "CN_WORD",
              "position" : 5
            },
            {
              "token" : "共和國",
              "start_offset" : 4,
              "end_offset" : 7,
              "type" : "CN_WORD",
              "position" : 6
            },
            {
              "token" : "共和",
              "start_offset" : 4,
              "end_offset" : 6,
              "type" : "CN_WORD",
              "position" : 7
            },
            {
              "token" : "國",
              "start_offset" : 6,
              "end_offset" : 7,
              "type" : "CN_CHAR",
              "position" : 8
            }
          ]
        }
        

        ik_max_word splits at the finest granularity, emitting every plausible sub-word, which suits term queries.
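
        A common pattern is to combine the two: index with ik_max_word for maximum recall, and analyze queries with ik_smart so search terms stay coarse. A minimal mapping sketch (ik_demo and content are hypothetical names):

        PUT ik_demo
        {
          "mappings": {
            "properties": {
              "content": {
                "type": "text",
                "analyzer": "ik_max_word",
                "search_analyzer": "ik_smart"
              }
            }
          }
        }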

Updated 2019-10-23

  1. Bulk operations
    POST _bulk
    {"index":{"_index":"user","_id":1}}
    {"name":"PHPer"}
    {"create":{"_index":"user1","_id":1}}
    {"name":"Gopher"}
    {"update":{"_index":"user1","_id":1}}
    {"doc":{"name":"PHPer"}}
    {"delete":{"_index":"user1","_id":1}}

    index: creates the document; if it already exists, the existing one is deleted, the new one is saved, and the version number is incremented. create, by contrast, fails if the ID already exists. update must wrap the fields to change in a doc object. The response is below:

    {
      "took" : 21,
      "errors" : false,
      "items" : [
        {
          "index" : {
            "_index" : "user",
            "_type" : "_doc",
            "_id" : "1",
            "_version" : 26,
            "result" : "updated",
            "_shards" : {
              "total" : 2,
              "successful" : 1,
              "failed" : 0
            },
            "_seq_no" : 25,
            "_primary_term" : 1,
            "status" : 200
          }
        },
        {
          "create" : {
            "_index" : "user1",
            "_type" : "_doc",
            "_id" : "1",
            "_version" : 1,
            "result" : "created",
            "_shards" : {
              "total" : 2,
              "successful" : 1,
              "failed" : 0
            },
            "_seq_no" : 12,
            "_primary_term" : 1,
            "status" : 201
          }
        },
        {
          "update" : {
            "_index" : "user1",
            "_type" : "_doc",
            "_id" : "1",
            "_version" : 2,
            "result" : "updated",
            "_shards" : {
              "total" : 2,
              "successful" : 1,
              "failed" : 0
            },
            "_seq_no" : 13,
            "_primary_term" : 1,
            "status" : 200
          }
        },
        {
          "delete" : {
            "_index" : "user1",
            "_type" : "_doc",
            "_id" : "1",
            "_version" : 3,
            "result" : "deleted",
            "_shards" : {
              "total" : 2,
              "successful" : 1,
              "failed" : 0
            },
            "_seq_no" : 14,
            "_primary_term" : 1,
            "status" : 200
          }
        }
      ]
    }
    

    As you can see, bulk returns one result per action, and a failed action does not abort the others. If you run the request above repeatedly, the create action keeps failing because the document already exists in ES, while the index, update, and delete actions all succeed and bump the version number.
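
    When sending bulk requests outside Kibana, keep in mind that _bulk takes newline-delimited JSON: every action line and source line, including the last one, must end with a newline, and the Content-Type must be application/x-ndjson. A minimal sketch with curl (ops.ndjson is a hypothetical file containing the lines above):

    curl -s -H "Content-Type: application/x-ndjson" \
      -XPOST "localhost:9200/_bulk" \
      --data-binary "@ops.ndjson"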
