Elasticsearch 7.x notes [updated through the analyzer sections]

Preface:

It is now 2019-10-11. Work has been busy lately, so there has been little time for extra study. Now that work is done for the day, I can get back to learning. So happy!

  1. index vs. create: index is a bit more powerful than create, which is why it is so widely used. If the document does not exist, it indexes a new document; if the document already exists, the existing document is deleted, the new document is indexed, and the version number is incremented by 1. This is different from update.
  2. index vs. update: unlike index, update does not delete the original document; it performs a real partial update. When using update, the request body must wrap the fields to change in a doc object, for example:
    POST user/_update/1
    {
        "doc":{
            "name":"xxx"
        }
    }
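
    Relatedly, if you need update to also create the document when it does not exist yet, ES supports upserts via the doc_as_upsert option. A minimal sketch (reusing the hypothetical user index from above):

    POST user/_update/1
    {
        "doc":{
            "name":"xxx"
        },
        "doc_as_upsert": true
    }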

    Below are some simple CRUD operations:

    POST user/_doc
    {
      "user":"Mike",
      "post_date" : "2019-10-11 17:44:00",
      "message" : "trying out kibana"
    }
    
    PUT user/_doc/1     // this defaults to the index op_type
    {
      "user":"Mike",
      "post_date" : "2019-10-11 17:44:00",
      "message" : "trying out kibana"
    }
    
    GET user/_doc/1
    
    // explicitly request create; doc 1 was already created above, so this request fails
    PUT user/_doc/1?op_type=create   
    {
       "user":"Mike",
      "post_date" : "2019-10-11 17:44:00",
      "message" : "trying out kibana"
    }


2019-10-12

  1. Forward index: for example, the relationship between a book's table of contents and its chapters: given a page number, you know which chapter you are in. In a search engine this corresponds to the mapping from document ID to document content and its terms.
  2. Inverted index: for example, the word index at the back of a book: given a word, you know which pages it appears on. In a search engine this is the mapping from terms to document IDs.

          (Figure: forward index on the left, inverted index on the right.)

  1. An inverted index has two parts (see the _termvectors sketch after this list):
    1. Term dictionary: records every term in the documents and the mapping from each term to its postings list. (The term dictionary is usually large; a B+ tree or hash chaining can be used for high-performance inserts and lookups.)
    2. Postings list: records the set of documents containing each term, and is made up of postings. Each posting contains:
      1. Document ID
      2. Term frequency (TF): how many times the term appears in the document, used for relevance scoring.
      3. Position: where the term appears in the document, used for phrase queries.
      4. Offset: the start and end character offsets of the term, used for highlighting.
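
    You can inspect these posting fields on a live document with the _termvectors API. A minimal sketch, assuming the user/_doc/1 document indexed earlier (index name and field are simply reused from that example):

    GET user/_termvectors/1
    {
      "fields": ["message"],
      "positions": true,
      "offsets": true,
      "term_statistics": true
    }

    The response lists every term of the message field together with its term_freq, position, and start_offset/end_offset, i.e. the posting fields described above.
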
  2. Analyzer: standard
    GET _analyze
    {
      "analyzer": "standard",
      "text": "i am PHPerJiang"
    }

    The default analyzer in ES is standard; the analysis result is below:

    {
      "tokens" : [
        {
          "token" : "i",
          "start_offset" : 0,
          "end_offset" : 1,
          "type" : "<ALPHANUM>",
          "position" : 0
        },
        {
          "token" : "am",
          "start_offset" : 2,
          "end_offset" : 4,
          "type" : "<ALPHANUM>",
          "position" : 1
        },
        {
          "token" : "phperjiang",
          "start_offset" : 5,
          "end_offset" : 15,
          "type" : "<ALPHANUM>",
          "position" : 2
        }
      ]
    }

    The standard analyzer strips punctuation, converts uppercase to lowercase, and splits on word boundaries (for English text this is essentially whitespace).
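
    The standard analyzer also takes a few parameters, e.g. a stop word list. A minimal sketch of a configured variant (standard_demo and my_standard are hypothetical names):

    PUT standard_demo
    {
      "settings": {
        "analysis": {
          "analyzer": {
            "my_standard": {
              "type": "standard",
              "stopwords": "_english_"
            }
          }
        }
      }
    }

    GET standard_demo/_analyze
    {
      "analyzer": "my_standard",
      "text": "i am PHPerJiang"
    }

    With English stop words enabled, "i" and "am" are dropped and only "phperjiang" remains.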

  3. Analyzer: whitespace

    GET _analyze
    {
      "analyzer": "whitespace",
      "text": "33 i am PHPer-jiang,i am so good。"
    }

    Analysis result:

    {
      "tokens" : [
        {
          "token" : "33",
          "start_offset" : 0,
          "end_offset" : 2,
          "type" : "word",
          "position" : 0
        },
        {
          "token" : "i",
          "start_offset" : 3,
          "end_offset" : 4,
          "type" : "word",
          "position" : 1
        },
        {
          "token" : "am",
          "start_offset" : 5,
          "end_offset" : 7,
          "type" : "word",
          "position" : 2
        },
        {
          "token" : "PHPer-jiang,i",
          "start_offset" : 8,
          "end_offset" : 21,
          "type" : "word",
          "position" : 3
        },
        {
          "token" : "am",
          "start_offset" : 22,
          "end_offset" : 24,
          "type" : "word",
          "position" : 4
        },
        {
          "token" : "so",
          "start_offset" : 25,
          "end_offset" : 27,
          "type" : "word",
          "position" : 5
        },
        {
          "token" : "good。",
          "start_offset" : 28,
          "end_offset" : 33,
          "type" : "word",
          "position" : 6
        }
      ]
    }
    

    The whitespace analyzer splits only on whitespace; it keeps punctuation and, as the PHPer-jiang,i token shows, does not lowercase.

  4. Analyzer: stop

    GET _analyze
    {
      "analyzer": "stop",
      "text": "33 i am PHPer-jiang,i am so good。the history is new history"
    }

    Analysis result:

    {
      "tokens" : [
        {
          "token" : "i",
          "start_offset" : 3,
          "end_offset" : 4,
          "type" : "word",
          "position" : 0
        },
        {
          "token" : "am",
          "start_offset" : 5,
          "end_offset" : 7,
          "type" : "word",
          "position" : 1
        },
        {
          "token" : "phper",
          "start_offset" : 8,
          "end_offset" : 13,
          "type" : "word",
          "position" : 2
        },
        {
          "token" : "jiang",
          "start_offset" : 14,
          "end_offset" : 19,
          "type" : "word",
          "position" : 3
        },
        {
          "token" : "i",
          "start_offset" : 20,
          "end_offset" : 21,
          "type" : "word",
          "position" : 4
        },
        {
          "token" : "am",
          "start_offset" : 22,
          "end_offset" : 24,
          "type" : "word",
          "position" : 5
        },
        {
          "token" : "so",
          "start_offset" : 25,
          "end_offset" : 27,
          "type" : "word",
          "position" : 6
        },
        {
          "token" : "good",
          "start_offset" : 28,
          "end_offset" : 32,
          "type" : "word",
          "position" : 7
        },
        {
          "token" : "history",
          "start_offset" : 37,
          "end_offset" : 44,
          "type" : "word",
          "position" : 9
        },
        {
          "token" : "new",
          "start_offset" : 48,
          "end_offset" : 51,
          "type" : "word",
          "position" : 11
        },
        {
          "token" : "history",
          "start_offset" : 52,
          "end_offset" : 59,
          "type" : "word",
          "position" : 12
        }
      ]
    }
    

    Compared with standard, the stop analyzer additionally filters out stop words such as the, is, and in, and it drops punctuation and digits before tokenizing; like standard, it lowercases everything.
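
    The stop word list is configurable. A minimal sketch of a stop analyzer with a custom list (stop_demo and my_stop are hypothetical names):

    PUT stop_demo
    {
      "settings": {
        "analysis": {
          "analyzer": {
            "my_stop": {
              "type": "stop",
              "stopwords": ["the", "is", "so"]
            }
          }
        }
      }
    }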

  5. Analyzer: keyword

    GET _analyze
    {
      "analyzer": "keyword",
      "text": "33 i am PHPer-jiang,i am so good。the history is new history"
    }

    Analysis result:

    {
      "tokens" : [
        {
          "token" : "33 i am PHPer-jiang,i am so good。the history is new history",
          "start_offset" : 0,
          "end_offset" : 59,
          "type" : "word",
          "position" : 0
        }
      ]
    }
    

    The keyword analyzer does not actually tokenize at all; the whole text is emitted as a single token.
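
    This is useful for values that should only match as a whole, such as tags or status codes. A minimal mapping sketch (keyword_demo and tag are hypothetical names); in practice the dedicated keyword field type is usually preferred over a text field with the keyword analyzer:

    PUT keyword_demo
    {
      "mappings": {
        "properties": {
          "tag": {
            "type": "text",
            "analyzer": "keyword"
          }
        }
      }
    }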

  6. Analyzer: pattern

    GET _analyze
    {
      "analyzer": "pattern",
      "text": "33 i am PHPer-jiang,i am so good。the history is new history % hahah"
    }

    Analysis result:

    {
      "tokens" : [
        {
          "token" : "33",
          "start_offset" : 0,
          "end_offset" : 2,
          "type" : "word",
          "position" : 0
        },
        {
          "token" : "i",
          "start_offset" : 3,
          "end_offset" : 4,
          "type" : "word",
          "position" : 1
        },
        {
          "token" : "am",
          "start_offset" : 5,
          "end_offset" : 7,
          "type" : "word",
          "position" : 2
        },
        {
          "token" : "phper",
          "start_offset" : 8,
          "end_offset" : 13,
          "type" : "word",
          "position" : 3
        },
        {
          "token" : "jiang",
          "start_offset" : 14,
          "end_offset" : 19,
          "type" : "word",
          "position" : 4
        },
        {
          "token" : "i",
          "start_offset" : 20,
          "end_offset" : 21,
          "type" : "word",
          "position" : 5
        },
        {
          "token" : "am",
          "start_offset" : 22,
          "end_offset" : 24,
          "type" : "word",
          "position" : 6
        },
        {
          "token" : "so",
          "start_offset" : 25,
          "end_offset" : 27,
          "type" : "word",
          "position" : 7
        },
        {
          "token" : "good",
          "start_offset" : 28,
          "end_offset" : 32,
          "type" : "word",
          "position" : 8
        },
        {
          "token" : "the",
          "start_offset" : 33,
          "end_offset" : 36,
          "type" : "word",
          "position" : 9
        },
        {
          "token" : "history",
          "start_offset" : 37,
          "end_offset" : 44,
          "type" : "word",
          "position" : 10
        },
        {
          "token" : "is",
          "start_offset" : 45,
          "end_offset" : 47,
          "type" : "word",
          "position" : 11
        },
        {
          "token" : "new",
          "start_offset" : 48,
          "end_offset" : 51,
          "type" : "word",
          "position" : 12
        },
        {
          "token" : "history",
          "start_offset" : 52,
          "end_offset" : 59,
          "type" : "word",
          "position" : 13
        },
        {
          "token" : "hahah",
          "start_offset" : 62,
          "end_offset" : 67,
          "type" : "word",
          "position" : 14
        }
      ]
    }
    

    The pattern analyzer tokenizes with a regular expression, \W+ by default, i.e. it splits on every non-word character (anything other than letters, digits, and underscore). In the example above, %, spaces, commas, and the full stop are all non-word characters, so the text is split at each of them (and the lone % produces no token).
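
    The regular expression is configurable. A minimal sketch of a pattern analyzer that splits on commas only (pattern_demo and comma_analyzer are hypothetical names; note that the pattern analyzer lowercases by default):

    PUT pattern_demo
    {
      "settings": {
        "analysis": {
          "analyzer": {
            "comma_analyzer": {
              "type": "pattern",
              "pattern": ","
            }
          }
        }
      }
    }

    GET pattern_demo/_analyze
    {
      "analyzer": "comma_analyzer",
      "text": "a,b,c d"
    }

    This yields the tokens a, b, and "c d".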

  7. Analyzer: analysis-icu

    GET _analyze
    {
      "analyzer": "icu_analyzer",
      "text": "八百標兵奔北坡"
    }

    Analysis result:

    {
      "tokens" : [
        {
          "token" : "八百",
          "start_offset" : 0,
          "end_offset" : 2,
          "type" : "<IDEOGRAPHIC>",
          "position" : 0
        },
        {
          "token" : "標兵",
          "start_offset" : 2,
          "end_offset" : 4,
          "type" : "<IDEOGRAPHIC>",
          "position" : 1
        },
        {
          "token" : "奔北",
          "start_offset" : 4,
          "end_offset" : 6,
          "type" : "<IDEOGRAPHIC>",
          "position" : 2
        },
        {
          "token" : "坡",
          "start_offset" : 6,
          "end_offset" : 7,
          "type" : "<IDEOGRAPHIC>",
          "position" : 3
        }
      ]
    }
    

    The ICU analyzer segments Chinese text word by word. For Chinese, two dedicated analyzers are recommended: ik and thulac.
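
    Note that icu_analyzer is not bundled with ES: it comes from the analysis-icu plugin, which must be installed on every node (followed by a restart):

    elasticsearch-plugin install analysis-icu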

  8. Analyzer: ik

    1. Installation:

      1. From the ES bin directory, list the plugins that are already installed:

        elasticsearch-plugin list

      2. If the analysis-ik plugin is not installed, install the release that matches (or exceeds) your ES version; installing a lower version will fail:

        elasticsearch-plugin install https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v7.4.0/elasticsearch-analysis-ik-7.4.0.zip


    2. Usage

      1. ik_smart

        GET _analyze
        {
          "analyzer": "ik_smart",
          "text": "中華人民共和國"
        }

        Analysis result:

        {
          "tokens" : [
            {
              "token" : "中華人民共和國",
              "start_offset" : 0,
              "end_offset" : 7,
              "type" : "CN_WORD",
              "position" : 0
            }
          ]
        }
        

        ik_smart splits at the coarsest granularity: 中華人民共和國 stays as the single token 中華人民共和國, which suits phrase queries.

      2. ik_max_word

        GET _analyze
        {
          "analyzer": "ik_max_word",
          "text": "中華人民共和國"
        }

        Analysis result:

        {
          "tokens" : [
            {
              "token" : "中華人民共和國",
              "start_offset" : 0,
              "end_offset" : 7,
              "type" : "CN_WORD",
              "position" : 0
            },
            {
              "token" : "中華人民",
              "start_offset" : 0,
              "end_offset" : 4,
              "type" : "CN_WORD",
              "position" : 1
            },
            {
              "token" : "中華",
              "start_offset" : 0,
              "end_offset" : 2,
              "type" : "CN_WORD",
              "position" : 2
            },
            {
              "token" : "華人",
              "start_offset" : 1,
              "end_offset" : 3,
              "type" : "CN_WORD",
              "position" : 3
            },
            {
              "token" : "人民共和國",
              "start_offset" : 2,
              "end_offset" : 7,
              "type" : "CN_WORD",
              "position" : 4
            },
            {
              "token" : "人民",
              "start_offset" : 2,
              "end_offset" : 4,
              "type" : "CN_WORD",
              "position" : 5
            },
            {
              "token" : "共和國",
              "start_offset" : 4,
              "end_offset" : 7,
              "type" : "CN_WORD",
              "position" : 6
            },
            {
              "token" : "共和",
              "start_offset" : 4,
              "end_offset" : 6,
              "type" : "CN_WORD",
              "position" : 7
            },
            {
              "token" : "國",
              "start_offset" : 6,
              "end_offset" : 7,
              "type" : "CN_CHAR",
              "position" : 8
            }
          ]
        }
        

        ik_max_word splits at the finest granularity, emitting every plausible sub-word, which suits term queries.
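
        A common pattern is to combine the two: index with ik_max_word for maximum recall, and analyze queries with ik_smart so search terms stay coarse. A minimal mapping sketch (ik_demo and content are hypothetical names):

        PUT ik_demo
        {
          "mappings": {
            "properties": {
              "content": {
                "type": "text",
                "analyzer": "ik_max_word",
                "search_analyzer": "ik_smart"
              }
            }
          }
        }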

Updated 2019-10-23

  1. Bulk operations
    POST _bulk
    {"index":{"_index":"user","_id":1}}
    {"name":"PHPer"}
    {"create":{"_index":"user1","_id":1}}
    {"name":"Gopher"}
    {"update":{"_index":"user1","_id":1}}
    {"doc":{"name":"PHPer"}}
    {"delete":{"_index":"user1","_id":1}}

    index: creates the document; if it already exists, the existing one is deleted, the new one is saved, and the version number is incremented. create, by contrast, fails if the ID already exists. update must wrap the fields to change in a doc object. The response is below:

    {
      "took" : 21,
      "errors" : false,
      "items" : [
        {
          "index" : {
            "_index" : "user",
            "_type" : "_doc",
            "_id" : "1",
            "_version" : 26,
            "result" : "updated",
            "_shards" : {
              "total" : 2,
              "successful" : 1,
              "failed" : 0
            },
            "_seq_no" : 25,
            "_primary_term" : 1,
            "status" : 200
          }
        },
        {
          "create" : {
            "_index" : "user1",
            "_type" : "_doc",
            "_id" : "1",
            "_version" : 1,
            "result" : "created",
            "_shards" : {
              "total" : 2,
              "successful" : 1,
              "failed" : 0
            },
            "_seq_no" : 12,
            "_primary_term" : 1,
            "status" : 201
          }
        },
        {
          "update" : {
            "_index" : "user1",
            "_type" : "_doc",
            "_id" : "1",
            "_version" : 2,
            "result" : "updated",
            "_shards" : {
              "total" : 2,
              "successful" : 1,
              "failed" : 0
            },
            "_seq_no" : 13,
            "_primary_term" : 1,
            "status" : 200
          }
        },
        {
          "delete" : {
            "_index" : "user1",
            "_type" : "_doc",
            "_id" : "1",
            "_version" : 3,
            "result" : "deleted",
            "_shards" : {
              "total" : 2,
              "successful" : 1,
              "failed" : 0
            },
            "_seq_no" : 14,
            "_primary_term" : 1,
            "status" : 200
          }
        }
      ]
    }
    

    As you can see, bulk returns one result per action, and a failed action does not abort the others. If you run the request above repeatedly, the create action keeps failing because the document already exists in ES, while the index, update, and delete actions all succeed and bump the version number.
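
    When sending bulk requests outside Kibana, keep in mind that _bulk takes newline-delimited JSON: every action line and source line, including the last one, must end with a newline, and the Content-Type must be application/x-ndjson. A minimal sketch with curl (ops.ndjson is a hypothetical file containing the lines above):

    curl -s -H "Content-Type: application/x-ndjson" \
      -XPOST "localhost:9200/_bulk" \
      --data-binary "@ops.ndjson"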
