Distributed Search Engine ElasticSearch

What Is the IK Analyzer

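By default, Chinese text is tokenized one character at a time, which is clearly not
what we want, so we need to install a Chinese analyzer to solve this.
IK is a relatively simple Chinese analyzer developed in China. Although its developer
stopped maintaining it after 2012, IK remains quite popular in production projects.
This post shows how to use the IK Chinese analyzer.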

Installing the IK Analyzer

Download: https://github.com/medcl/elasticsearch-analysis-ik/releases (get version 5.6.8).
The plugin version must match your elasticsearch version.

First unzip the archive and rename the extracted elasticsearch folder to ik.
Copy the ik folder into the elasticsearch/plugins directory.
Restart elasticsearch to load the IK analyzer.
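
The unzip, rename, and copy steps above can also be scripted. The following is only a minimal sketch: the function name install_ik and the paths in the usage comment are hypothetical, and it assumes the archive unpacks to a top-level folder named elasticsearch, as described above.

```python
import zipfile
from pathlib import Path


def install_ik(zip_path, es_home):
    """Unpack the IK release archive into <es_home>/plugins/ik.

    Assumes (as in the steps above) that the archive contains a single
    top-level folder named "elasticsearch".
    """
    plugins_dir = Path(es_home) / "plugins"
    target = plugins_dir / "ik"
    if target.exists():
        raise FileExistsError(f"{target} already exists; remove it first")

    # extractall creates plugins_dir if it is missing
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(plugins_dir)

    # Rename the unpacked "elasticsearch" folder to "ik"
    (plugins_dir / "elasticsearch").rename(target)
    return target


# Usage (hypothetical paths, adjust to your machine):
# install_ik("elasticsearch-analysis-ik-5.6.8.zip", "/opt/elasticsearch-5.6.8")
```

elasticsearch still has to be restarted afterwards; the script only prepares the plugin folder.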

Testing the IK Analyzer

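IK provides two analyzers, ik_smart and ik_max_word.
ik_smart produces the coarsest split (the fewest tokens), while ik_max_word produces the finest-grained split.
Let's try each in turn.
(1) Coarsest split: enter the following address in the browser's address bar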

http://127.0.0.1:9200/_analyze?analyzer=ik_smart&pretty=true&text=IK分詞器測試

The output is:

{
 "tokens": [
     {
         "token": "ik",
         "start_offset": 0,
         "end_offset": 2,
         "type": "ENGLISH",
         "position": 0
     },
     {
         "token": "分詞器",
         "start_offset": 2,
         "end_offset": 5,
         "type": "CN_WORD",
         "position": 1
     },
     {
         "token": "測試",
         "start_offset": 5,
         "end_offset": 7,
         "type": "CN_WORD",
         "position": 2
     }
 ]
}
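
The same request can be sent from a script instead of the browser. The browser percent-encodes the Chinese text in the address bar for us; in code we have to do it ourselves. This is a sketch only: analyze_url and analyze are hypothetical helper names, and the query-string form of _analyze is assumed to behave as it does in the 5.6.8 setup used here (newer elasticsearch versions may expect a JSON request body instead).

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def analyze_url(text, analyzer="ik_smart", base="http://127.0.0.1:9200"):
    """Build the _analyze URL, percent-encoding the text as UTF-8."""
    params = {"analyzer": analyzer, "pretty": "true", "text": text}
    return f"{base}/_analyze?{urlencode(params)}"


def analyze(text, analyzer="ik_smart"):
    """Call a locally running elasticsearch and return the parsed JSON."""
    with urlopen(analyze_url(text, analyzer)) as response:
        return json.load(response)


# Usage (needs elasticsearch with the IK plugin running on 127.0.0.1:9200):
# result = analyze("IK分詞器測試")
# print([t["token"] for t in result["tokens"]])
```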

(2) Finest-grained split: enter the following address in the browser's address bar

http://127.0.0.1:9200/_analyze?analyzer=ik_max_word&pretty=true&text=IK分詞器測試

{
 "tokens": [
     {
         "token": "ik",
         "start_offset": 0,
         "end_offset": 2,
         "type": "ENGLISH",
         "position": 0
     },
     {
         "token": "分詞器",
         "start_offset": 2,
         "end_offset": 5,
         "type": "CN_WORD",
         "position": 1
     },
     {
         "token": "分詞",
         "start_offset": 2,
         "end_offset": 4,
         "type": "CN_WORD",
         "position": 2
     },
     {
         "token": "器",
         "start_offset": 4,
         "end_offset": 5,
         "type": "CN_CHAR",
         "position": 3
     },
     {
         "token": "測試",
         "start_offset": 5,
         "end_offset": 7,
         "type": "CN_WORD",
         "position": 4
     }
 ]
}
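
To make the difference between the two analyzers easier to see, we can compare just the token strings. The sketch below uses the two responses shown above, trimmed to the "token" field; token_texts is a hypothetical helper name.

```python
def token_texts(response):
    """Return the token strings from an _analyze response, in order."""
    return [t["token"] for t in response["tokens"]]


# The two responses shown above, reduced to the "token" field
smart = {"tokens": [{"token": "ik"}, {"token": "分詞器"}, {"token": "測試"}]}
max_word = {"tokens": [{"token": "ik"}, {"token": "分詞器"}, {"token": "分詞"},
                       {"token": "器"}, {"token": "測試"}]}

# Tokens that only the finest-grained analyzer emits
smart_tokens = token_texts(smart)
extra = [t for t in token_texts(max_word) if t not in smart_tokens]
print(extra)  # ['分詞', '器']
```

For this input, ik_max_word returns everything ik_smart returns plus the shorter words contained in 分詞器.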

Custom Dictionary

Now let's test "廖權名博客". The result in the browser is as follows:
http://127.0.0.1:9200/_analyze?analyzer=ik_smart&pretty=true&text=廖權名博客

{
 "tokens": [
     {
         "token": "廖",
         "start_offset": 0,
         "end_offset": 1,
         "type": "CN_CHAR",
         "position": 0
     },
     {
         "token": "權",
         "start_offset": 1,
         "end_offset": 2,
         "type": "CN_CHAR",
         "position": 1
     },
     {
         "token": "名",
         "start_offset": 2,
         "end_offset": 3,
         "type": "CN_CHAR",
         "position": 2
     },
     {
         "token": "博客",
         "start_offset": 3,
         "end_offset": 5,
         "type": "CN_WORD",
         "position": 3
     }
 ]
}

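The default segmentation does not recognize "廖權名" as a single word. To make the
system treat "廖權名" as one word, we need to edit a custom dictionary.
Steps:
(1) Go to the elasticsearch/plugins/ik/config directory.
(2) Create a file named my.dic with the following content: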

廖權名

(3) Edit IKAnalyzer.cfg.xml (in the ik/config directory):

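<properties>
 <comment>IK Analyzer extension configuration</comment>
 <!-- Users can configure their own extension dictionaries here -->
 <entry key="ext_dict">my.dic</entry>
 <!-- Users can configure their own extension stop-word dictionaries here -->
 <entry key="ext_stopwords"></entry>
</properties>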
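
The dictionary steps can be scripted as well, which helps when the word list grows. This is a rough sketch under a few assumptions: add_words and configured_dicts are hypothetical helper names, the dictionary is written as UTF-8 with one word per line, and multiple ext_dict entries are assumed to be separated by semicolons.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def add_words(config_dir, words, dict_name="my.dic"):
    """Add words to the dictionary file, one per line, skipping duplicates."""
    dict_path = Path(config_dir) / dict_name
    lines = []
    if dict_path.exists():
        lines = dict_path.read_text(encoding="utf-8").splitlines()
    for word in words:
        if word not in lines:
            lines.append(word)
    # Write UTF-8 explicitly so the result does not depend on the OS default
    dict_path.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return dict_path


def configured_dicts(cfg_path):
    """Return the file names listed under ext_dict in IKAnalyzer.cfg.xml."""
    root = ET.parse(cfg_path).getroot()
    for entry in root.iter("entry"):
        if entry.get("key") == "ext_dict":
            # Assumption: several dictionaries are separated by ";"
            return [n.strip() for n in (entry.text or "").split(";") if n.strip()]
    return []


# Usage (hypothetical path, adjust to your installation):
# config_dir = Path("/opt/elasticsearch-5.6.8/plugins/ik/config")
# add_words(config_dir, ["廖權名"])
# print(configured_dicts(config_dir / "IKAnalyzer.cfg.xml"))
```

Checking configured_dicts before restarting is a quick way to catch a my.dic that exists on disk but is not referenced in the configuration.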

Restart elasticsearch and test the segmentation in the browser:

{
  "tokens" : [
    {
      "token" : "廖權名",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "CN_WORD",
      "position" : 0
    }
  ]
}
