Elasticsearch 2.2.0 Analysis Series: Chinese Word Segmentation

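Elasticsearch ships with many built-in analyzers, but none of the defaults handles Chinese well, so a separate plugin is needed. The two most commonly used options are smartcn, which is based on ICTCLAS from the Chinese Academy of Sciences, and IKAnalyzer; both give decent results. IKAnalyzer does not yet support the latest Elasticsearch 2.2.0 release. smartcn, by contrast, is officially supported, provides an analyzer for Chinese or mixed Chinese-English text, and works with 2.2.0. It does not support custom dictionaries, however, so it is best used for initial testing. The second half of this post shows how to get IKAnalyzer working on 2.2.0.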

smartcn

Install the analyzer: plugin install analysis-smartcn

Uninstall: plugin remove analysis-smartcn

Test:

Request: POST http://127.0.0.1:9200/_analyze/

{
  "analyzer": "smartcn",
  "text": "聯想是全球最大的筆記本廠商"
}

Response:

{
    "tokens": [
        {
            "token": "聯想", 
            "start_offset": 0, 
            "end_offset": 2, 
            "type": "word", 
            "position": 0
        }, 
        {
            "token": "是", 
            "start_offset": 2, 
            "end_offset": 3, 
            "type": "word", 
            "position": 1
        }, 
        {
            "token": "全球", 
            "start_offset": 3, 
            "end_offset": 5, 
            "type": "word", 
            "position": 2
        }, 
        {
            "token": "最", 
            "start_offset": 5, 
            "end_offset": 6, 
            "type": "word", 
            "position": 3
        }, 
        {
            "token": "大", 
            "start_offset": 6, 
            "end_offset": 7, 
            "type": "word", 
            "position": 4
        }, 
        {
            "token": "的", 
            "start_offset": 7, 
            "end_offset": 8, 
            "type": "word", 
            "position": 5
        }, 
        {
            "token": "筆記本", 
            "start_offset": 8, 
            "end_offset": 11, 
            "type": "word", 
            "position": 6
        }, 
        {
            "token": "廠商", 
            "start_offset": 11, 
            "end_offset": 13, 
            "type": "word", 
            "position": 7
        }
    ]
}
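The request above can also be sent from a script. The following is a minimal Python sketch using only the standard library. The helper names `build_analyze_request` and `analyze` are made up for illustration, and the sketch assumes a node is listening at the same address as in the request above with the relevant analyzer installed.

```python
import json
import urllib.request

ES_URL = "http://127.0.0.1:9200"  # same address as in the request above


def build_analyze_request(analyzer, text, base_url=ES_URL):
    """Build (but do not send) a POST to the _analyze endpoint."""
    body = json.dumps({"analyzer": analyzer, "text": text}).encode("utf-8")
    return urllib.request.Request(
        base_url + "/_analyze/",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def analyze(analyzer, text, base_url=ES_URL):
    """Send the request and return the token strings from the response."""
    request = build_analyze_request(analyzer, text, base_url)
    with urllib.request.urlopen(request) as response:
        result = json.loads(response.read().decode("utf-8"))
    return [item["token"] for item in result["tokens"]]
```

With a node running, `analyze("smartcn", "聯想是全球最大的筆記本廠商")` should yield the token strings from the response above. Passing a different analyzer name reuses the same request.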

For comparison, look at what the standard analyzer produces: replace smartcn with standard in the request.

The response is then:

{
    "tokens": [
        {
            "token": "聯", 
            "start_offset": 0, 
            "end_offset": 1, 
            "type": "<IDEOGRAPHIC>", 
            "position": 0
        }, 
        {
            "token": "想", 
            "start_offset": 1, 
            "end_offset": 2, 
            "type": "<IDEOGRAPHIC>", 
            "position": 1
        }, 
        {
            "token": "是", 
            "start_offset": 2, 
            "end_offset": 3, 
            "type": "<IDEOGRAPHIC>", 
            "position": 2
        }, 
        {
            "token": "全", 
            "start_offset": 3, 
            "end_offset": 4, 
            "type": "<IDEOGRAPHIC>", 
            "position": 3
        }, 
        {
            "token": "球", 
            "start_offset": 4, 
            "end_offset": 5, 
            "type": "<IDEOGRAPHIC>", 
            "position": 4
        }, 
        {
            "token": "最", 
            "start_offset": 5, 
            "end_offset": 6, 
            "type": "<IDEOGRAPHIC>", 
            "position": 5
        }, 
        {
            "token": "大", 
            "start_offset": 6, 
            "end_offset": 7, 
            "type": "<IDEOGRAPHIC>", 
            "position": 6
        }, 
        {
            "token": "的", 
            "start_offset": 7, 
            "end_offset": 8, 
            "type": "<IDEOGRAPHIC>", 
            "position": 7
        }, 
        {
            "token": "筆", 
            "start_offset": 8, 
            "end_offset": 9, 
            "type": "<IDEOGRAPHIC>", 
            "position": 8
        }, 
        {
            "token": "記", 
            "start_offset": 9, 
            "end_offset": 10, 
            "type": "<IDEOGRAPHIC>", 
            "position": 9
        }, 
        {
            "token": "本", 
            "start_offset": 10, 
            "end_offset": 11, 
            "type": "<IDEOGRAPHIC>", 
            "position": 10
        }, 
        {
            "token": "廠", 
            "start_offset": 11, 
            "end_offset": 12, 
            "type": "<IDEOGRAPHIC>", 
            "position": 11
        }, 
        {
            "token": "商", 
            "start_offset": 12, 
            "end_offset": 13, 
            "type": "<IDEOGRAPHIC>", 
            "position": 12
        }
    ]
}

As the output shows, the standard analyzer is essentially unusable for Chinese: every single character becomes its own token.
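To put numbers on the comparison, the token lists from the two responses above can be reduced to plain strings and counted. This Python sketch hard-codes the tokens copied from those responses; the `summarize` helper is illustrative only and not part of any Elasticsearch client.

```python
# Token strings copied from the smartcn and standard responses above.
SMARTCN_TOKENS = ["聯想", "是", "全球", "最", "大", "的", "筆記本", "廠商"]
STANDARD_TOKENS = ["聯", "想", "是", "全", "球", "最", "大", "的",
                   "筆", "記", "本", "廠", "商"]


def summarize(tokens):
    """Return (total token count, number of multi-character tokens)."""
    multi_char = [token for token in tokens if len(token) > 1]
    return len(tokens), len(multi_char)


print(summarize(SMARTCN_TOKENS))   # (8, 4)
print(summarize(STANDARD_TOKENS))  # (13, 0)
```

smartcn yields 8 tokens, 4 of them multi-character words, while standard yields 13 tokens and no multi-character words at all.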

This article is an original work by 賽克藍德 (secisland). Please credit the author and source when reposting.

IKAnalyzer support for 2.2.0

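The latest version on GitHub currently supports only Elasticsearch 2.1.1; the repository is at https://github.com/medcl/elasticsearch-analysis-ik. Since Elasticsearch is already at 2.2.0, a few changes are needed before the plugin will work.

1. Download the source and extract it to any directory, then edit pom.xml in the elasticsearch-analysis-ik-master directory: find the <elasticsearch.version> line and change the version number to 2.2.0.

2. Build the code with mvn package.

3. When the build finishes, elasticsearch-analysis-ik-1.7.0.zip is generated under target\releases.

4. Extract the zip file into the Elasticsearch/plugins directory.

5. Add one line to the configuration file: index.analysis.analyzer.ik.type : "ik"

6. Restart Elasticsearch.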
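Step 1 can be scripted instead of edited by hand. The sketch below is a hedged Python illustration: the `set_es_version` helper is hypothetical, and the sample pom.xml fragment is invented for the demo, on the assumption that the property is written as a single `<elasticsearch.version>` element. Check the real pom.xml before relying on it.

```python
import re


def set_es_version(pom_text, new_version):
    """Replace the text inside the first <elasticsearch.version> element."""
    pattern = r"(<elasticsearch\.version>)[^<]*(</elasticsearch\.version>)"
    updated, count = re.subn(
        pattern,
        lambda match: match.group(1) + new_version + match.group(2),
        pom_text,
        count=1,
    )
    if count == 0:
        raise ValueError("no <elasticsearch.version> element found")
    return updated


# Invented fragment standing in for the real pom.xml contents.
sample = "<properties><elasticsearch.version>2.1.1</elasticsearch.version></properties>"
print(set_es_version(sample, "2.2.0"))
# <properties><elasticsearch.version>2.2.0</elasticsearch.version></properties>
```

To apply it to the real file, read pom.xml as text, pass the contents through the function, and write the result back before running the build.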

Test: send the same request as above, but with the analyzer set to ik.

Response:

{
    "tokens": [
        {
            "token": "聯想", 
            "start_offset": 0, 
            "end_offset": 2, 
            "type": "CN_WORD", 
            "position": 0
        }, 
        {
            "token": "全球", 
            "start_offset": 3, 
            "end_offset": 5, 
            "type": "CN_WORD", 
            "position": 1
        }, 
        {
            "token": "最大", 
            "start_offset": 5, 
            "end_offset": 7, 
            "type": "CN_WORD", 
            "position": 2
        }, 
        {
            "token": "筆記本", 
            "start_offset": 8, 
            "end_offset": 11, 
            "type": "CN_WORD", 
            "position": 3
        }, 
        {
            "token": "筆記", 
            "start_offset": 8, 
            "end_offset": 10, 
            "type": "CN_WORD", 
            "position": 4
        }, 
        {
            "token": "筆", 
            "start_offset": 8, 
            "end_offset": 9, 
            "type": "CN_WORD", 
            "position": 5
        }, 
        {
            "token": "記", 
            "start_offset": 9, 
            "end_offset": 10, 
            "type": "CN_CHAR", 
            "position": 6
        }, 
        {
            "token": "本廠", 
            "start_offset": 10, 
            "end_offset": 12, 
            "type": "CN_WORD", 
            "position": 7
        }, 
        {
            "token": "廠商", 
            "start_offset": 11, 
            "end_offset": 13, 
            "type": "CN_WORD", 
            "position": 8
        }
    ]
}

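The two analyzers clearly segment the same sentence differently: ik drops 是 and 的 and emits overlapping tokens such as 筆記本, 筆記, and 本廠, whereas smartcn produces a single non-overlapping sequence.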
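One way to see the difference is to check the offsets for overlapping tokens. The Python sketch below uses the (token, start_offset, end_offset) values copied from the ik and smartcn responses above; `overlapping_pairs` is an illustrative helper, not an Elasticsearch API.

```python
# (token, start_offset, end_offset) copied from the ik response above.
IK_TOKENS = [
    ("聯想", 0, 2), ("全球", 3, 5), ("最大", 5, 7), ("筆記本", 8, 11),
    ("筆記", 8, 10), ("筆", 8, 9), ("記", 9, 10), ("本廠", 10, 12),
    ("廠商", 11, 13),
]
# Same fields copied from the smartcn response above.
SMARTCN_TOKENS = [
    ("聯想", 0, 2), ("是", 2, 3), ("全球", 3, 5), ("最", 5, 6),
    ("大", 6, 7), ("的", 7, 8), ("筆記本", 8, 11), ("廠商", 11, 13),
]


def overlapping_pairs(tokens):
    """Return pairs of tokens whose character ranges intersect."""
    pairs = []
    for index, (tok_a, start_a, end_a) in enumerate(tokens):
        for tok_b, start_b, end_b in tokens[index + 1:]:
            # Half-open ranges [start, end) intersect when each starts
            # before the other ends.
            if start_a < end_b and start_b < end_a:
                pairs.append((tok_a, tok_b))
    return pairs


print(len(overlapping_pairs(IK_TOKENS)))   # 7
print(overlapping_pairs(SMARTCN_TOKENS))   # []
```

Overlapping tokens tend to raise recall at the cost of a larger index, so which behavior is preferable depends on the search use case.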

To extend the dictionary, add the required terms to mydict.dic under config\ik\custom and restart Elasticsearch. Note that the file must be encoded as UTF-8 without a BOM.
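Because some editors silently add a BOM, writing the dictionary file from a script is one way to keep the encoding under control. The following Python sketch assumes one term per line in the file; the `add_term` helper is hypothetical, and the demo writes to a temporary directory rather than the real config\ik\custom folder.

```python
import codecs
import os
import tempfile


def add_term(dict_path, term):
    """Add a term to the dictionary file, writing UTF-8 without a BOM."""
    existing = b""
    if os.path.exists(dict_path):
        with open(dict_path, "rb") as handle:
            existing = handle.read()
    # Drop a BOM if an editor added one earlier.
    if existing.startswith(codecs.BOM_UTF8):
        existing = existing[len(codecs.BOM_UTF8):]
    lines = existing.decode("utf-8").splitlines()
    if term not in lines:
        lines.append(term)
    # The "utf-8" codec never writes a BOM (unlike "utf-8-sig").
    with open(dict_path, "w", encoding="utf-8", newline="\n") as handle:
        handle.write("\n".join(lines) + "\n")


# Demo against a temporary file standing in for mydict.dic.
demo_path = os.path.join(tempfile.mkdtemp(), "mydict.dic")
add_term(demo_path, "賽克藍德")
with open(demo_path, "rb") as handle:
    raw = handle.read()
print(raw.startswith(codecs.BOM_UTF8))  # False
```

Elasticsearch still has to be restarted after the file changes, as described above.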

For example, after adding the term 賽克藍德, run the query again:

Request: POST http://127.0.0.1:9200/_analyze/

Body:

{
  "analyzer": "ik",
  "text": "賽克藍德是一家數據安全公司"
}

Response:

{
    "tokens": [
        {
            "token": "賽克藍德", 
            "start_offset": 0, 
            "end_offset": 4, 
            "type": "CN_WORD", 
            "position": 0
        }, 
        {
            "token": "克", 
            "start_offset": 1, 
            "end_offset": 2, 
            "type": "CN_WORD", 
            "position": 1
        }, 
        {
            "token": "藍", 
            "start_offset": 2, 
            "end_offset": 3, 
            "type": "CN_WORD", 
            "position": 2
        }, 
        {
            "token": "德", 
            "start_offset": 3, 
            "end_offset": 4, 
            "type": "CN_CHAR", 
            "position": 3
        }, 
        {
            "token": "一家", 
            "start_offset": 5, 
            "end_offset": 7, 
            "type": "CN_WORD", 
            "position": 4
        }, 
        {
            "token": "一", 
            "start_offset": 5, 
            "end_offset": 6, 
            "type": "TYPE_CNUM", 
            "position": 5
        }, 
        {
            "token": "家", 
            "start_offset": 6, 
            "end_offset": 7, 
            "type": "COUNT", 
            "position": 6
        }, 
        {
            "token": "數據", 
            "start_offset": 7, 
            "end_offset": 9, 
            "type": "CN_WORD", 
            "position": 7
        }, 
        {
            "token": "安全", 
            "start_offset": 9, 
            "end_offset": 11, 
            "type": "CN_WORD", 
            "position": 8
        }, 
        {
            "token": "公司", 
            "start_offset": 11, 
            "end_offset": 13, 
            "type": "CN_WORD", 
            "position": 9
        }
    ]
}

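The result above shows that 賽克藍德 is now recognized as a single term.

賽克藍德 (secisland) will continue to analyze the features of the latest Elasticsearch releases in upcoming posts, so stay tuned. You are also welcome to follow the secisland official account (公衆號).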
