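Elasticsearch ships with many built-in analyzers, but none of the default ones handle Chinese well, so a separate plugin is needed. Two commonly used options are smartcn, which is based on ICTCLAS from the Chinese Academy of Sciences, and IKAnalyzer; both give reasonably good results. At the time of writing, IKAnalyzer does not yet support the latest release, Elasticsearch 2.2.0, whereas smartcn is officially supported and works with 2.2.0. smartcn provides an analyzer for Chinese or mixed Chinese-English text. It does not support custom dictionaries, however, so it is best suited to initial testing. The later part of this article describes how to make IKAnalyzer work with the latest version.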
smartcn
Install the analyzer: plugin install analysis-smartcn
Uninstall: plugin remove analysis-smartcn
Test:
Request: POST http://127.0.0.1:9200/_analyze/
{
"analyzer": "smartcn",
"text": "聯想是全球最大的筆記本廠商"
}
Response:
{
"tokens": [
{
"token": "聯想",
"start_offset": 0,
"end_offset": 2,
"type": "word",
"position": 0
},
{
"token": "是",
"start_offset": 2,
"end_offset": 3,
"type": "word",
"position": 1
},
{
"token": "全球",
"start_offset": 3,
"end_offset": 5,
"type": "word",
"position": 2
},
{
"token": "最",
"start_offset": 5,
"end_offset": 6,
"type": "word",
"position": 3
},
{
"token": "大",
"start_offset": 6,
"end_offset": 7,
"type": "word",
"position": 4
},
{
"token": "的",
"start_offset": 7,
"end_offset": 8,
"type": "word",
"position": 5
},
{
"token": "筆記本",
"start_offset": 8,
"end_offset": 11,
"type": "word",
"position": 6
},
{
"token": "廠商",
"start_offset": 11,
"end_offset": 13,
"type": "word",
"position": 7
}
]
}
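The request above can also be sent from a script. The following is a minimal sketch using only the Python standard library; the function names build_analyze_request and analyze are illustrative and not part of any Elasticsearch client, and the URL is assumed to be the same local node used in this article.

```python
import json
import urllib.request

# Assumption: Elasticsearch listens on the address used in the request above.
ES_URL = "http://127.0.0.1:9200/_analyze"


def build_analyze_request(analyzer, text, url=ES_URL):
    """Build a POST request for the _analyze endpoint (nothing is sent yet)."""
    body = json.dumps({"analyzer": analyzer, "text": text}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def analyze(analyzer, text, url=ES_URL):
    """Send the request and return the decoded JSON response as a dict."""
    request = build_analyze_request(analyzer, text, url)
    with urllib.request.urlopen(request) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

With a node running and the analysis-smartcn plugin installed, calling analyze("smartcn", "聯想是全球最大的筆記本廠商") should return a dictionary shaped like the response shown above. Building the request separately from sending it makes it possible to inspect the body without a live cluster.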
For comparison, look at what the standard analyzer produces: replace smartcn with standard in the request.
The response is then:
{
"tokens": [
{
"token": "聯",
"start_offset": 0,
"end_offset": 1,
"type": "<IDEOGRAPHIC>",
"position": 0
},
{
"token": "想",
"start_offset": 1,
"end_offset": 2,
"type": "<IDEOGRAPHIC>",
"position": 1
},
{
"token": "是",
"start_offset": 2,
"end_offset": 3,
"type": "<IDEOGRAPHIC>",
"position": 2
},
{
"token": "全",
"start_offset": 3,
"end_offset": 4,
"type": "<IDEOGRAPHIC>",
"position": 3
},
{
"token": "球",
"start_offset": 4,
"end_offset": 5,
"type": "<IDEOGRAPHIC>",
"position": 4
},
{
"token": "最",
"start_offset": 5,
"end_offset": 6,
"type": "<IDEOGRAPHIC>",
"position": 5
},
{
"token": "大",
"start_offset": 6,
"end_offset": 7,
"type": "<IDEOGRAPHIC>",
"position": 6
},
{
"token": "的",
"start_offset": 7,
"end_offset": 8,
"type": "<IDEOGRAPHIC>",
"position": 7
},
{
"token": "筆",
"start_offset": 8,
"end_offset": 9,
"type": "<IDEOGRAPHIC>",
"position": 8
},
{
"token": "記",
"start_offset": 9,
"end_offset": 10,
"type": "<IDEOGRAPHIC>",
"position": 9
},
{
"token": "本",
"start_offset": 10,
"end_offset": 11,
"type": "<IDEOGRAPHIC>",
"position": 10
},
{
"token": "廠",
"start_offset": 11,
"end_offset": 12,
"type": "<IDEOGRAPHIC>",
"position": 11
},
{
"token": "商",
"start_offset": 12,
"end_offset": 13,
"type": "<IDEOGRAPHIC>",
"position": 12
}
]
}
As the output shows, the standard analyzer is essentially unusable for Chinese: every single character becomes its own token.
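One rough way to quantify this difference is to measure how many tokens in a response are a single character long. The sketch below assumes the response has already been parsed into a Python dict; the helper names are hypothetical, and the two samples are abbreviated copies of the first two tokens from the responses above.

```python
def token_texts(response):
    """Return the token strings from an _analyze response, in position order."""
    tokens = sorted(response["tokens"], key=lambda t: t["position"])
    return [t["token"] for t in tokens]


def single_char_ratio(response):
    """Fraction of tokens that are exactly one character long."""
    texts = token_texts(response)
    if not texts:
        return 0.0
    return sum(1 for t in texts if len(t) == 1) / len(texts)


# First two tokens of the standard-analyzer response shown above.
standard_sample = {
    "tokens": [
        {"token": "聯", "start_offset": 0, "end_offset": 1,
         "type": "<IDEOGRAPHIC>", "position": 0},
        {"token": "想", "start_offset": 1, "end_offset": 2,
         "type": "<IDEOGRAPHIC>", "position": 1},
    ]
}

# First two tokens of the smartcn response shown above.
smartcn_sample = {
    "tokens": [
        {"token": "聯想", "start_offset": 0, "end_offset": 2,
         "type": "word", "position": 0},
        {"token": "是", "start_offset": 2, "end_offset": 3,
         "type": "word", "position": 1},
    ]
}

print(single_char_ratio(standard_sample))  # 1.0
print(single_char_ratio(smartcn_sample))   # 0.5
```

Applied to the full responses above, the same count gives 13 of 13 single-character tokens for standard and 4 of 8 for smartcn.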
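This article is an original work by 賽克藍德 (secisland). Please credit the author and the source when reposting.
IKAnalyzer support for version 2.2.0
The latest version on GitHub currently supports only Elasticsearch 2.1.1; the repository is at https://github.com/medcl/elasticsearch-analysis-ik. Since the latest Elasticsearch release is already 2.2.0, the plugin has to be rebuilt before it will work with it.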
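1. Download the source and extract it to any directory, then edit the pom.xml file in the elasticsearch-analysis-ik-master directory. Find the <elasticsearch.version> line and change the version number to 2.2.0.
2. Build the code: mvn package.
3. When the build finishes, the file elasticsearch-analysis-ik-1.7.0.zip is generated under target\releases.
4. Extract that archive into the Elasticsearch/plugins directory.
5. Add one line to the Elasticsearch configuration file: index.analysis.analyzer.ik.type : "ik"
6. Restart Elasticsearch.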
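Step 1 can be scripted instead of edited by hand. The following is a minimal sketch that rewrites the <elasticsearch.version> value in the text of a pom.xml; the function name and the sample fragment are hypothetical, and the real pom.xml contains much more than this.

```python
import re


def set_elasticsearch_version(pom_text, version):
    """Replace the value of the first <elasticsearch.version> element."""
    pattern = r"(<elasticsearch\.version>)[^<]*(</elasticsearch\.version>)"
    # A function is used as the replacement so that digits in `version`
    # are never misread as regex group references.
    new_text, count = re.subn(
        pattern,
        lambda m: m.group(1) + version + m.group(2),
        pom_text,
        count=1,
    )
    if count == 0:
        raise ValueError("no <elasticsearch.version> element found")
    return new_text


# Hypothetical fragment for illustration only.
sample_pom = """<properties>
    <elasticsearch.version>2.1.1</elasticsearch.version>
</properties>"""

print(set_elasticsearch_version(sample_pom, "2.2.0"))
```

In practice the function would be applied to the contents of elasticsearch-analysis-ik-master/pom.xml and the result written back before running mvn package.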
Test: send the same request as above, but change the analyzer to ik.
Response:
{
"tokens": [
{
"token": "聯想",
"start_offset": 0,
"end_offset": 2,
"type": "CN_WORD",
"position": 0
},
{
"token": "全球",
"start_offset": 3,
"end_offset": 5,
"type": "CN_WORD",
"position": 1
},
{
"token": "最大",
"start_offset": 5,
"end_offset": 7,
"type": "CN_WORD",
"position": 2
},
{
"token": "筆記本",
"start_offset": 8,
"end_offset": 11,
"type": "CN_WORD",
"position": 3
},
{
"token": "筆記",
"start_offset": 8,
"end_offset": 10,
"type": "CN_WORD",
"position": 4
},
{
"token": "筆",
"start_offset": 8,
"end_offset": 9,
"type": "CN_WORD",
"position": 5
},
{
"token": "記",
"start_offset": 9,
"end_offset": 10,
"type": "CN_CHAR",
"position": 6
},
{
"token": "本廠",
"start_offset": 10,
"end_offset": 12,
"type": "CN_WORD",
"position": 7
},
{
"token": "廠商",
"start_offset": 11,
"end_offset": 13,
"type": "CN_WORD",
"position": 8
}
]
}
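As the output shows, the two analyzers segment the text quite differently: ik omits 是 and 的 and emits overlapping candidates such as 筆記本, 筆記, and 本廠, while smartcn produces a single non-overlapping segmentation.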
To extend the dictionary, add the words you need to mydict.dic under config\ik\custom, then restart Elasticsearch. Note that the file must be encoded as UTF-8 without a BOM.
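Because the encoding requirement is easy to get wrong in some editors, the dictionary can also be updated from a script. This is a minimal sketch, assuming the dictionary holds one word per line; the helper names are hypothetical, and the demonstration writes to a temporary file instead of a real Elasticsearch installation.

```python
import codecs
import os
import tempfile


def append_words(dict_path, words):
    """Append words, one per line, encoded as UTF-8 without a BOM."""
    # If the existing file does not end with a newline, add one first so
    # the new word does not get glued onto the last existing entry.
    needs_newline = False
    if os.path.exists(dict_path) and os.path.getsize(dict_path) > 0:
        with open(dict_path, "rb") as f:
            f.seek(-1, os.SEEK_END)
            needs_newline = f.read(1) != b"\n"
    # The "utf-8" codec (unlike "utf-8-sig") never writes a BOM.
    with open(dict_path, "a", encoding="utf-8", newline="\n") as f:
        if needs_newline:
            f.write("\n")
        for word in words:
            f.write(word.strip() + "\n")


def has_bom(dict_path):
    """Return True if the file starts with the UTF-8 byte order mark."""
    with open(dict_path, "rb") as f:
        return f.read(3) == codecs.BOM_UTF8


# Demonstration against a temporary file.
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, "mydict.dic")
append_words(demo_path, ["賽克藍德"])
print(has_bom(demo_path))  # False
```

Running has_bom against an existing dictionary file is a quick way to check whether an editor has silently added a BOM.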
For example, after adding the word 賽克藍德, query again:
Request: POST http://127.0.0.1:9200/_analyze/
Body:
{
"analyzer": "ik",
"text": "賽克藍德是一家數據安全公司"
}
Response:
{
"tokens": [
{
"token": "賽克藍德",
"start_offset": 0,
"end_offset": 4,
"type": "CN_WORD",
"position": 0
},
{
"token": "克",
"start_offset": 1,
"end_offset": 2,
"type": "CN_WORD",
"position": 1
},
{
"token": "藍",
"start_offset": 2,
"end_offset": 3,
"type": "CN_WORD",
"position": 2
},
{
"token": "德",
"start_offset": 3,
"end_offset": 4,
"type": "CN_CHAR",
"position": 3
},
{
"token": "一家",
"start_offset": 5,
"end_offset": 7,
"type": "CN_WORD",
"position": 4
},
{
"token": "一",
"start_offset": 5,
"end_offset": 6,
"type": "TYPE_CNUM",
"position": 5
},
{
"token": "家",
"start_offset": 6,
"end_offset": 7,
"type": "COUNT",
"position": 6
},
{
"token": "數據",
"start_offset": 7,
"end_offset": 9,
"type": "CN_WORD",
"position": 7
},
{
"token": "安全",
"start_offset": 9,
"end_offset": 11,
"type": "CN_WORD",
"position": 8
},
{
"token": "公司",
"start_offset": 11,
"end_offset": 13,
"type": "CN_WORD",
"position": 9
}
]
}
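The result above shows that 賽克藍德 is now recognized as a single word.
賽克藍德 (secisland) will continue to analyze the features of the latest Elasticsearch releases, so stay tuned. You are also welcome to follow the secisland official account.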