A tokenizer receives a stream of characters, breaks it up into individual tokens (usually individual words), and outputs a stream of tokens.
For example, the whitespace tokenizer splits text whenever it sees whitespace: it would split "I am zyn" into [I, am, zyn].
The tokenizer is also responsible for recording the order, or position, of each term (used for phrase and word-proximity queries), and the start and end character offsets of the original word each term represents (used for highlighting search matches). Elasticsearch ships with a number of built-in tokenizers, which can also be used to build custom analyzers.
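To try the whitespace example above yourself, here is a minimal sketch (it assumes ES is reachable at http://192.168.56.10:9200, the address used later in this post; adjust to your setup):

curl -X POST "http://192.168.56.10:9200/_analyze?pretty" -H 'Content-Type: application/json' -d'
{
  "tokenizer": "whitespace",
  "text": "I am zyn"
}'

The response contains three tokens, I / am / zyn, each carrying its position and its start/end offsets.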
On tokenizers: https://www.elastic.co/guide/en/elasticsearch/reference/7.6/analysis.html
The standard tokenizer (standard) splits on word boundaries: whitespace, plus most punctuation, which it strips.
POST _analyze
{
  "tokenizer": "standard",
  "text": "Hello the world."
}
Result:
{ "tokens" : [ { "token" : "Hello", "start_offset" : 0, "end_offset" : 5, "type" : "<ALPHANUM>", "position" : 0 }, { "token" : "the", "start_offset" : 6, "end_offset" : 9, "type" : "<ALPHANUM>", "position" : 1 }, { "token" : "world", "start_offset" : 10, "end_offset" : 15, "type" : "<ALPHANUM>", "position" : 2 } ] }
However, the tokenizers that ship with ES are all geared toward English; for Chinese you need to install a dedicated tokenizer.
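You can see the problem by running the standard tokenizer on a Chinese sentence (same sketch style as above, host assumed):

curl -X POST "http://192.168.56.10:9200/_analyze?pretty" -H 'Content-Type: application/json' -d'
{
  "tokenizer": "standard",
  "text": "我是中國人"
}'

The standard tokenizer falls back to one token per Chinese character (我 / 是 / 中 / 國 / 人), which is useless for word-level Chinese search.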
The IK tokenizer: https://github.com/medcl/elasticsearch-analysis-ik/releases
Check your ES version so you can install the matching version of the IK tokenizer:
[vagrant@10 ~]$ curl http://192.168.56.10:9200/
{
  "name" : "3cafb1a4b1b3",
  "cluster_name" : "elasticsearch",
  "cluster_uuid" : "0cNA2l38RFK6LMHislSvNg",
  "version" : {
    "number" : "7.4.2",
    "build_flavor" : "default",
    "build_type" : "docker",
    "build_hash" : "2f90bbf7b93631e52bafb59b3b049cb44ec25e96",
    "build_date" : "2019-10-28T20:40:44.881551Z",
    "build_snapshot" : false,
    "lucene_version" : "8.2.0",
    "minimum_wire_compatibility_version" : "6.8.0",
    "minimum_index_compatibility_version" : "6.0.0-beta1"
  },
  "tagline" : "You Know, for Search"
}
The v7.4.2 release no longer shows up on the GitHub releases page, but entering the download URL directly still works:
https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v7.4.2/elasticsearch-analysis-ik-7.4.2.zip
Once downloaded, unzip it into the plugins directory inside the ES container (or into the corresponding mapped directory on the host) and it is ready to use.
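For the manual route, a minimal sketch of that copy step (it assumes the container ID starts with 3caf, as in the curl output above, and the image's default plugin path /usr/share/elasticsearch/plugins):

unzip elasticsearch-analysis-ik-7.4.2.zip -d ik
docker cp ik 3caf:/usr/share/elasticsearch/plugins/
docker restart 3caf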
For free licensed downloads of Xshell and Xftp, see: https://www.cnblogs.com/qingshan-tang/p/12855807.html
The download was far too slow for me, so I went back to installing from the command line instead!
Connect to the VM with vagrant ssh, switch to root with su root (the password is vagrant), then install wget:
vagrant ssh
su root
vagrant
yum install wget
Once wget is installed, change to the ES plugins directory and download IK:
[root@10 /]# cd /mydata/elasticsearch/plugins/
[root@10 plugins]# wget https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v7.4.2/elasticsearch-analysis-ik-7.4.2.zip
Install the unzip command, create an ik directory, extract the archive into it, then delete the zip:
yum install unzip
mkdir ik
unzip elasticsearch-analysis-ik-7.4.2.zip -d ik
rm elasticsearch-analysis-ik-7.4.2.zip
Make the ik folder readable, writable, and executable:
[root@10 plugins]# chmod -R 777 ik/
To verify that the IK tokenizer is installed, enter the ES container:
docker exec -it 3caf /bin/bash
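Inside the container, list the installed plugins with the plugin CLI that ships with ES (path as laid out in the official image):

/usr/share/elasticsearch/bin/elasticsearch-plugin list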
If ik shows up in the listing, the IK tokenizer installed successfully; then exit the container and restart ES.
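For example (reusing the container ID prefix from above):

exit
docker restart 3caf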
Try it out:
ik_smart: smart segmentation (the coarsest-grained split)
POST _analyze
{
  "tokenizer": "ik_smart",
  "text": "我是中國人"
}
Result:
{ "tokens" : [ { "token" : "我", "start_offset" : 0, "end_offset" : 1, "type" : "CN_CHAR", "position" : 0 }, { "token" : "是", "start_offset" : 1, "end_offset" : 2, "type" : "CN_CHAR", "position" : 1 }, { "token" : "中國人", "start_offset" : 2, "end_offset" : 5, "type" : "CN_WORD", "position" : 2 } ] }
ik_max_word: maximum word combinations (the finest-grained split; emits every word it can form)
POST _analyze
{
  "tokenizer": "ik_max_word",
  "text": "我是中國人"
}
Result:
{ "tokens" : [ { "token" : "我", "start_offset" : 0, "end_offset" : 1, "type" : "CN_CHAR", "position" : 0 }, { "token" : "是", "start_offset" : 1, "end_offset" : 2, "type" : "CN_CHAR", "position" : 1 }, { "token" : "中國人", "start_offset" : 2, "end_offset" : 5, "type" : "CN_WORD", "position" : 2 }, { "token" : "中國", "start_offset" : 2, "end_offset" : 4, "type" : "CN_WORD", "position" : 3 }, { "token" : "國人", "start_offset" : 3, "end_offset" : 5, "type" : "CN_WORD", "position" : 4 } ] }