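Elasticsearch's built-in analyzers perform poorly on Chinese text, so the IK analyzer can be used for tokenization instead.
The IK analyzer provides two analyzers:
ik_max_word: splits the text at the finest granularity and exhausts the possible combinations. For example, “中華人民共和國國歌” is split into “中華人民共和國,中華人民,中華,華人,人民共和國,人民,人,民,共和國,共和,和,國國,國歌”;
ik_smart: splits the text at the coarsest granularity. For example, “中華人民共和國國歌” is split into “中華人民共和國,國歌”.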
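To make the difference between the two granularities concrete, here is a minimal sketch in Java. It is not IK's actual algorithm: the tiny dictionary, the class name GranularityDemo, and both method names are made up for illustration. allMatches lists every dictionary word found at any position (roughly the spirit of ik_max_word), while longestMatches greedily takes the longest dictionary word at each position (roughly the spirit of ik_smart).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class GranularityDemo {
    // Hypothetical mini dictionary; the real IK dictionary is far larger.
    static final Set<String> DICT = new HashSet<>(Arrays.asList(
            "中華人民共和國", "中華人民", "中華", "華人", "人民共和國",
            "人民", "共和國", "共和", "國歌"));

    // Fine-grained: collect every dictionary word that starts at any position.
    static List<String> allMatches(String text) {
        List<String> out = new ArrayList<>();
        for (int start = 0; start < text.length(); start++) {
            for (int end = text.length(); end > start; end--) {
                String candidate = text.substring(start, end);
                if (DICT.contains(candidate)) {
                    out.add(candidate);
                }
            }
        }
        return out;
    }

    // Coarse-grained: at each position take the longest dictionary word,
    // falling back to a single character when nothing matches.
    static List<String> longestMatches(String text) {
        List<String> out = new ArrayList<>();
        int start = 0;
        while (start < text.length()) {
            int end = text.length();
            while (end > start + 1 && !DICT.contains(text.substring(start, end))) {
                end--;
            }
            out.add(text.substring(start, end));
            start = end;
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "中華人民共和國國歌";
        System.out.println(allMatches(text));
        System.out.println(longestMatches(text));
    }
}
```

With this toy dictionary, longestMatches returns the same two tokens as the ik_smart example above, and allMatches returns the multi-character tokens of the ik_max_word example. The single characters and “國國” that IK also emits are outside what this sketch models.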
1. Install Maven
Upload the archive to the master node and extract it into the /opt/module/ directory.
2. Configure the Maven settings file
<?xml version="1.0" encoding="UTF-8"?>
<settings xmlns="http://maven.apache.org/SETTINGS/1.0.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/SETTINGS/1.0.0 http://maven.apache.org/xsd/settings-1.0.0.xsd">
<!-- Local repository location -->
<localRepository>/opt/module/apache-maven-3.0.5/repository</localRepository>
<pluginGroups> </pluginGroups>
<proxies> </proxies>
<servers> </servers>
<mirrors>
<mirror>
<id>nexus-aliyun</id>
<name>Nexus aliyun</name>
<url>http://maven.aliyun.com/nexus/content/groups/public/</url>
<mirrorOf>central</mirrorOf>
</mirror>
<mirror>
<id>repo2</id>
<name>Mirror from Maven Repo2</name>
<url>http://repo2.maven.org/maven2/</url>
<mirrorOf>central</mirrorOf>
</mirror>
<mirror>
<id>centor</id>
<name>Mirror from Maven central</name>
<url>http://central.maven.org/maven2/</url>
<mirrorOf>central</mirrorOf>
</mirror>
</mirrors>
<profiles>
<profile>
<id>jdk-1.8</id>
<activation>
<activeByDefault>true</activeByDefault>
<jdk>1.8</jdk>
</activation>
<properties>
<maven.compiler.source>1.8</maven.compiler.source>
<maven.compiler.target>1.8</maven.compiler.target>
<maven.compiler.compilerVersion>1.8</maven.compiler.compilerVersion>
</properties>
</profile>
</profiles>
</settings>
3. Create the repository directory
[dendan@master apache-maven-3.0.5]$ mkdir repository
4. Edit the profile
Edit the file as the root user:
vi /etc/profile
Add the following:
# maven
export MAVEN_HOME=/opt/module/apache-maven-3.0.5
export PATH=$PATH:$MAVEN_HOME/bin
Reload the profile:
source /etc/profile
5. Unzip the IK analyzer source
[dendan@master software]$ unzip elasticsearch-analysis-ik-master.zip
6. Build and package
[dendan@master software]$ cd elasticsearch-analysis-ik-master
[dendan@master elasticsearch-analysis-ik-master]$ mvn package -Pdist,native -DskipTests -Dtar
7. Unzip and copy the packaged file
[dendan@master releases]$ pwd
/opt/software/elasticsearch-analysis-ik-master/target/releases
[dendan@master releases]$ ll
總用量 4400
-rw-rw-r--. 1 dendan dendan 4502368 6月 24 10:21 elasticsearch-analysis-ik-5.6.1.zip
[dendan@master releases]$ unzip elasticsearch-analysis-ik-5.6.1.zip
[dendan@master releases]$ ll
總用量 4404
drwxrwxrwx. 3 dendan dendan 4096 6月 24 10:21 elasticsearch
8. Install the IK analyzer plugin
Copy the elasticsearch directory into the plugins directory of the Elasticsearch installation.
[dendan@master releases]$ cp -r elasticsearch /opt/module/elasticsearch-5.6.1/plugins/
9. Start Elasticsearch
[dendan@master elasticsearch-5.6.1]$ bin/elasticsearch
The log should contain a line like this:
[2020-06-24T10:50:12,762][INFO ][o.e.p.PluginsService ] [node-111] loaded plugin [analysis-ik]
10. Test the IK analyzer
smart mode:
This mode splits the sentence into as few tokens as possible.
[dendan@master elasticsearch-5.6.1]$ curl -XGET 'http://master:9200/_analyze?pretty&analyzer=ik_smart' -d '它提供了一個分佈式多用戶能力的全文搜索引擎,基於RESTful web接口'
The result is as follows:
{
"tokens" : [
{
"token" : "它",
"start_offset" : 0,
"end_offset" : 1,
"type" : "CN_CHAR",
"position" : 0
},
{
"token" : "提供",
"start_offset" : 1,
"end_offset" : 3,
"type" : "CN_WORD",
"position" : 1
},
{
"token" : "了",
"start_offset" : 3,
"end_offset" : 4,
"type" : "CN_CHAR",
"position" : 2
},
{
"token" : "一個",
"start_offset" : 4,
"end_offset" : 6,
"type" : "CN_WORD",
"position" : 3
},
{
"token" : "分佈式",
"start_offset" : 6,
"end_offset" : 9,
"type" : "CN_WORD",
"position" : 4
},
{
"token" : "多用戶",
"start_offset" : 9,
"end_offset" : 12,
"type" : "CN_WORD",
"position" : 5
},
{
"token" : "能力",
"start_offset" : 12,
"end_offset" : 14,
"type" : "CN_WORD",
"position" : 6
},
{
"token" : "的",
"start_offset" : 14,
"end_offset" : 15,
"type" : "CN_CHAR",
"position" : 7
},
{
"token" : "全文",
"start_offset" : 15,
"end_offset" : 17,
"type" : "CN_WORD",
"position" : 8
},
{
"token" : "搜索引擎",
"start_offset" : 17,
"end_offset" : 21,
"type" : "CN_WORD",
"position" : 9
},
{
"token" : "基於",
"start_offset" : 22,
"end_offset" : 24,
"type" : "CN_WORD",
"position" : 10
},
{
"token" : "restful",
"start_offset" : 24,
"end_offset" : 31,
"type" : "ENGLISH",
"position" : 11
},
{
"token" : "web",
"start_offset" : 32,
"end_offset" : 35,
"type" : "ENGLISH",
"position" : 12
},
{
"token" : "接口",
"start_offset" : 35,
"end_offset" : 37,
"type" : "CN_WORD",
"position" : 13
}
]
}
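The start_offset and end_offset fields in this output are character positions in the original request text, with end_offset exclusive. The small sketch below uses only values copied from the output above (the class name OffsetDemo and the slice helper are made up) to show how the offsets map back to substrings. Note that the English token is emitted in lowercase, so the slice for restful differs from the token in casing.

```java
public class OffsetDemo {
    // Return the part of the original text that a token's offsets point at.
    static String slice(String text, int startOffset, int endOffset) {
        return text.substring(startOffset, endOffset);
    }

    public static void main(String[] args) {
        String text = "它提供了一個分佈式多用戶能力的全文搜索引擎,基於RESTful web接口";
        System.out.println(slice(text, 17, 21)); // offsets of token "搜索引擎"
        System.out.println(slice(text, 24, 31)); // offsets of token "restful"
        System.out.println(slice(text, 35, 37)); // offsets of token "接口"
    }
}
```

Offset 21 (the comma) and offset 31 (the space) belong to no token, which is why the offsets jump from 21 to 22 and from 31 to 32.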
max_word mode:
In this mode, the sentence is split into as many tokens as possible.
[dendan@master elasticsearch-5.6.1]$ curl -XGET 'http://master:9200/_analyze?pretty&analyzer=ik_max_word' -d '它提供了一個分佈式多用戶能力的全文搜索引擎,基於RESTful web接口'
{
"tokens" : [
{
"token" : "它",
"start_offset" : 0,
"end_offset" : 1,
"type" : "CN_CHAR",
"position" : 0
},
{
"token" : "提供",
"start_offset" : 1,
"end_offset" : 3,
"type" : "CN_WORD",
"position" : 1
},
{
"token" : "了",
"start_offset" : 3,
"end_offset" : 4,
"type" : "CN_CHAR",
"position" : 2
},
{
"token" : "一個",
"start_offset" : 4,
"end_offset" : 6,
"type" : "CN_WORD",
"position" : 3
},
{
"token" : "一",
"start_offset" : 4,
"end_offset" : 5,
"type" : "TYPE_CNUM",
"position" : 4
},
{
"token" : "個",
"start_offset" : 5,
"end_offset" : 6,
"type" : "COUNT",
"position" : 5
},
{
"token" : "分佈式",
"start_offset" : 6,
"end_offset" : 9,
"type" : "CN_WORD",
"position" : 6
},
{
"token" : "分佈",
"start_offset" : 6,
"end_offset" : 8,
"type" : "CN_WORD",
"position" : 7
},
{
"token" : "式",
"start_offset" : 8,
"end_offset" : 9,
"type" : "CN_CHAR",
"position" : 8
},
{
"token" : "多用戶",
"start_offset" : 9,
"end_offset" : 12,
"type" : "CN_WORD",
"position" : 9
},
{
"token" : "多用",
"start_offset" : 9,
"end_offset" : 11,
"type" : "CN_WORD",
"position" : 10
},
{
"token" : "用戶",
"start_offset" : 10,
"end_offset" : 12,
"type" : "CN_WORD",
"position" : 11
},
{
"token" : "能力",
"start_offset" : 12,
"end_offset" : 14,
"type" : "CN_WORD",
"position" : 12
},
{
"token" : "的",
"start_offset" : 14,
"end_offset" : 15,
"type" : "CN_CHAR",
"position" : 13
},
{
"token" : "全文",
"start_offset" : 15,
"end_offset" : 17,
"type" : "CN_WORD",
"position" : 14
},
{
"token" : "搜索引擎",
"start_offset" : 17,
"end_offset" : 21,
"type" : "CN_WORD",
"position" : 15
},
{
"token" : "搜索",
"start_offset" : 17,
"end_offset" : 19,
"type" : "CN_WORD",
"position" : 16
},
{
"token" : "索引",
"start_offset" : 18,
"end_offset" : 20,
"type" : "CN_WORD",
"position" : 17
},
{
"token" : "引擎",
"start_offset" : 19,
"end_offset" : 21,
"type" : "CN_WORD",
"position" : 18
},
{
"token" : "基於",
"start_offset" : 22,
"end_offset" : 24,
"type" : "CN_WORD",
"position" : 19
},
{
"token" : "restful",
"start_offset" : 24,
"end_offset" : 31,
"type" : "ENGLISH",
"position" : 20
},
{
"token" : "web",
"start_offset" : 32,
"end_offset" : 35,
"type" : "ENGLISH",
"position" : 21
},
{
"token" : "接口",
"start_offset" : 35,
"end_offset" : 37,
"type" : "CN_WORD",
"position" : 22
}
]
}
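The two curl commands above pass the analyzer as a query parameter and the text as a raw request body. The _analyze API also accepts a JSON body with analyzer and text fields, and as far as I know that is the only form later Elasticsearch versions accept. The sketch below builds such a body and posts it using only JDK classes. The class name AnalyzeRequestDemo and its helper names are hypothetical, the host master:9200 is taken from the commands above, and the analyze method has not been run against a live cluster here.

```java
import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class AnalyzeRequestDemo {
    // Escape the characters that would break a JSON string literal.
    static String jsonEscape(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            switch (c) {
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n");  break;
                case '\r': sb.append("\\r");  break;
                case '\t': sb.append("\\t");  break;
                default:
                    if (c < 0x20) {
                        sb.append(String.format("\\u%04x", (int) c));
                    } else {
                        sb.append(c);
                    }
            }
        }
        return sb.toString();
    }

    // Build the JSON request body for the _analyze endpoint.
    static String buildBody(String analyzer, String text) {
        return "{\"analyzer\":\"" + jsonEscape(analyzer)
                + "\",\"text\":\"" + jsonEscape(text) + "\"}";
    }

    // POST the body to <baseUrl>/_analyze and return the raw JSON response.
    static String analyze(String baseUrl, String analyzer, String text) throws Exception {
        URL url = new URL(baseUrl + "/_analyze?pretty");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("POST");
        conn.setRequestProperty("Content-Type", "application/json; charset=UTF-8");
        conn.setDoOutput(true);
        try (OutputStream out = conn.getOutputStream()) {
            out.write(buildBody(analyzer, text).getBytes(StandardCharsets.UTF_8));
        }
        try (InputStream in = conn.getInputStream()) {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            byte[] chunk = new byte[4096];
            int n;
            while ((n = in.read(chunk)) != -1) {
                buf.write(chunk, 0, n);
            }
            return new String(buf.toByteArray(), StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) {
        // Only prints the request body; no network access happens here.
        System.out.println(buildBody("ik_max_word", "全文搜索引擎"));
    }
}
```

To send the request, call analyze("http://master:9200", "ik_max_word", text) while the node is running. Building the body separately keeps the escaping logic checkable without a cluster.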
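Java API test
import org.elasticsearch.action.admin.indices.mapping.put.PutMappingRequest;
import org.elasticsearch.client.Requests;
import org.elasticsearch.common.xcontent.XContentBuilder;
import org.elasticsearch.common.xcontent.XContentFactory;
import org.junit.Test;

/**
 * Create a mapping that uses the IK analyzer.
 * Assumes that client is an already connected client field of the test class
 * and that the index blog2 already exists.
 *
 * @throws Exception
 */
@Test
public void createMappingIk() throws Exception {
    // 1. Build the mapping
    XContentBuilder builder = XContentFactory.jsonBuilder()
            .startObject()
                .startObject("article")
                    .startObject("properties")
                        .startObject("id1")
                            .field("type", "text") // "string" is deprecated in Elasticsearch 5.x
                            .field("store", true)
                            .field("analyzer", "ik_smart") // set the analyzer
                        .endObject()
                        .startObject("title2")
                            .field("type", "text")
                            .field("store", false)
                            .field("analyzer", "ik_smart") // set the analyzer
                        .endObject()
                        .startObject("content")
                            .field("type", "text")
                            .field("store", true)
                            .field("analyzer", "ik_smart") // set the analyzer
                        .endObject()
                    .endObject()
                .endObject()
            .endObject();
    // 2. Add the mapping to type "article" of index "blog2"
    PutMappingRequest mapping = Requests.putMappingRequest("blog2").type("article").source(builder);
    client.admin().indices().putMapping(mapping).get();
    // 3. Release resources
    client.close();
}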