Elasticsearch IK Analyzer

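Elasticsearch's built-in analyzers segment Chinese text poorly, so the IK analyzer can be used to do the segmentation instead.
The IK analyzer provides two analyzers:
ik_max_word: splits text at the finest granularity. For example, “中華人民共和國國歌” is split into “中華人民共和國,中華人民,中華,華人,人民共和國,人民,人,民,共和國,共和,和,國國,國歌”, exhausting every possible combination;

ik_smart: splits text at the coarsest granularity. For example, “中華人民共和國國歌” is split into “中華人民共和國,國歌”.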
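
To make the difference between the two granularities concrete, here is a toy Java sketch. It is not IK's actual algorithm: it assumes a tiny hypothetical dictionary (`DICT`), treats "finest granularity" as emitting every dictionary word found anywhere in the text, and treats "coarsest granularity" as forward maximum matching. The class and method names are made up for illustration, and the fine-grained output is shorter than IK's real list because the dictionary here is so small.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class GranularityDemo {
    // Hypothetical mini-dictionary; the real IK dictionary is far larger.
    static final Set<String> DICT = new HashSet<>(Arrays.asList(
            "中華人民共和國", "中華人民", "中華", "華人",
            "人民共和國", "人民", "共和國", "共和", "國歌"));

    // Fine-grained: emit every substring that is a dictionary word,
    // ordered by start position, longest first.
    static List<String> fineGrained(String text) {
        List<String> out = new ArrayList<>();
        for (int start = 0; start < text.length(); start++) {
            for (int end = text.length(); end > start; end--) {
                String candidate = text.substring(start, end);
                if (DICT.contains(candidate)) {
                    out.add(candidate);
                }
            }
        }
        return out;
    }

    // Coarse-grained: at each position take the longest dictionary word,
    // falling back to a single character when nothing matches.
    static List<String> coarseGrained(String text) {
        List<String> out = new ArrayList<>();
        int start = 0;
        while (start < text.length()) {
            int end = text.length();
            while (end > start + 1 && !DICT.contains(text.substring(start, end))) {
                end--;
            }
            out.add(text.substring(start, end));
            start = end;
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "中華人民共和國國歌";
        System.out.println(fineGrained(text));
        System.out.println(coarseGrained(text));
    }
}
```

With this dictionary, the first line printed is [中華人民共和國, 中華人民, 中華, 華人, 人民共和國, 人民, 共和國, 共和, 國歌] and the second is [中華人民共和國, 國歌], which mirrors the overlapping versus non-overlapping behaviour described above.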

1. Install Maven

Upload the Maven archive to the master node and extract it into /opt/module/.

2. Configure the settings.xml file

<?xml version="1.0" encoding="UTF-8"?>
<settings xmlns="http://maven.apache.org/SETTINGS/1.0.0"
          xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
          xsi:schemaLocation="http://maven.apache.org/SETTINGS/1.0.0 http://maven.apache.org/xsd/settings-1.0.0.xsd">
  <!-- Local repository path -->
  <localRepository>/opt/module/apache-maven-3.0.5/repository</localRepository>
  <pluginGroups>  </pluginGroups>
  <proxies>  </proxies>
  <servers>  </servers>
  <mirrors>
    <mirror>
      <id>nexus-aliyun</id>
      <name>Nexus aliyun</name>
      <url>http://maven.aliyun.com/nexus/content/groups/public/</url>
      <mirrorOf>central</mirrorOf>
    </mirror>
    <mirror>
      <id>repo2</id>
      <name>Mirror from Maven Repo2</name>
      <url>http://repo2.maven.org/maven2/</url>
      <mirrorOf>central</mirrorOf>
    </mirror>
    <mirror>
      <id>centor</id>
      <name>Mirror from Maven central</name>
      <url>http://central.maven.org/maven2/</url>
      <mirrorOf>central</mirrorOf>
    </mirror>
  </mirrors>
  <profiles>
    <profile>
      <id>jdk-1.8</id>
      <activation>
        <activeByDefault>true</activeByDefault>
        <jdk>1.8</jdk>
      </activation>

      <properties>
        <maven.compiler.source>1.8</maven.compiler.source>
        <maven.compiler.target>1.8</maven.compiler.target>
        <maven.compiler.compilerVersion>1.8</maven.compiler.compilerVersion>
      </properties>
    </profile>
  </profiles>
</settings> 

3. Create the repository directory

[dendan@master apache-maven-3.0.5]$ mkdir repository

4. Edit /etc/profile

Edit the file as the root user:

vi /etc/profile

Add the following:

# maven
export MAVEN_HOME=/opt/module/apache-maven-3.0.5
export PATH=$PATH:$MAVEN_HOME/bin

Reload the profile:

source /etc/profile

5. Unzip the IK analyzer

[dendan@master software]$ unzip elasticsearch-analysis-ik-master.zip

6. Compile and package

[dendan@master software]$ cd elasticsearch-analysis-ik-master
[dendan@master elasticsearch-analysis-ik-master]$ mvn package -Pdist,native -DskipTests -Dtar

7. Unzip and copy the packaged files

[dendan@master releases]$ pwd
/opt/software/elasticsearch-analysis-ik-master/target/releases
[dendan@master releases]$ ll
total 4400
-rw-rw-r--. 1 dendan dendan 4502368 Jun 24 10:21 elasticsearch-analysis-ik-5.6.1.zip
[dendan@master releases]$ unzip elasticsearch-analysis-ik-5.6.1.zip
[dendan@master releases]$ ll
total 4404
drwxrwxrwx. 3 dendan dendan    4096 Jun 24 10:21 elasticsearch

8. Install the IK analyzer plugin

Copy the unzipped elasticsearch directory into the plugins directory of the Elasticsearch installation.

[dendan@master releases]$ cp -r elasticsearch /opt/module/elasticsearch-5.6.1/plugins/

9. Start Elasticsearch

[dendan@master elasticsearch-5.6.1]$ bin/elasticsearch

The startup log should contain a line like:

[2020-06-24T10:50:12,762][INFO ][o.e.p.PluginsService     ] [node-111] loaded plugin [analysis-ik]

10. Test the IK analyzer

smart mode:

This mode splits a sentence into as few tokens as possible.

[dendan@master elasticsearch-5.6.1]$ curl -XGET 'http://master:9200/_analyze?pretty&analyzer=ik_smart' -d '它提供了一個分佈式多用戶能力的全文搜索引擎,基於RESTful web接口'

The result is:

{
  "tokens" : [
    {
      "token" : "它",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "CN_CHAR",
      "position" : 0
    },
    {
      "token" : "提供",
      "start_offset" : 1,
      "end_offset" : 3,
      "type" : "CN_WORD",
      "position" : 1
    },
    {
      "token" : "了",
      "start_offset" : 3,
      "end_offset" : 4,
      "type" : "CN_CHAR",
      "position" : 2
    },
    {
      "token" : "一個",
      "start_offset" : 4,
      "end_offset" : 6,
      "type" : "CN_WORD",
      "position" : 3
    },
    {
      "token" : "分佈式",
      "start_offset" : 6,
      "end_offset" : 9,
      "type" : "CN_WORD",
      "position" : 4
    },
    {
      "token" : "多用戶",
      "start_offset" : 9,
      "end_offset" : 12,
      "type" : "CN_WORD",
      "position" : 5
    },
    {
      "token" : "能力",
      "start_offset" : 12,
      "end_offset" : 14,
      "type" : "CN_WORD",
      "position" : 6
    },
    {
      "token" : "的",
      "start_offset" : 14,
      "end_offset" : 15,
      "type" : "CN_CHAR",
      "position" : 7
    },
    {
      "token" : "全文",
      "start_offset" : 15,
      "end_offset" : 17,
      "type" : "CN_WORD",
      "position" : 8
    },
    {
      "token" : "搜索引擎",
      "start_offset" : 17,
      "end_offset" : 21,
      "type" : "CN_WORD",
      "position" : 9
    },
    {
      "token" : "基於",
      "start_offset" : 22,
      "end_offset" : 24,
      "type" : "CN_WORD",
      "position" : 10
    },
    {
      "token" : "restful",
      "start_offset" : 24,
      "end_offset" : 31,
      "type" : "ENGLISH",
      "position" : 11
    },
    {
      "token" : "web",
      "start_offset" : 32,
      "end_offset" : 35,
      "type" : "ENGLISH",
      "position" : 12
    },
    {
      "token" : "接口",
      "start_offset" : 35,
      "end_offset" : 37,
      "type" : "CN_WORD",
      "position" : 13
    }
  ]
}
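
The same analysis can also be requested from Java. The sketch below assumes the `_analyze` endpoint accepts a POST with a JSON body containing `analyzer` and `text`, and uses only `HttpURLConnection` from the JDK. The class name `AnalyzeRequest` and its methods are hypothetical, and the JSON escaping covers only quotes and backslashes.

```java
import java.io.InputStream;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.Scanner;

class AnalyzeRequest {
    // Build the JSON request body, escaping backslashes and double quotes.
    // Control characters are not handled in this sketch.
    static String buildBody(String analyzer, String text) {
        String escaped = text.replace("\\", "\\\\").replace("\"", "\\\"");
        return "{\"analyzer\":\"" + analyzer + "\",\"text\":\"" + escaped + "\"}";
    }

    // POST the body to <baseUrl>/_analyze and return the raw response text.
    static String analyze(String baseUrl, String analyzer, String text) throws Exception {
        URL url = new URL(baseUrl + "/_analyze?pretty");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("POST");
        conn.setRequestProperty("Content-Type", "application/json; charset=UTF-8");
        conn.setDoOutput(true);
        try (OutputStream out = conn.getOutputStream()) {
            out.write(buildBody(analyzer, text).getBytes(StandardCharsets.UTF_8));
        }
        try (InputStream in = conn.getInputStream();
             Scanner scanner = new Scanner(in, "UTF-8").useDelimiter("\\A")) {
            return scanner.hasNext() ? scanner.next() : "";
        }
    }

    public static void main(String[] args) throws Exception {
        String text = "它提供了一個分佈式多用戶能力的全文搜索引擎,基於RESTful web接口";
        String[] analyzers = {"ik_smart", "ik_max_word"};
        for (String analyzer : analyzers) {
            if (args.length == 0) {
                // No server given: only show the body that would be sent.
                System.out.println(buildBody(analyzer, text));
            } else {
                // args[0] is the base URL, e.g. http://master:9200
                System.out.println(analyze(args[0], analyzer, text));
            }
        }
    }
}
```

Run without arguments, the program only prints the request bodies; passing the base URL of a running node as the first argument sends the requests.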

max_word mode:

This mode splits a sentence into as many tokens as possible.

[dendan@master elasticsearch-5.6.1]$ curl -XGET 'http://master:9200/_analyze?pretty&analyzer=ik_max_word' -d '它提供了一個分佈式多用戶能力的全文搜索引擎,基於RESTful web接口'
{
  "tokens" : [
    {
      "token" : "它",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "CN_CHAR",
      "position" : 0
    },
    {
      "token" : "提供",
      "start_offset" : 1,
      "end_offset" : 3,
      "type" : "CN_WORD",
      "position" : 1
    },
    {
      "token" : "了",
      "start_offset" : 3,
      "end_offset" : 4,
      "type" : "CN_CHAR",
      "position" : 2
    },
    {
      "token" : "一個",
      "start_offset" : 4,
      "end_offset" : 6,
      "type" : "CN_WORD",
      "position" : 3
    },
    {
      "token" : "一",
      "start_offset" : 4,
      "end_offset" : 5,
      "type" : "TYPE_CNUM",
      "position" : 4
    },
    {
      "token" : "個",
      "start_offset" : 5,
      "end_offset" : 6,
      "type" : "COUNT",
      "position" : 5
    },
    {
      "token" : "分佈式",
      "start_offset" : 6,
      "end_offset" : 9,
      "type" : "CN_WORD",
      "position" : 6
    },
    {
      "token" : "分佈",
      "start_offset" : 6,
      "end_offset" : 8,
      "type" : "CN_WORD",
      "position" : 7
    },
    {
      "token" : "式",
      "start_offset" : 8,
      "end_offset" : 9,
      "type" : "CN_CHAR",
      "position" : 8
    },
    {
      "token" : "多用戶",
      "start_offset" : 9,
      "end_offset" : 12,
      "type" : "CN_WORD",
      "position" : 9
    },
    {
      "token" : "多用",
      "start_offset" : 9,
      "end_offset" : 11,
      "type" : "CN_WORD",
      "position" : 10
    },
    {
      "token" : "用戶",
      "start_offset" : 10,
      "end_offset" : 12,
      "type" : "CN_WORD",
      "position" : 11
    },
    {
      "token" : "能力",
      "start_offset" : 12,
      "end_offset" : 14,
      "type" : "CN_WORD",
      "position" : 12
    },
    {
      "token" : "的",
      "start_offset" : 14,
      "end_offset" : 15,
      "type" : "CN_CHAR",
      "position" : 13
    },
    {
      "token" : "全文",
      "start_offset" : 15,
      "end_offset" : 17,
      "type" : "CN_WORD",
      "position" : 14
    },
    {
      "token" : "搜索引擎",
      "start_offset" : 17,
      "end_offset" : 21,
      "type" : "CN_WORD",
      "position" : 15
    },
    {
      "token" : "搜索",
      "start_offset" : 17,
      "end_offset" : 19,
      "type" : "CN_WORD",
      "position" : 16
    },
    {
      "token" : "索引",
      "start_offset" : 18,
      "end_offset" : 20,
      "type" : "CN_WORD",
      "position" : 17
    },
    {
      "token" : "引擎",
      "start_offset" : 19,
      "end_offset" : 21,
      "type" : "CN_WORD",
      "position" : 18
    },
    {
      "token" : "基於",
      "start_offset" : 22,
      "end_offset" : 24,
      "type" : "CN_WORD",
      "position" : 19
    },
    {
      "token" : "restful",
      "start_offset" : 24,
      "end_offset" : 31,
      "type" : "ENGLISH",
      "position" : 20
    },
    {
      "token" : "web",
      "start_offset" : 32,
      "end_offset" : 35,
      "type" : "ENGLISH",
      "position" : 21
    },
    {
      "token" : "接口",
      "start_offset" : 35,
      "end_offset" : 37,
      "type" : "CN_WORD",
      "position" : 22
    }
  ]
}
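
In the outputs above, ik_smart produced 14 tokens (positions 0 to 13) and ik_max_word produced 23 (positions 0 to 22) for the same sentence. To compare the two modes programmatically, the token strings can be pulled out of a saved `_analyze` response. The following is a minimal sketch that uses a regular expression rather than a real JSON parser; the class name `TokenExtractor` is hypothetical, and tokens containing escaped quotes are not handled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TokenExtractor {
    // Matches "token" : "<value>" but not the enclosing "tokens" key.
    private static final Pattern TOKEN =
            Pattern.compile("\"token\"\\s*:\\s*\"([^\"]*)\"");

    // Return the token strings in the order they appear in the response.
    static List<String> extractTokens(String json) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(json);
        while (m.find()) {
            tokens.add(m.group(1));
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Shortened sample in the same shape as the responses above.
        String sample = "{ \"tokens\" : [ "
                + "{ \"token\" : \"它\", \"position\" : 0 }, "
                + "{ \"token\" : \"提供\", \"position\" : 1 } ] }";
        List<String> tokens = extractTokens(sample);
        System.out.println(tokens.size() + " tokens: " + tokens);
    }
}
```

Feeding it the two full responses would give lists whose sizes can be compared directly.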

Java API test

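// Imports required at the top of the test class
import org.elasticsearch.action.admin.indices.mapping.put.PutMappingRequest;
import org.elasticsearch.client.Requests;
import org.elasticsearch.common.xcontent.XContentBuilder;
import org.elasticsearch.common.xcontent.XContentFactory;
import org.junit.Test;

    /**
     * Creates a mapping whose fields use the IK analyzer.
     * Assumes `client` is an already-connected client field of the test class
     * and that the index "blog2" already exists.
     *
     * @throws Exception if the mapping request fails
     */
    @Test
    public void createMappingIk() throws Exception {

        // 1. Define the mapping ("text" replaces the "string" type deprecated in Elasticsearch 5.x)
        XContentBuilder builder = XContentFactory.jsonBuilder()
                .startObject()
                .startObject("article")
                .startObject("properties")
                .startObject("id1")
                .field("type", "text")
                .field("store", true)
                .field("analyzer", "ik_smart") // set the analyzer
                .endObject()
                .startObject("title2")
                .field("type", "text")
                .field("store", false)
                .field("analyzer", "ik_smart") // set the analyzer
                .endObject()
                .startObject("content")
                .field("type", "text")
                .field("store", true)
                .field("analyzer", "ik_smart") // set the analyzer
                .endObject()
                .endObject()
                .endObject()
                .endObject();

        // 2. Add the mapping to type "article" of index "blog2"
        PutMappingRequest mapping = Requests.putMappingRequest("blog2").type("article").source(builder);

        client.admin().indices().putMapping(mapping).get();

        // 3. Release resources
        client.close();
    }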
