[Elasticsearch] Installing and Using the ik Chinese Analyzer

Original

2020-02-23 15:21

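Preface

The analyzer that Elasticsearch provides by default splits Chinese text into individual characters instead of segmenting it into the words we actually want. For example: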

curl -XPOST  "http://localhost:9200/test/_analyze?analyzer=standard&pretty=true&text=我是中国人"

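We get a result like the following. (The leading "text" token and the offsets starting at 9 suggest this output was captured from a request that carried the text in a JSON body, so the exact offsets may differ from what the command above returns.)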

{  
    tokens: [  
        {  
            token: text  
            start_offset: 2  
            end_offset: 6  
            type: <ALPHANUM>  
            position: 1  
        },
        {  
            token: 我  
            start_offset: 9  
            end_offset: 10  
            type: <IDEOGRAPHIC>  
            position: 2  
        },
        {  
            token: 是  
            start_offset: 10  
            end_offset: 11  
            type: <IDEOGRAPHIC>  
            position: 3  
        },
        {  
            token: 中  
            start_offset: 11  
            end_offset: 12  
            type: <IDEOGRAPHIC>  
            position: 4  
        },
        {  
            token: 国  
            start_offset: 12  
            end_offset: 13  
            type: <IDEOGRAPHIC>  
            position: 5  
        },
        {  
            token: 人  
            start_offset: 13  
            end_offset: 14  
            type: <IDEOGRAPHIC>  
            position: 6  
        }  
    ]  
}

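Normally this is not the result we want. We would prefer tokens such as “中国人”, “中国”, and “我”. To get them we need to install a Chinese analysis plugin, and ik is one that provides this capability.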
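To send the same request from a script rather than curl, the query string can be assembled as below. This is a minimal sketch, assuming the query-parameter form of `_analyze` used in this post; the helper name `build_analyze_url` and its default arguments are illustrative and not part of any Elasticsearch client. Percent-encoding the Chinese text avoids relying on how a shell or HTTP client handles raw non-ASCII characters in a URL.

```python
from urllib.parse import urlencode


def build_analyze_url(text, analyzer="standard", index="test",
                      host="http://localhost:9200"):
    """Build the _analyze URL used in this post (hypothetical helper)."""
    # urlencode percent-encodes the UTF-8 bytes of the Chinese text.
    query = urlencode({"analyzer": analyzer, "pretty": "true", "text": text})
    return f"{host}/{index}/_analyze?{query}"


if __name__ == "__main__":
    print(build_analyze_url("我是中国人"))
```

Passing `analyzer="ik"` produces the URL for the ik test shown later in this post.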

Installation

elasticsearch-analysis-ik is a Chinese analysis plugin that supports custom dictionaries.
Installation steps:

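1. Download the source code from GitHub: https://github.com/medcl/elasticsearch-analysis-ik
   master is the latest version; choose a tag if you want a released version.
   Click the "Download ZIP" button at the lower right to download the source archive elasticsearch-analysis-ik-master.zip.
2. Go to the download directory and unpack elasticsearch-analysis-ik-master.zip:
unzip elasticsearch-analysis-ik-master.zip
3. Copy the config/ik folder from the unpacked directory into the config folder of the ES installation directory.
4. Because this is source code, it has to be built with Maven. Enter the unpacked folder and run:
mvn clean package
5. Copy the jar produced by the build, elasticsearch-analysis-ik-1.2.8-sources.jar, into the lib directory of the ES installation directory. A -sources jar normally contains only source files, so the compiled elasticsearch-analysis-ik-1.2.8.jar is probably the one intended.
6. Add the ik configuration to the end of the ES configuration file config/elasticsearch.yml: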

index:
  analysis:
    analyzer:
      ik:
        alias: [ik_analyzer]
        type: org.elasticsearch.index.analysis.IkAnalyzerProvider
      ik_max_word:
        type: ik
        use_smart: false
      ik_smart:
        type: ik
        use_smart: true

or

index.analysis.analyzer.ik.type : "ik"
7. Restart the Elasticsearch service to complete the configuration, then run the following command:
curl -XPOST "http://localhost:9200/test/_analyze?analyzer=ik&pretty=true&text=我是中国人"
The test result is as follows:

{  
    tokens: [  
        {  
            token: text  
            start_offset: 2  
            end_offset: 6  
            type: ENGLISH  
            position: 1  
        },
        {  
            token: 我  
            start_offset: 9  
            end_offset: 10  
            type: CN_CHAR  
            position: 2  
        },
        {  
            token: 中国人  
            start_offset: 11  
            end_offset: 14  
            type: CN_WORD  
            position: 3  
        },
        {  
            token: 中国  
            start_offset: 11  
            end_offset: 13  
            type: CN_WORD  
            position: 4  
        },
        {  
            token: 国人  
            start_offset: 12  
            end_offset: 14  
            type: CN_WORD  
            position: 5  
        }  
    ]  
}
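To compare analyzers programmatically, the token strings can be pulled out of the response. This is a minimal sketch, assuming the server returns well-formed JSON with a top-level `tokens` array whose entries carry `token`, `start_offset`, and `end_offset` fields as in the listing above. `token_spans` is a hypothetical helper name, and the sample payload is hand-written from the output above, not captured from a server.

```python
import json


def token_spans(response_body):
    """Return (token, start_offset, end_offset) tuples from an _analyze response."""
    data = json.loads(response_body)
    return [(t["token"], t["start_offset"], t["end_offset"])
            for t in data.get("tokens", [])]


# Hand-written sample mirroring part of the ik output above (assumption:
# the real response is valid JSON with quoted keys and string values).
sample = json.dumps({"tokens": [
    {"token": "我", "start_offset": 9, "end_offset": 10,
     "type": "CN_CHAR", "position": 2},
    {"token": "中国人", "start_offset": 11, "end_offset": 14,
     "type": "CN_WORD", "position": 3},
    {"token": "中国", "start_offset": 11, "end_offset": 13,
     "type": "CN_WORD", "position": 4},
]})

if __name__ == "__main__":
    for token, start, end in token_spans(sample):
        print(token, start, end)
```

Overlapping spans such as 11 to 14 and 11 to 13 show that ik emitted both “中国人” and the shorter word “中国” contained in it.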

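Notes:

ES plugins are normally installed with the plugin command, but installing ik that way kept failing on my machine, so I built and installed it from source instead.
For how to set up a custom dictionary, see https://github.com/medcl/elasticsearch-analysis-ik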
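The manual copy steps from the installation section can also be scripted. This is a rough sketch, assuming the layout described in this post: a config/ik folder in the unpacked source tree and a built jar that belongs in the lib directory of the ES installation. `install_ik` and every path passed to it are hypothetical, the Maven build must already have been run, and `dirs_exist_ok` requires Python 3.8 or later.

```python
import shutil
from pathlib import Path


def install_ik(source_dir, es_home, jar_path):
    """Copy the ik config folder and the built jar into an ES install (sketch)."""
    source_dir, es_home, jar_path = Path(source_dir), Path(es_home), Path(jar_path)

    # Config copy: <source>/config/ik -> <ES home>/config/ik
    config_dst = es_home / "config" / "ik"
    shutil.copytree(source_dir / "config" / "ik", config_dst, dirs_exist_ok=True)

    # Jar copy: built jar -> <ES home>/lib
    lib_dir = es_home / "lib"
    lib_dir.mkdir(parents=True, exist_ok=True)
    jar_dst = lib_dir / jar_path.name
    shutil.copy2(jar_path, jar_dst)
    return config_dst, jar_dst
```

The elasticsearch.yml change and the service restart are left as manual steps, since they depend on how ES is run on the machine.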

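Note:

target is the output directory for the jar, and the release directory is the output directory for the ik jar together with its dependency jars. If the ik dependency jars are not included, the following error occurs:
nested: NoClassDefFoundError[org/apache/http/client/ClientProtocolException]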
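One way to check whether the missing class is actually available is to scan the jars in the ES lib directory. This is a minimal sketch, assuming all jars sit in a single directory; `find_class_in_jars` is a hypothetical helper. A jar is a zip archive, so the standard `zipfile` module can list its entries, and the class from the error above would appear as the entry org/apache/http/client/ClientProtocolException.class.

```python
import zipfile
from pathlib import Path

MISSING_CLASS = "org/apache/http/client/ClientProtocolException"


def find_class_in_jars(lib_dir, class_name=MISSING_CLASS):
    """Return the names of jars under lib_dir that contain class_name."""
    entry = class_name + ".class"
    hits = []
    for jar in sorted(Path(lib_dir).glob("*.jar")):
        try:
            with zipfile.ZipFile(jar) as archive:
                if entry in archive.namelist():
                    hits.append(jar.name)
        except zipfile.BadZipFile:
            # Skip files that are named *.jar but are not valid archives.
            continue
    return hits
```

An empty result for the lib directory suggests that the dependency jars from the release directory were not copied.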
