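Integrating HanLP Chinese Word Segmentation into Solr Full-Text Search

I previously released a Lucene plugin for HanLP. Since then, many people have told me that Solr is actually more popular. (My own view was that, since Solr is a Lucene subproject, a few configuration tweaks would be enough to support it.) So I found some time to build a Solr plugin. It is open source on GitHub, and improvements are welcome.

The HanLP Chinese word segmentation Solr plugin supports Solr 5.x and is compatible with Lucene 5.x.

Figure 1

Quick Start

1. Put the two jars, hanlp-portable.jar and hanlp-solr-plugin.jar, into ${webapp}/WEB-INF/lib.

2. Edit the Solr core's configuration file ${core}/conf/schema.xml: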

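    <fieldType name="text_cn" class="solr.TextField">
        <analyzer type="index">
            <tokenizer class="com.hankcs.lucene.HanLPTokenizerFactory" enableIndexMode="true"/>
        </analyzer>
        <analyzer type="query">
            <tokenizer class="com.hankcs.lucene.HanLPTokenizerFactory" enableIndexMode="false"/>
        </analyzer>
    </fieldType>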

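Detailed Configuration of the Chinese Tokenizer in Solr 5

For newcomers, the two steps above may be too terse, so here is a step-by-step walkthrough. This tutorial uses Solr 5.2.1 and should in theory work with any Solr 5.x release.

Placing the jars

Put the two jars mentioned above into the solr-5.2.1/server/solr-webapp/webapp/WEB-INF/lib directory. If you want to customize dictionaries or other data, put hanlp.properties into solr-5.2.1/server/resources, which is also where configuration files such as log4j.properties live. When the HanLP documentation says "put the configuration file in the resources directory", this is what it means: the file has to end up on the classpath, which is basic knowledge for a Java programmer.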
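If you are unsure whether hanlp.properties is actually visible on a classpath, a small standalone check can help. The following is only a sketch: the class name ClasspathConfigCheck is made up for illustration, and it inspects the classpath of the JVM it runs in, not Solr's own. To mimic Solr's setup you would launch it with the resources directory on the classpath, for example java -cp solr-5.2.1/server/resources:. ClasspathConfigCheck (use ; instead of : on Windows).

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.Optional;
import java.util.Properties;

// Hypothetical helper, not part of HanLP or Solr.
public class ClasspathConfigCheck {

    /** Loads a properties file from the classpath, or returns empty if it is not visible. */
    static Optional<Properties> loadFromClasspath(String resourceName) {
        ClassLoader loader = Thread.currentThread().getContextClassLoader();
        if (loader == null) {
            loader = ClasspathConfigCheck.class.getClassLoader();
        }
        try (InputStream in = loader.getResourceAsStream(resourceName)) {
            if (in == null) {
                return Optional.empty(); // resource is not on the classpath
            }
            Properties props = new Properties();
            props.load(in);
            return Optional.of(props);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Optional<Properties> props = loadFromClasspath("hanlp.properties");
        if (props.isPresent()) {
            System.out.println("hanlp.properties found, keys: " + props.get().stringPropertyNames());
        } else {
            System.out.println("hanlp.properties is NOT on the classpath");
        }
    }
}
```

If the second message is printed, the file is in a directory that the JVM does not treat as a classpath root.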

Starting Solr

First, start Solr from the solr-5.2.1\bin directory:

    solr start -f

Open http://localhost:8983/solr/#/ in a browser. If you see the following page, everything is working:

Figure 2

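Creating a core

Create a new directory under solr-5.2.1\server\solr and give it a name, for example one. Copy the sample configuration directory solr-5.2.1\server\solr\configsets\sample_techproducts_configs\conf into it, then change the default field type in schema.xml. Search for

    <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
        ...
    </fieldType>

and replace it with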

    <!-- Default text type: use the HanLP tokenizer, with index mode enabled at index time.
         Stop words are filtered by Solr's built-in stop filter using "stopwords.txt" (empty by default).
         At query time, Solr's built-in synonym dictionary is also supported. -->
    <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
        <analyzer type="index">
            <tokenizer class="com.hankcs.lucene.HanLPTokenizerFactory" enableIndexMode="true"/>
            <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
            <!-- Uncomment to enable the synonym dictionary at index time
            <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
            -->
            <filter class="solr.LowerCaseFilterFactory"/>
        </analyzer>
        <analyzer type="query">
            <tokenizer class="com.hankcs.lucene.HanLPTokenizerFactory" enableIndexMode="false"/>
            <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
            <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
            <filter class="solr.LowerCaseFilterFactory"/>
        </analyzer>
    </fieldType>

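This makes the default text field type use the HanLP tokenizer; text_general also keeps Solr's default filters enabled.

Solr lets you assign a different tokenizer to each field. Because the vast majority of fields are of type text_general, changing that one type is the approach best suited to newcomers. Experienced Solr users may prefer to configure the tokenizer and other settings separately for each field. If your application has other fields, such as location or summary, you must also set type="text_general" on each of them. Do not forget this; otherwise those fields will still use Solr's default tokenizer, and searches on them will find nothing.

Also, never enable indexMode in the query analyzer, or it will interfere with PhraseQuery. indexMode only needs to be enabled once, at index time; that is why it is called indexMode.

If you do not need Solr's stop-word, synonym, and other filters, the following configuration may suit you better: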

    <fieldType name="text_cn" class="solr.TextField">
        <analyzer type="index">
            <tokenizer class="com.hankcs.lucene.HanLPTokenizerFactory" enableIndexMode="true"/>
        </analyzer>
        <analyzer type="query">
            <tokenizer class="com.hankcs.lucene.HanLPTokenizerFactory" enableIndexMode="false"/>
        </analyzer>
    </fieldType>

    <field name="my_field1" type="text_cn" indexed="true" stored="true"/>
    <field name="my_field2" type="text_cn" indexed="true" stored="true"/>

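When that is done, import the core one in the Solr admin UI:

Figure 3

The core then appears in the drop-down list:

Figure 4

Uploading test documents

With the configuration in place, you can try it out on some test documents. The hanlp-solr-plugin repository has a test document collection, documents.csv, under src/test/resources, with the following content: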

    id,title
    1,你好世界
    2,商品和服務
    3,和服的價格是每鎊15便士
    4,服務大衆
    5,hanlp工作正常

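These are five documents with ids 1 through 5. Next, copy the upload tool post.jar from solr-5.2.1\example\exampledocs into the resources directory and import the data with this command:

    java -Dc=one -Dtype=application/csv -jar post.jar *.csv

Windows users can simply double-click upload.cmd in that directory; Linux users can run upload.sh.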

If all goes well, the output looks like this:

    SimplePostTool version 5.0.0
    Posting files to [base] url http://localhost:8983/solr/one/update using content-type application/csv...
    POSTing file documents.csv to [base]
    1 files indexed.
    COMMITting Solr index changes to http://localhost:8983/solr/one/update...
    Time spent: 0:00:00.059
    Press any key to continue . . .
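post.jar is essentially sending an HTTP POST to the core's update URL shown in the log above. If you would rather upload from your own program, the same request could be sketched roughly as follows. This is an illustration, not the plugin's tooling: the class name SolrCsvUpload is made up, it assumes Java 11 or later for java.net.http (the client can run on a newer JVM than Solr itself), and it assumes documents.csv sits in the working directory. The commit=true parameter asks Solr to commit right after the update.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical uploader; a stand-in for post.jar, not a replacement for it.
public class SolrCsvUpload {

    /** Builds a POST request that sends CSV text to a core's update handler and commits. */
    static HttpRequest buildRequest(String baseUrl, String core, String csvBody) {
        URI uri = URI.create(baseUrl + "/" + core + "/update?commit=true");
        return HttpRequest.newBuilder(uri)
                .header("Content-Type", "application/csv; charset=utf-8")
                .POST(HttpRequest.BodyPublishers.ofString(csvBody, StandardCharsets.UTF_8))
                .build();
    }

    public static void main(String[] args) throws Exception {
        // Assumes documents.csv is in the current directory and Solr is running locally.
        String csv = Files.readString(Path.of("documents.csv"), StandardCharsets.UTF_8);
        HttpRequest request = buildRequest("http://localhost:8983/solr", "one", csv);
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println("HTTP status: " + response.statusCode());
        System.out.println(response.body());
    }
}
```

Keeping the request construction in its own method makes it possible to inspect the URL and headers without a running Solr instance.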

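Refresh the Overview page of core one, and you will indeed see 5 documents:

Figure 5

Searching documents

Now it is time to see how well HanLP segments text. Click Query in the left panel and search for "和服" (kimono):

Figure 6

The search returns exactly "和服的價格是每鎊15便士" ("the price of a kimono is 15 pence per pound") and not a wrong document such as "商品和服務" ("goods and services"):

Figure 7

This shows that HanLP is working well.

Many Chinese tokenizers indiscriminately match wrong documents like "商品和服務", where the characters 和服 merely straddle the boundary between 和 ("and") and 服務 ("services"). That lowers precision and degrades the user experience. At that point, how is it any better than a plain MySQL LIKE?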
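The Query panel ultimately issues a request to the core's select handler, so the same search can be reproduced outside the admin UI. The sketch below only builds the URL. The class name SolrQueryUrl is made up, querying the title field directly is an assumption based on the columns of documents.csv, and URLEncoder.encode with a Charset argument needs Java 10 or later.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper that builds a select URL for a single field:term query.
public class SolrQueryUrl {

    /** Returns a URL of the form {base}/{core}/select?q={field}:{term}&wt=json, with q URL-encoded. */
    static String selectUrl(String baseUrl, String core, String field, String term) {
        String q = URLEncoder.encode(field + ":" + term, StandardCharsets.UTF_8);
        return baseUrl + "/" + core + "/select?q=" + q + "&wt=json";
    }

    public static void main(String[] args) {
        // Paste the printed URL into a browser or pass it to curl.
        System.out.println(selectUrl("http://localhost:8983/solr", "one", "title", "和服"));
    }
}
```

With the tokenizer configured as above, the response for this query should contain document 3 only, matching what the admin UI shows.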

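What index mode does

Index mode exhaustively segments long words to produce every word contained in them. For example, in HanLP's index segmentation mode, "中醫藥大學附屬醫院" (roughly, "affiliated hospital of the university of Chinese medicine") is segmented as follows. The first line numbers each character; every other line reads [start offset:end offset position increment] term/part of speech:

    中0 醫1 藥2 大3 學4 附5 屬6 醫7 院8
    [0:3 1] 中醫藥/n
    [0:2 1] 中醫/n
    [1:3 1] 醫藥/n
    [3:5 1] 大學/n
    [5:9 1] 附屬醫院/nt
    [5:7 1] 附屬/vn
    [7:9 1] 醫院/n

With indexMode enabled, a search for "中醫", "中醫藥", or "醫藥" will find "中醫藥大學附屬醫院":

Figure 8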
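To make the idea of exhaustive segmentation concrete, here is a toy sketch. It is not HanLP's implementation: it simply takes one long word, emits it, and then emits every shorter word (two characters or more) found inside it according to a small hand-written dictionary. The class name IndexModeSketch and the dictionary contents are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy illustration of "emit the long word plus the dictionary words inside it".
public class IndexModeSketch {

    /** Emits the long word itself, then every shorter dictionary word (length >= 2) contained in it. */
    static List<String> expand(String longWord, int startOffset, Set<String> dictionary) {
        List<String> tokens = new ArrayList<>();
        tokens.add(format(startOffset, longWord));
        for (int begin = 0; begin < longWord.length(); begin++) {
            for (int end = begin + 2; end <= longWord.length(); end++) {
                String candidate = longWord.substring(begin, end);
                // Skip the long word itself; it was already emitted first.
                if (candidate.length() < longWord.length() && dictionary.contains(candidate)) {
                    tokens.add(format(startOffset + begin, candidate));
                }
            }
        }
        return tokens;
    }

    private static String format(int start, String term) {
        return "[" + start + ":" + (start + term.length()) + "] " + term;
    }

    public static void main(String[] args) {
        Set<String> dictionary = new HashSet<>(Arrays.asList("中醫", "醫藥", "附屬", "醫院"));
        expand("中醫藥", 0, dictionary).forEach(System.out::println);
        expand("附屬醫院", 5, dictionary).forEach(System.out::println);
    }
}
```

Because each sub-word becomes its own indexed term, a query for any of them can match the document, which is the behavior index mode provides.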

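Advanced configuration

The plugin currently supports the following schema.xml-based settings:

Figure 9

For more advanced settings, the HanLP tokenizer is configured mainly through hanlp.properties on the classpath. See the HanLP natural language processing package documentation for the related options, such as:

1. Stop words

2. User dictionaries

3. Part-of-speech tagging

4. ...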

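Calling from code

When rewriting queries, you can use attributes such as part of speech from the HanLPAnalyzer output, for example:

    import com.hankcs.lucene.HanLPAnalyzer;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
    import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
    import org.apache.lucene.analysis.tokenattributes.TypeAttribute;

    String text = "×××很遼闊";
    // Print every character followed by its offset
    for (int i = 0; i < text.length(); ++i)
    {
        System.out.print(text.charAt(i) + "" + i + " ");
    }
    System.out.println();
    Analyzer analyzer = new HanLPAnalyzer();
    TokenStream tokenStream = analyzer.tokenStream("field", text);
    tokenStream.reset();
    while (tokenStream.incrementToken())
    {
        CharTermAttribute attribute = tokenStream.getAttribute(CharTermAttribute.class);
        // Offsets
        OffsetAttribute offsetAtt = tokenStream.getAttribute(OffsetAttribute.class);
        // Position increment
        PositionIncrementAttribute positionAttr = tokenStream.getAttribute(PositionIncrementAttribute.class);
        // Part of speech
        TypeAttribute typeAttr = tokenStream.getAttribute(TypeAttribute.class);
        System.out.printf("[%d:%d %d] %s/%s\n", offsetAtt.startOffset(), offsetAtt.endOffset(), positionAttr.getPositionIncrement(), attribute, typeAttr.type());
    }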
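As one possible use of the part-of-speech attribute during query rewriting, you might keep only noun-like terms and drop the rest. The sketch below is a made-up rule, not something the plugin provides: it works on plain "term/tag" strings in the format printed above, so it runs without Lucene or HanLP on the classpath, and the class name PosFilterSketch is hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical rewrite rule: keep terms whose part-of-speech tag starts with a given prefix.
public class PosFilterSketch {

    /** Input items look like "term/tag"; items without a slash are skipped. */
    static List<String> keepByTagPrefix(List<String> taggedTerms, String tagPrefix) {
        List<String> kept = new ArrayList<>();
        for (String tagged : taggedTerms) {
            int slash = tagged.lastIndexOf('/');
            if (slash < 0) {
                continue; // no tag present
            }
            String term = tagged.substring(0, slash);
            String tag = tagged.substring(slash + 1);
            if (tag.startsWith(tagPrefix)) {
                kept.add(term);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> tagged = Arrays.asList("中醫藥/n", "大學/n", "附屬醫院/nt", "附屬/vn");
        // Keeps tags beginning with "n" (n, nt) and drops vn.
        System.out.println(keepByTagPrefix(tagged, "n"));
    }
}
```

In a real integration you would read the tag from TypeAttribute inside the token loop instead of parsing strings.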

In other scenarios, you can construct a HanLPTokenizer from a custom segmenter (for example, one with named entity recognition enabled, a Traditional Chinese segmenter, or a CRF segmenter):

    tokenizer = new HanLPTokenizer(HanLP.newSegment()
            .enableJapaneseNameRecognize(true)
            .enableIndexMode(true), null, false);
    tokenizer.setReader(new StringReader("林志玲亮相網友:確定不是波多野結衣？"));
    ...

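Feedback

For technical questions, please open an issue on GitHub so that everyone can join the discussion and everything is tracked in one place. HanLP-related questions sent via blog comments, Weibo private messages, or email will not be handled. Thank you for your cooperation!

When reporting a problem, be sure to include the version number, the code that triggers it, and the input and output; otherwise it cannot be processed.

License

Apache License Version 2.0

Reposted from 碼農場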
