Extending the IKAnalyzer Dictionary When Building a Lucene Index

Reposted from: http://blog.163.com/iamlyia0_0/blog/static/50957997201481510100729/

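Option 1: Configuration-based dictionary extension

[Figure: project structure]

The IK analyzer also lets you extend it with your own dictionaries through the IKAnalyzer.cfg.xml configuration file. The Google Pinyin word list can be downloaded from: http://ishare.iask.sina.com.cn/f/14446921.html?from=like
Create IKAnalyzer.cfg.xml in the src directory of the web project with the following content: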

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
    <comment>IK Analyzer extension configuration</comment>
    <entry key="ext_dict">/dicdata/use.dic.dic;/dicdata/googlepy.dic</entry>
    <entry key="ext_stopwords">/dicdata/ext_stopword.dic</entry>
</properties>
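The DOCTYPE above is the standard Java XML properties format, so the file can be read with java.util.Properties.loadFromXML. The sketch below assumes that this is how the entries are meant to be read and that a semicolon separates multiple dictionary paths, as in the ext_dict entry. The class name IkConfigSketch and the paths helper are hypothetical names used only for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

final class IkConfigSketch {

    // Same shape as the IKAnalyzer.cfg.xml shown in this article.
    static final String SAMPLE_XML =
          "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
        + "<!DOCTYPE properties SYSTEM \"http://java.sun.com/dtd/properties.dtd\">\n"
        + "<properties>\n"
        + "<comment>IK Analyzer extension configuration</comment>\n"
        + "<entry key=\"ext_dict\">/dicdata/use.dic.dic;/dicdata/googlepy.dic</entry>\n"
        + "<entry key=\"ext_stopwords\">/dicdata/ext_stopword.dic</entry>\n"
        + "</properties>\n";

    /** Parses the XML and returns the semicolon-separated paths stored under the given key. */
    static List<String> paths(String xml, String key) {
        Properties props = new Properties();
        try {
            props.loadFromXML(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        List<String> result = new ArrayList<>();
        String value = props.getProperty(key);
        if (value == null) {
            return result; // key not configured
        }
        for (String part : value.split(";")) {
            String trimmed = part.trim();
            if (!trimmed.isEmpty()) {
                result.add(trimmed);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(paths(SAMPLE_XML, "ext_dict"));
        System.out.println(paths(SAMPLE_XML, "ext_stopwords"));
    }
}
```

A check like this is a quick way to confirm that a hand-edited configuration file is still well-formed before deploying it.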

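Editing and deploying dictionary files
The analyzer's dictionary files are Chinese text files encoded as UTF-8 without a BOM; the file extension does not matter. Each word occupies its own line, with DOS-style \r\n line endings. (Note: if you are unsure what BOM-less UTF-8 is, make sure the dictionary is saved as UTF-8 and add an empty line at the top of the file.) For reference, see the .dic files in the org.wltea.analyzer.dic package of the analyzer source. Dictionary files should be deployed on the Java resource path, that is, somewhere the ClassLoader can load them (ideally alongside IKAnalyzer.cfg.xml).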
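One way to avoid BOM problems is to generate the dictionary file from code instead of saving it from an editor. The following is a minimal sketch of the format described above (one word per line, \r\n endings, UTF-8 without a BOM). The class DicFileWriterSketch and its methods are hypothetical helpers, not part of IK.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;

final class DicFileWriterSketch {

    /** Encodes one word per line, CRLF-terminated, as UTF-8 without a BOM. */
    static byte[] encode(List<String> words) {
        StringBuilder sb = new StringBuilder();
        for (String word : words) {
            String trimmed = word.trim();
            if (!trimmed.isEmpty()) {
                sb.append(trimmed).append("\r\n");
            }
        }
        // String.getBytes(UTF_8) does not emit a byte order mark.
        return sb.toString().getBytes(StandardCharsets.UTF_8);
    }

    /** Returns true if the data starts with the UTF-8 BOM bytes EF BB BF. */
    static boolean startsWithBom(byte[] data) {
        return data.length >= 3
            && (data[0] & 0xFF) == 0xEF
            && (data[1] & 0xFF) == 0xBB
            && (data[2] & 0xFF) == 0xBF;
    }

    /** Writes the encoded dictionary to the target path. */
    static void write(Path target, List<String> words) {
        try {
            Files.write(target, encode(words));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("ext", ".dic");
        write(tmp, Arrays.asList("感冒靈", "銀花感冒顆粒"));
        byte[] written = Files.readAllBytes(tmp);
        System.out.println("bytes written: " + written.length);
        System.out.println("has BOM: " + startsWithBom(written));
        Files.delete(tmp);
    }
}
```

The startsWithBom check can also be run against an existing .dic file to see whether an editor has added a BOM.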

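Option 2: API-based dictionary extension

The IKAnalyzer operations related to dictionary entries live in two packages:
1. org.wltea.analyzer.cfg
2. org.wltea.analyzer.dic

Methods defined by the Configuration interface in org.wltea.analyzer.cfg:
    getExtDictionarys()  returns the configured paths of the extension dictionaries
    getExtStopWordDictionarys()  returns the configured paths of the extension stop-word dictionaries
    getMainDictionary()  returns the path of the main dictionary
    getQuantifierDicionary()  returns the path of the quantifier dictionary
The org.wltea.analyzer.cfg.DefaultConfig class implements the Configuration interface.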

Relevant methods of the Dictionary class in org.wltea.analyzer.dic:

public void addWords(java.util.Collection<java.lang.String> words)
Loads new entries in bulk. Parameter: words - the collection of entries to add

public void disableWords(java.util.Collection<java.lang.String> words)
Removes (disables) entries in bulk
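Both methods take a Collection<String>, so the words have to be gathered first. The sketch below reads dictionary-style text (one word per line) into a de-duplicated list that could then be handed to addWords. DicWordLoaderSketch is a hypothetical helper. The commented-out IK calls at the end are an assumption about IK Analyzer 2012 usage and are not compiled here, so check them against the version you use.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

final class DicWordLoaderSketch {

    /** Parses UTF-8 dictionary bytes into distinct, non-empty words in file order. */
    static List<String> parse(byte[] utf8Bytes) {
        String text = new String(utf8Bytes, StandardCharsets.UTF_8);
        // Drop a leading byte order mark if an editor added one.
        if (!text.isEmpty() && text.charAt(0) == '\uFEFF') {
            text = text.substring(1);
        }
        Set<String> words = new LinkedHashSet<>();
        for (String line : text.split("\r?\n")) {
            String word = line.trim();
            if (!word.isEmpty()) {
                words.add(word);
            }
        }
        return new ArrayList<>(words);
    }

    public static void main(String[] args) {
        byte[] sample = "感冒靈\r\n銀花感冒顆粒\r\n\r\n感冒靈\r\n".getBytes(StandardCharsets.UTF_8);
        List<String> words = parse(sample);
        System.out.println(words);

        // Assumed IK Analyzer 2012 usage (needs the IK jar on the classpath):
        // Dictionary.initial(DefaultConfig.getInstance());
        // Dictionary.getSingleton().addWords(words);
        // Dictionary.getSingleton().disableWords(words);
    }
}
```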

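Example: using the IKAnalyzer tokenizer with Lucene
Business entity

package com.icrate.service.study.demo;

/**
 * @version 1.0
 * @author 蘇若年
 * @since 1.0, created 2013-4-7 01:52:49 PM
 * @function TODO
 */
public class Medicine {

    private Integer id;
    private String name;
    private String function;

    public Medicine() {
    }

    public Medicine(Integer id, String name, String function) {
        super();
        this.id = id;
        this.name = name;
        this.function = function;
    }

    // Getters and setters (used by the indexing and search code)
    public Integer getId() { return id; }
    public void setId(Integer id) { this.id = id; }
    public String getName() { return name; }
    public void setName(String name) { this.name = name; }
    public String getFunction() { return function; }
    public void setFunction(String function) { this.function = function; }

    @Override
    public String toString() {
        return this.id + "," + this.name + "," + this.function;
    }
}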

Building the mock data

package com.icrate.service.study.demo;

import java.util.ArrayList;
import java.util.List;

/**
 * @version 1.0
 * @author 蘇若年
 * @since 1.0, created 2013-4-7 01:54:34 PM
 * @function TODO
 */
public class DataFactory {

    private static DataFactory dataFactory = new DataFactory();

    // Singleton: instances are obtained through getInstance()
    private DataFactory() {
    }

    public List<Medicine> getData() {
        List<Medicine> list = new ArrayList<Medicine>();
        list.add(new Medicine(1, "銀花感冒顆粒", "功能主治：銀花感冒顆粒，頭痛,清熱，解表，利咽。"));
        list.add(new Medicine(2, "感冒止咳糖漿", "功能主治：感冒止咳糖漿,解表清熱，止咳化痰。"));
        list.add(new Medicine(3, "感冒靈顆粒", "功能主治：解熱鎮痛。頭痛 ,清熱。"));
        list.add(new Medicine(4, "感冒靈膠囊", "功能主治：銀花感冒顆粒，頭痛,清熱，解表，利咽。"));
        list.add(new Medicine(5, "仁和感冒顆粒", "功能主治：疏風清熱，宣肺止咳,解表清熱，止咳化痰。"));
        return list;
    }

    public static DataFactory getInstance() {
        return dataFactory;
    }
}

Searching the mock data with Lucene

package com.icrate.service.study.demo;

import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.queryParser.MultiFieldQueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.highlight.Formatter;
import org.apache.lucene.search.highlight.Fragmenter;
import org.apache.lucene.search.highlight.Highlighter;
import org.apache.lucene.search.highlight.QueryScorer;
import org.apache.lucene.search.highlight.Scorer;
import org.apache.lucene.search.highlight.SimpleFragmenter;
import org.apache.lucene.search.highlight.SimpleHTMLFormatter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;
import org.wltea.analyzer.lucene.IKAnalyzer;

/**
 * LuceneIKUtil.java
 * @version 1.1
 * @author 蘇若年
 * @since 1.0, created Apr 3, 2013 11:48:11 AM
 * TODO: using the IK tokenizer in Lucene
 */
public class LuceneIKUtil {

    private Directory directory;
    private Analyzer analyzer;

    /**
     * Constructor with a parameter that specifies the index directory.
     * @param indexFilePath path of the directory that holds the index files
     */
    public LuceneIKUtil(String indexFilePath) {
        try {
            directory = FSDirectory.open(new File(indexFilePath));
            analyzer = new IKAnalyzer();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    /**
     * Default constructor; uses the default index path.
     */
    public LuceneIKUtil() {
        this("/luence/index");
    }

    /**
     * Creates the index from the mock data, discarding any existing documents.
     * Date: Apr 3, 2013
     * @throws Exception if the index cannot be written
     */
    public void createIndex() throws Exception {
        IndexWriterConfig indexWriterConfig = new IndexWriterConfig(Version.LUCENE_35, analyzer);
        IndexWriter indexWriter = new IndexWriter(directory, indexWriterConfig);
        indexWriter.deleteAll();
        List<Medicine> list = DataFactory.getInstance().getData();
        for (int i = 0; i < list.size(); i++) {
            Medicine medicine = list.get(i);
            Document document = addDocument(medicine.getId(), medicine.getName(), medicine.getFunction());
            indexWriter.addDocument(document);
        }
        indexWriter.close();
    }

    /**
     * Builds a Lucene Document from the fields of a medicine.
     * Date: Apr 3, 2013
     * @param id       medicine ID
     * @param name     medicine name
     * @param function description of what the medicine treats
     * @return the document to index
     */
    public Document addDocument(Integer id, String name, String function) {
        Document doc = new Document();
        // Field.Index.NO            not indexed
        // Field.Index.ANALYZED      tokenized and indexed
        // Field.Index.NOT_ANALYZED  indexed without tokenizing
        doc.add(new Field("id", String.valueOf(id), Field.Store.YES, Field.Index.NOT_ANALYZED));
        doc.add(new Field("name", name, Field.Store.YES, Field.Index.ANALYZED));
        doc.add(new Field("function", function, Field.Store.YES, Field.Index.ANALYZED));
        return doc;
    }

    /**
     * Updates the indexed document that has the given ID.
     * Date: Apr 3, 2013
     * @param id      ID of the document to replace
     * @param title   new value for the name field
     * @param content new value for the function field
     */
    public void update(Integer id, String title, String content) {
        try {
            IndexWriterConfig indexWriterConfig = new IndexWriterConfig(Version.LUCENE_35, analyzer);
            IndexWriter indexWriter = new IndexWriter(directory, indexWriterConfig);
            Document document = addDocument(id, title, content);
            // updateDocument deletes every document matching the term, then adds the new one
            Term term = new Term("id", String.valueOf(id));
            indexWriter.updateDocument(term, document);
            indexWriter.close();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    /**
     * Deletes the indexed document that has the given ID.
     * Date: Apr 3, 2013
     * @param id ID of the document to delete
     */
    public void delete(Integer id) {
        try {
            IndexWriterConfig indexWriterConfig = new IndexWriterConfig(Version.LUCENE_35, analyzer);
            IndexWriter indexWriter = new IndexWriter(directory, indexWriterConfig);
            Term term = new Term("id", String.valueOf(id));
            indexWriter.deleteDocuments(term);
            indexWriter.close();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    /**
     * Searches the index.
     * Date: Apr 3, 2013
     * @param fields  names of the fields to search
     * @param keyword query text
     * @return the matching medicines, with the keyword highlighted in name and function
     */
    public List<Medicine> search(String[] fields, String keyword) {
        IndexReader indexReader = null;
        IndexSearcher indexSearcher = null;
        List<Medicine> result = new ArrayList<Medicine>();
        try {
            // Open a read-only reader and create a searcher on top of it
            indexReader = IndexReader.open(directory, true);
            indexSearcher = new IndexSearcher(indexReader);
            MultiFieldQueryParser queryParser = new MultiFieldQueryParser(Version.LUCENE_35, fields, analyzer);
            Query query = queryParser.parse(keyword);
            // Return the top 10 hits
            TopDocs topDocs = indexSearcher.search(query, 10);
            // Report the total number of hits
            int totalCount = topDocs.totalHits;
            System.out.println("共檢索出 " + totalCount + " 條記錄");

            /*
             * Highlighting: create a highlighter so that matches stand out in the results.
             * SimpleHTMLFormatter controls how the highlighted keyword is marked up.
             * It has two constructors:
             * 1: SimpleHTMLFormatter(), the default; output is <B>keyword</B>
             * 2: SimpleHTMLFormatter(String preTag, String postTag); output is preTag + keyword + postTag
             */
            Formatter formatter = new SimpleHTMLFormatter("<font color='red'>", "</font>");

            /*
             * QueryScorer is the built-in scorer. A scorer's first job is to rank fragments.
             * The terms QueryScorer uses come from the user's query: it extracts terms from the
             * original words, phrases, and Boolean queries and weights them by their boost factor.
             * For QueryScorer to use a query, the query must first be rewritten into its primitive
             * form: wildcard, fuzzy, prefix, and range queries are all rewritten into the terms of
             * a BooleanQuery. Before passing a Query to QueryScorer, you can call
             * Query.rewrite(IndexReader) to rewrite it.
             */
            Scorer fragmentScorer = new QueryScorer(query);
            Highlighter highlighter = new Highlighter(formatter, fragmentScorer);

            /*
             * Highlighter uses a Fragmenter to split the original text into fragments.
             * The built-in SimpleFragmenter produces fragments of equal size, 100 characters
             * by default; the size is configurable.
             */
            Fragmenter fragmenter = new SimpleFragmenter(100);
            highlighter.setTextFragmenter(fragmenter);

            ScoreDoc[] scoreDocs = topDocs.scoreDocs;
            for (ScoreDoc scDoc : scoreDocs) {
                Document document = indexSearcher.doc(scDoc.doc);
                Integer id = Integer.parseInt(document.get("id"));
                String name = document.get("name");
                String function = document.get("function");
                //float score = scDoc.score; // relevance score

                // getBestFragment returns null when the field contains no match
                String lighterName = highlighter.getBestFragment(analyzer, "name", name);
                if (null == lighterName) {
                    lighterName = name;
                }
                String lighterFunction = highlighter.getBestFragment(analyzer, "function", function);
                if (null == lighterFunction) {
                    lighterFunction = function;
                }

                Medicine medicine = new Medicine();
                medicine.setId(id);
                medicine.setName(lighterName);
                medicine.setFunction(lighterFunction);
                result.add(medicine);
            }
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            // Either may still be null if opening the index failed
            try {
                if (indexSearcher != null) {
                    indexSearcher.close();
                }
                if (indexReader != null) {
                    indexReader.close();
                }
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
        return result;
    }

    public static void main(String[] args) {
        LuceneIKUtil luceneProcess = new LuceneIKUtil("F:/index");
        try {
            luceneProcess.createIndex();
        } catch (Exception e) {
            e.printStackTrace();
        }

        // Update test
        luceneProcess.update(2, "測試內容", "修改測試。。。");

        // Search test
        String[] fields = {"name", "function"};
        List<Medicine> list = luceneProcess.search(fields, "感冒");
        for (int i = 0; i < list.size(); i++) {
            Medicine medicine = list.get(i);
            System.out.println("(" + medicine.getId() + ")" + medicine.getName() + "\t" + medicine.getFunction());
        }

        // Delete test
        //luceneProcess.delete(1);
    }
}

Program output

加載擴展詞典：/dicdata/use.dic.dic
加載擴展詞典：/dicdata/googlepy.dic
加載擴展停止詞典：/dicdata/ext_stopword.dic
共檢索出 4 條記錄
(1)銀花 <font color='red'>感冒</font>顆粒    功能主治：銀花<font color='red'>感冒</font>顆粒 ，頭痛,清熱，解表，利咽。
(4)<font color='red'>感冒</font>靈膠囊    功能主治：銀花<font color='red'>感冒</font>顆粒 ，頭痛,清熱，解表，利咽。
(3)<font color='red'>感冒</font>靈顆粒    功能主治：解熱鎮痛。頭痛 ,清熱。
(5)仁和 <font color='red'>感冒</font>顆粒    功能主治：疏風清熱，宣肺止咳,解表清熱，止咳化痰。
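The font tags in this output are the preTag and postTag passed to SimpleHTMLFormatter. As a rough illustration of that wrapping step only, the plain-Java sketch below surrounds each occurrence of a keyword with the two tags. It is not Lucene's Highlighter: the real one works on analyzer tokens and scored fragments, whereas the hypothetical HighlightSketch.highlight helper here does naive substring matching.

```java
final class HighlightSketch {

    /** Wraps every occurrence of term in text with preTag and postTag. */
    static String highlight(String text, String term, String preTag, String postTag) {
        if (term.isEmpty()) {
            return text; // nothing to mark
        }
        StringBuilder out = new StringBuilder();
        int from = 0;
        int hit;
        while ((hit = text.indexOf(term, from)) >= 0) {
            // Copy the text before the match, then the tagged match itself
            out.append(text, from, hit).append(preTag).append(term).append(postTag);
            from = hit + term.length();
        }
        // Copy whatever follows the last match
        return out.append(text, from, text.length()).toString();
    }

    public static void main(String[] args) {
        System.out.println(highlight("感冒靈顆粒", "感冒", "<font color='red'>", "</font>"));
    }
}
```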

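How to check whether an index already exists

    /**
     * Checks whether index files already exist.
     * @param indexPath path of the index directory (created if it does not exist)
     * @return true if an index is already present in the directory
     */
    private boolean isExistIndexFile(String indexPath) throws Exception {
        File file = new File(indexPath);
        if (!file.exists()) {
            file.mkdirs();
        }
        String indexSuffix = "/segments.gen";
        // The presence of segments.gen tells us whether the index has been created before
        File indexFile = new File(indexPath + indexSuffix);
        return indexFile.exists();
    }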

Appendix: the IK tokenization process

First, an overview of IK's entire tokenization process:

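1. Lucene's base class for analysis is Analyzer, so IK provides an implementation of it, IKAnalyzer. The first step is to instantiate an IKAnalyzer. One of its constructors takes a parameter, isMaxWordLength, which selects between two segmentation algorithms: maximum-word-length segmentation or finest-grained segmentation. In the implementation, maximum-word-length segmentation is a post-processing step on finest-grained segmentation: it filters the finest-grained results and keeps the longest ones.

2. IKAnalyzer overrides Analyzer's tokenStream method, which takes two parameters: the field name and the input reader. The field name is a Lucene field, the name under which the text is indexed after tokenization, similar to a column name in a database. Because IK only handles tokenization, it does nothing with the field name, so it is not discussed further here.

3. Lucene calls tokenStream when it tokenizes the text input reader. IKAnalyzer's tokenStream merely instantiates an IKTokenizer, which extends Lucene's Tokenizer class and overrides incrementToken. That method processes the text input stream and produces tokens, that is, Lucene's smallest lexical unit, the term, which IK calls a Lexeme.

4. The IKTokenizer constructor instantiates IK's most important segmentation class, IKSegmentation, also called the main segmenter. Its constructor takes two parameters, reader and isMaxWordLength.

5. The IKSegmentation constructor does three main things: it creates the Context object, loads the dictionaries, and creates the sub-segmenters.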

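6. Context mainly stores the segmentation result set and records the cursor position of the segmentation process.

7. The dictionary is created as a singleton and mainly comprises the quantifier dictionary, the main dictionary, and the stop-word dictionary. Dictionaries are stored in DictSegment, the core dictionary-fragment class. DictSegment has a static structure, charMap, a shared character table that stores all Chinese characters; both key and value are a single Chinese character, and IK's charMap currently holds a little over 7,100 entries. DictSegment also has its two most important data structures, which store the dictionary tree: childrenArray, an array of DictSegment, and childrenMap, a HashMap whose key is a single character (the first character of each entry) and whose value is a DictSegment. Only one of the two is used at a time to store the tree.

8. The sub-segmenters do the actual segmentation work. IK has three of them: the quantifier segmenter, the CJK segmenter (which handles Chinese), and the stop-word segmenter. The main segmenter, IKSegmentation, iterates over these three to segment the text input stream.

9. IKTokenizer's incrementToken calls IKSegmentation's next method, which returns the next segmentation result. The first time next is called, it loads the text input stream and reads it into a buffer, then iterates over the sub-segmenters to segment the text in the buffer and adds the results to the context's lexemeSet.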
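The dictionary tree from step 7 and the two segmentation modes from step 1 can be sketched in a few lines of plain Java. This is a simplified illustration under stated assumptions, not IK's implementation: the class DictTrieSketch, its Node type, and the keepLongest filter are hypothetical, and IK's own filtering rules and its childrenArray/childrenMap switch are not reproduced.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class DictTrieSketch {

    /** One tree node; children are keyed by the next character, like a childrenMap. */
    static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        boolean wordEnd; // true if the path from the root to here spells a dictionary word
    }

    private final Node root = new Node();

    void addWord(String word) {
        Node node = root;
        for (char c : word.toCharArray()) {
            node = node.children.computeIfAbsent(c, k -> new Node());
        }
        node.wordEnd = true;
    }

    /** Finest-grained pass: every dictionary word at every position, as [begin, end) offsets. */
    List<int[]> matchAll(String text) {
        List<int[]> hits = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            Node node = root;
            for (int j = i; j < text.length(); j++) {
                node = node.children.get(text.charAt(j));
                if (node == null) {
                    break; // no dictionary word continues with this character
                }
                if (node.wordEnd) {
                    hits.add(new int[] {i, j + 1});
                }
            }
        }
        return hits;
    }

    /** Post-filter: drops every hit that lies inside another, longer hit. */
    static List<int[]> keepLongest(List<int[]> hits) {
        List<int[]> kept = new ArrayList<>();
        for (int[] a : hits) {
            boolean covered = false;
            for (int[] b : hits) {
                boolean inside = b[0] <= a[0] && a[1] <= b[1];
                if (b != a && inside && (b[1] - b[0]) > (a[1] - a[0])) {
                    covered = true;
                    break;
                }
            }
            if (!covered) {
                kept.add(a);
            }
        }
        return kept;
    }

    /** Turns offsets back into the matched substrings. */
    static List<String> words(String text, List<int[]> hits) {
        List<String> out = new ArrayList<>();
        for (int[] h : hits) {
            out.add(text.substring(h[0], h[1]));
        }
        return out;
    }

    public static void main(String[] args) {
        DictTrieSketch dict = new DictTrieSketch();
        dict.addWord("感冒");
        dict.addWord("感冒靈");
        dict.addWord("顆粒");
        String text = "感冒靈顆粒";
        List<int[]> all = dict.matchAll(text);
        System.out.println("finest-grained: " + words(text, all));
        System.out.println("longest only:   " + words(text, keepLongest(all)));
    }
}
```

Running the longest-only filter over the finest-grained hits mirrors the description in step 1, where maximum-word-length segmentation is a post-processing pass rather than a separate matcher.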
