Use cases for the suggest package
User input is unpredictable, yet when writing a program we usually want users to search with specific content, or content in a specific format, which means the search terms a user types have to be massaged programmatically. When using search engines such as Baidu or Google, we often see suggestions for related searches pop up the moment a key is released. Lucene anticipated exactly this need: the suggest package it provides is designed to solve the problems above.
An overview of the suggest package
The suggest package provides Lucene's support for auto-completion and spell checking:
- the spell-checking classes live in the org.apache.lucene.search.spell package;
- the suggestion classes live in the org.apache.lucene.search.suggest package;
- the analyzer-based suggestion classes live in the org.apache.lucene.search.suggest.analyzing package.
How spell checking works
- Lucene's spell checking is provided by the org.apache.lucene.search.spell.SpellChecker class;
- SpellChecker uses a default accuracy of 0.5; if finer-grained control is needed, it can be set by calling setAccuracy(float accuracy);
- SpellChecker indexes words taken from an external source.
These sources include:
DocumentDictionary — reads the value of a given field from each document;
FileDictionary — a Dictionary backed by a plain-text file, one entry per line, fields separated by a TAB ("\t"); an entry may not contain more than two separators;
HighFrequencyDictionary — reads the values of a given term from an existing index and filters them by how often they occur;
LuceneDictionary — also reads the values of a given term from an existing index, but does not check occurrence counts;
PlainTextDictionary — reads words from a plain-text file, one per line, with no separators.
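To make the FileDictionary line layout described above concrete, here is a small self-contained sketch that parses lines in that format (term, optional weight, separated by TAB, at most two separators per entry). The class and method names are my own, purely for illustration; this is not Lucene's parsing code:

```java
import java.util.Arrays;

public class FileDictionaryFormat {
    // Split one FileDictionary-style line into its TAB-separated fields.
    // A line may carry a term alone, a term plus weight, or a term plus
    // weight plus payload -- i.e. at most two separators.
    static String[] parseLine(String line) {
        String[] parts = line.split("\t");
        if (parts.length > 3) {
            throw new IllegalArgumentException("at most two TAB separators allowed: " + line);
        }
        return parts;
    }

    public static void main(String[] args) {
        for (String line : Arrays.asList("中國人民\t100", "奔馳中國\t102", "lucene")) {
            String[] parts = parseLine(line);
            String term = parts[0];
            // entries without an explicit weight fall back to a default of 1
            long weight = parts.length > 1 ? Long.parseLong(parts[1]) : 1L;
            System.out.println(term + " -> " + weight);
        }
    }
}
```

The sample suggest.txt shown later in this article follows exactly this one-entry-per-line, TAB-separated shape.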
The indexing process works as follows:
- the whole indexing step is synchronized (on modifyCurrentIndexLock);
- it checks whether the SpellChecker has already been closed, and if so throws an AlreadyClosedException with the message "Spellchecker has been closed";
- it iterates over the words from the external source, measuring each word's length: words shorter than three characters are skipped; otherwise a Document object is built and indexed into the local spell index. During indexing each word is split in detail into grams (see the addGram method). The process is implemented as follows:
/**
 * Indexes the data from the given {@link Dictionary}.
 * @param dict Dictionary to index
 * @param config {@link IndexWriterConfig} to use
 * @param fullMerge whether or not the spellcheck index should be fully merged
 * @throws AlreadyClosedException if the Spellchecker is already closed
 * @throws IOException If there is a low-level I/O error.
 */
public final void indexDictionary(Dictionary dict, IndexWriterConfig config, boolean fullMerge) throws IOException {
  synchronized (modifyCurrentIndexLock) {
    ensureOpen();
    final Directory dir = this.spellIndex;
    final IndexWriter writer = new IndexWriter(dir, config);
    IndexSearcher indexSearcher = obtainSearcher();
    final List<TermsEnum> termsEnums = new ArrayList<>();

    final IndexReader reader = indexSearcher.getIndexReader();
    if (reader.maxDoc() > 0) {
      for (final LeafReaderContext ctx : reader.leaves()) {
        Terms terms = ctx.reader().terms(F_WORD);
        if (terms != null)
          termsEnums.add(terms.iterator(null));
      }
    }

    boolean isEmpty = termsEnums.isEmpty();

    try {
      BytesRefIterator iter = dict.getEntryIterator();
      BytesRef currentTerm;

      terms: while ((currentTerm = iter.next()) != null) {
        String word = currentTerm.utf8ToString();
        int len = word.length();
        if (len < 3) {
          continue; // too short we bail but "too long" is fine...
        }

        if (!isEmpty) {
          for (TermsEnum te : termsEnums) {
            if (te.seekExact(currentTerm)) {
              continue terms;
            }
          }
        }

        // ok index the word
        Document doc = createDocument(word, getMin(len), getMax(len));
        writer.addDocument(doc);
      }
    } finally {
      releaseSearcher(indexSearcher);
    }

    if (fullMerge) {
      writer.forceMerge(1);
    }
    // close writer
    writer.close();
    // TODO: this isn't that great, maybe in the future SpellChecker should take
    // IWC in its ctor / keep its writer open?

    // also re-open the spell index to see our own changes when the next suggestion
    // is fetched:
    swapSearcher(dir);
  }
}
Each word is split into grams by the addGram method: for every gram size in effect for the word's length it stores the word's n-grams, and additionally records the leading gram and the trailing gram in dedicated fields. In other words, the spell index cares not only about the grams at each position of a word but also about which grams begin and end it, which is what allows prefix and suffix matches to be boosted at query time.
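The gram splitting itself can be sketched as follows. This is a standalone approximation of SpellChecker's gram formation, not the library's exact code; the gramN/startN/endN field names match the ones used by the query code below:

```java
public class GramDemo {
    // Split text into all contiguous n-grams of length ng.
    static String[] formGrams(String text, int ng) {
        int len = text.length() - ng + 1;
        if (len < 1) {
            return new String[0]; // word shorter than the gram size
        }
        String[] res = new String[len];
        for (int i = 0; i < len; i++) {
            res[i] = text.substring(i, i + ng);
        }
        return res;
    }

    public static void main(String[] args) {
        String word = "lucene";
        for (int ng = 3; ng <= 4; ng++) {
            String[] grams = formGrams(word, ng);
            // gramN holds every n-gram; startN/endN hold the first and last one
            System.out.println("gram" + ng + ": " + String.join(" ", grams));
            System.out.println("start" + ng + ": " + grams[0]
                + ", end" + ng + ": " + grams[grams.length - 1]);
        }
    }
}
```

For "lucene" with ng = 3 this yields the grams luc, uce, cen, ene, with luc as the start gram and ene as the end gram.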
- When looking up suggestions, SpellChecker first checks whether the gram fields contain the n-grams of the query word after splitting; matching candidates are placed into a SuggestWordQueue, and the final result is the String[] obtained by draining that queue. The implementation is as follows:
public String[] suggestSimilar(String word, int numSug, IndexReader ir, String field,
    SuggestMode suggestMode, float accuracy) throws IOException {
  // obtainSearcher calls ensureOpen
  final IndexSearcher indexSearcher = obtainSearcher();
  try {
    if (ir == null || field == null) {
      suggestMode = SuggestMode.SUGGEST_ALWAYS;
    }
    if (suggestMode == SuggestMode.SUGGEST_ALWAYS) {
      ir = null;
      field = null;
    }

    final int lengthWord = word.length();

    final int freq = (ir != null && field != null) ? ir.docFreq(new Term(field, word)) : 0;
    final int goalFreq = suggestMode == SuggestMode.SUGGEST_MORE_POPULAR ? freq : 0;
    // if the word exists in the real index and we don't care for word frequency, return the word itself
    if (suggestMode == SuggestMode.SUGGEST_WHEN_NOT_IN_INDEX && freq > 0) {
      return new String[] { word };
    }

    BooleanQuery query = new BooleanQuery();
    String[] grams;
    String key;

    for (int ng = getMin(lengthWord); ng <= getMax(lengthWord); ng++) {
      key = "gram" + ng; // form key
      grams = formGrams(word, ng); // form word into ngrams (allow dups too)
      if (grams.length == 0) {
        continue; // hmm
      }
      if (bStart > 0) { // should we boost prefixes?
        add(query, "start" + ng, grams[0], bStart); // matches start of word
      }
      if (bEnd > 0) { // should we boost suffixes
        add(query, "end" + ng, grams[grams.length - 1], bEnd); // matches end of word
      }
      for (int i = 0; i < grams.length; i++) {
        add(query, key, grams[i]);
      }
    }

    int maxHits = 10 * numSug;
    // System.out.println("Q: " + query);
    ScoreDoc[] hits = indexSearcher.search(query, maxHits).scoreDocs;
    // System.out.println("HITS: " + hits.length());
    SuggestWordQueue sugQueue = new SuggestWordQueue(numSug, comparator);

    // go thru more than 'maxr' matches in case the distance filter triggers
    int stop = Math.min(hits.length, maxHits);

    SuggestWord sugWord = new SuggestWord();
    for (int i = 0; i < stop; i++) {
      sugWord.string = indexSearcher.doc(hits[i].doc).get(F_WORD); // get orig word

      // don't suggest a word for itself, that would be silly
      if (sugWord.string.equals(word)) {
        continue;
      }

      // edit distance
      sugWord.score = sd.getDistance(word, sugWord.string);
      if (sugWord.score < accuracy) {
        continue;
      }

      if (ir != null && field != null) { // use the user index
        sugWord.freq = ir.docFreq(new Term(field, sugWord.string)); // freq in the index
        // don't suggest a word that is not present in the field
        if ((suggestMode == SuggestMode.SUGGEST_MORE_POPULAR && goalFreq > sugWord.freq) || sugWord.freq < 1) {
          continue;
        }
      }
      sugQueue.insertWithOverflow(sugWord);
      if (sugQueue.size() == numSug) {
        // if queue full, maintain the minScore score
        accuracy = sugQueue.top().score;
      }
      sugWord = new SuggestWord();
    }

    // convert to array string
    String[] list = new String[sugQueue.size()];
    for (int i = sugQueue.size() - 1; i >= 0; i--) {
      list[i] = sugQueue.pop().string;
    }
    return list;
  } finally {
    releaseSearcher(indexSearcher);
  }
}
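The accuracy check above (sugWord.score < accuracy) is what the setAccuracy threshold mentioned earlier controls. The following self-contained sketch shows the idea with a plain Levenshtein-based similarity; it is an approximation for illustration, not Lucene's StringDistance implementation (which is pluggable):

```java
import java.util.Arrays;
import java.util.List;

public class AccuracyDemo {
    // classic dynamic-programming Levenshtein edit distance
    static int levenshtein(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                                   d[i - 1][j - 1] + cost);
            }
        }
        return d[a.length()][b.length()];
    }

    // similarity in [0, 1]; 1.0 means the strings are identical
    static float similarity(String a, String b) {
        int maxLen = Math.max(a.length(), b.length());
        return maxLen == 0 ? 1f : 1f - (float) levenshtein(a, b) / maxLen;
    }

    public static void main(String[] args) {
        String query = "lucene";
        List<String> candidates = Arrays.asList("lucene", "lucine", "hadoop");
        float accuracy = 0.5f; // SpellChecker's default threshold
        for (String c : candidates) {
            float score = similarity(query, c);
            // candidates scoring below the accuracy threshold are dropped,
            // mirroring the "sugWord.score < accuracy" check above
            if (score >= accuracy) {
                System.out.println(c + " " + score);
            }
        }
    }
}
```

With the default 0.5, a one-character typo like "lucine" survives the filter while an unrelated word like "hadoop" is dropped; raising the threshold makes suggestions stricter.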
Programming practice
Below is a test program I wrote based on the FileDictionary description above:
package com.lucene.search;

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.nio.file.Paths;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.IndexWriterConfig.OpenMode;
import org.apache.lucene.search.spell.SpellChecker;
import org.apache.lucene.search.suggest.FileDictionary;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.wltea.analyzer.lucene.IKAnalyzer;

public class SuggestUtil {
  public static void main(String[] args) {
    Directory spellIndexDirectory;
    try {
      spellIndexDirectory = FSDirectory.open(Paths.get("suggest"));
      SpellChecker spellchecker = new SpellChecker(spellIndexDirectory);
      Analyzer analyzer = new IKAnalyzer(true);
      IndexWriterConfig config = new IndexWriterConfig(analyzer);
      config.setOpenMode(OpenMode.CREATE_OR_APPEND);
      spellchecker.setAccuracy(0f);
      //HighFrequencyDictionary dire = new HighFrequencyDictionary(reader, field, thresh)
      spellchecker.indexDictionary(
          new FileDictionary(new FileInputStream(new File("D:\\hadoop\\lucene_suggest\\src\\suggest.txt"))),
          config, false);
      String[] similars = spellchecker.suggestSimilar("中國", 10);
      for (String similar : similars) {
        System.out.println(similar);
      }
      // close the checker before the directory it reads from
      spellchecker.close();
      spellIndexDirectory.close();
    } catch (IOException e) {
      e.printStackTrace();
    }
  }
}
The suggest.txt file I used contains:
中國人民 100
奔馳3 101
奔馳中國 102
奔馳S級 103
奔馳A級 104
奔馳C級 105
The test output is:
中國人民
奔馳中國
"Learning Lucene step by step" is a summary of my recent work on Lucene indexing. If you have questions, contact me on QQ: 891922381, or join my new QQ group: 106570134 (lucene, solr, netty, hadoop) so we can explore these topics together. I will try to post one article a day, so please stay tuned; there will be more to come.