The main reason to use suggest rather than a full search is speed. In general, we need the autosuggest feature to satisfy two main requirements:
■ It must be fast; there are few things more annoying than a clunky type-ahead solution that cannot keep up with users as they type. The Suggester must be able to update the suggestions as the user types each character, so milliseconds matter.
■ It should return ranked suggestions ordered by term frequency, as there is little benefit to suggesting rare terms that occur in only a few documents in your index, especially when the user has typed only a few characters.
Lucene Suggest
This section walks through the source of the AnalyzingInfixSuggester class, with a test case built to help understand the overall process. The suggester builds its own index from the input it is given; in AnalyzingInfixSuggester, the main fields involved are:
- text: the search-keyword field that the user's input is matched against; a TextField, and it is stored;
- exacttext: identical to text except that it uses StringField and is not stored;
- contexts: a field also used for filtering, but as a secondary filter condition;
The index is built from an InputIterator; the example hand-writes one. The InputIterator interface determines the data source for the suggest index, and the value of each of its per-entry attributes must be supplied by the user. Building the index involves the following concepts:
- key: the search field; the terms produced by analyzing the user's input are matched against it;
- contexts: a set of Terms used for TermQuerys against the contexts field, adding a constraint on top of the keyword so the returned suggestions better fit the request, e.g. category or grouping information (restricting the scope: searching for shirts within menswear);
- weight: a numeric (int, long) field; results are sorted by it in descending order;
- payload: an extra piece of information stored as a BytesRef (essentially a byte[] written into the index); after a lookup, it comes back in the LookupResult's payload attribute and can be deserialized there;
- allTermsRequired: at lookup time, whether every term of the user's input must match.
A LookupResult carries the following information:
- key: the user's search keyword, echoed back to you;
- highlightKey: the highlighted form of the matched text, if you requested highlighting in the lookup;
- value: the return value of the InputIterator's weight() method, i.e. the weight of this suggestion; the sort order is based on it;
- payload: the payload supplied by the InputIterator's payload() method; it exists so you can stash whatever extra information you like;
- contexts: likewise, the return value of the InputIterator's contexts() method, echoed back to you.
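The relationship between the InputIterator's attributes and the ranked, context-filtered results can be sketched in plain Java. This is a conceptual model only; Entry and lookup below are illustrative names, not Lucene's actual classes.

```java
import java.util.*;
import java.util.stream.*;

// Conceptual model of the data an InputIterator supplies per suggestion
// and the weight-descending order LookupResults come back in.
public class SuggestModel {
    public record Entry(String key, long weight, String payload, Set<String> contexts) {}

    // Return up to n entries matching the prefix, highest weight first,
    // optionally restricted to a context (e.g. a category filter).
    public static List<Entry> lookup(List<Entry> index, String prefix, String context, int n) {
        return index.stream()
                .filter(e -> e.key().startsWith(prefix))
                .filter(e -> context == null || e.contexts().contains(context))
                .sorted(Comparator.comparingLong(Entry::weight).reversed())
                .limit(n)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Entry> index = List.of(
                new Entry("shirt", 100, "白襯衫", Set.of("menswear")),
                new Entry("shirt dress", 40, "襯衫裙", Set.of("womenswear")),
                new Entry("shoes", 80, "鞋", Set.of("menswear")));
        // Restrict to menswear, as in the "search for shirts within menswear" example.
        for (Entry e : lookup(index, "sh", "menswear", 2)) {
            System.out.println(e.key() + " " + e.weight()); // shirt 100, then shoes 80
        }
    }
}
```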
Building the Suggest index
As the Lucene suggester source shows, the suggester internally holds a SearcherManager and an IndexWriter. The index is built as follows:
@Override
public void build(InputIterator iter) throws IOException {
  if (searcherMgr != null) {
    searcherMgr.close();
    searcherMgr = null;
  }
  if (writer != null) {
    writer.close();
    writer = null;
  }
  boolean success = false;
  try {
    // First pass: build a temporary normal Lucene index,
    // just indexing the suggestions as they iterate:
    writer = new IndexWriter(dir,
        getIndexWriterConfig(getGramAnalyzer(), IndexWriterConfig.OpenMode.CREATE));
    // TODO: use threads?
    BytesRef text;
    while ((text = iter.next()) != null) {
      BytesRef payload;
      if (iter.hasPayloads()) {
        payload = iter.payload();
      } else {
        payload = null;
      }
      add(text, iter.contexts(), iter.weight(), payload);
    }
    // ... (remainder of build elided)
public void add(BytesRef text, Set<BytesRef> contexts, long weight, BytesRef payload) throws IOException {
  ensureOpen();
  writer.addDocument(buildDocument(text, contexts, weight, payload));
}
The key piece is buildDocument, which shows that each suggestion is stored as an internal Document:
private Document buildDocument(BytesRef text, Set<BytesRef> contexts, long weight, BytesRef payload) throws IOException {
  String textString = text.utf8ToString();
  Document doc = new Document();
  FieldType ft = getTextFieldType();
  doc.add(new Field(TEXT_FIELD_NAME, textString, ft));
  doc.add(new Field("textgrams", textString, ft));
  doc.add(new StringField(EXACT_TEXT_FIELD_NAME, textString, Field.Store.NO));
  doc.add(new BinaryDocValuesField(TEXT_FIELD_NAME, text));
  doc.add(new NumericDocValuesField("weight", weight));
  if (payload != null) {
    doc.add(new BinaryDocValuesField("payloads", payload));
  }
  if (contexts != null) {
    for (BytesRef context : contexts) {
      doc.add(new StringField(CONTEXTS_FIELD_NAME, context, Field.Store.NO));
      doc.add(new SortedSetDocValuesField(CONTEXTS_FIELD_NAME, context));
    }
  }
  return doc;
}
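The textString is indexed twice: once into the text field and once into "textgrams", which (as getGramAnalyzer suggests) is analyzed with edge n-grams so that very short prefixes can be answered with an exact TermQuery instead of a slower prefix scan. A stdlib sketch of the expansion, where maxPrefix stands in for a cap like AnalyzingInfixSuggester's minPrefixChars:

```java
import java.util.*;

// Sketch of edge n-gram expansion: a term is indexed once per prefix
// length up to a cap, so a 1-3 character query hits an indexed term directly.
public class EdgeNGrams {
    public static List<String> edgeNGrams(String term, int maxPrefix) {
        List<String> grams = new ArrayList<>();
        for (int len = 1; len <= Math.min(maxPrefix, term.length()); len++) {
            grams.add(term.substring(0, len));
        }
        return grams;
    }

    public static void main(String[] args) {
        System.out.println(edgeNGrams("shirt", 4)); // [s, sh, shi, shir]
    }
}
```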
Suggest lookup
Querying is done through the lookup method; the Sort used during the query is defined over the weight field:
private static final Sort SORT = new Sort(new SortField("weight", SortField.Type.LONG, true));
A fairly large BooleanQuery is then assembled; how its clauses are joined depends on the allTermsRequired property:
if (allTermsRequired) {
  occur = BooleanClause.Occur.MUST;
} else {
  occur = BooleanClause.Occur.SHOULD;
}
The queryAnalyzer tokenizes the input, and a TermQuery is added to the final query for each token; note that these Terms all target the text field:
try (TokenStream ts = queryAnalyzer.tokenStream("", new StringReader(key.toString()))) {
  ts.reset();
  final CharTermAttribute termAtt = ts.addAttribute(CharTermAttribute.class);
  final OffsetAttribute offsetAtt = ts.addAttribute(OffsetAttribute.class);
  String lastToken = null;
  query = new BooleanQuery.Builder();
  int maxEndOffset = -1;
  matchedTokens = new HashSet<>();
  while (ts.incrementToken()) {
    if (lastToken != null) {
      matchedTokens.add(lastToken);
      query.add(new TermQuery(new Term(TEXT_FIELD_NAME, lastToken)), occur);
    }
    lastToken = termAtt.toString();
    if (lastToken != null) {
      maxEndOffset = Math.max(maxEndOffset, offsetAtt.endOffset());
    }
  }
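Notice that the loop deliberately holds the last token back from the exact-match clauses: the trailing token is what the user may still be typing, and in the full Lucene source it is handled separately as a prefix match. The overall matching semantics the BooleanQuery expresses can be simulated in plain Java (a sketch, not Lucene's actual query classes):

```java
import java.util.*;

// Conceptual matching rule: every completed token must (or should) match
// exactly, while the trailing token is treated as a prefix.
public class InfixMatch {
    public static boolean matches(String suggestion, String query, boolean allTermsRequired) {
        List<String> sugTokens = Arrays.asList(suggestion.toLowerCase().split("\\s+"));
        String[] qTokens = query.toLowerCase().trim().split("\\s+");
        int matched = 0;
        for (int i = 0; i < qTokens.length; i++) {
            final String t = qTokens[i];
            boolean last = (i == qTokens.length - 1);
            boolean hit = last
                    ? sugTokens.stream().anyMatch(s -> s.startsWith(t)) // prefix for the trailing token
                    : sugTokens.contains(t);                            // exact for completed tokens
            if (hit) matched++;
        }
        // MUST: all clauses must hit; SHOULD: at least one hit suffices.
        return allTermsRequired ? matched == qTokens.length : matched > 0;
    }

    public static void main(String[] args) {
        System.out.println(matches("mens white shirt", "white sh", true));  // true
        System.out.println(matches("mens white shirt", "blue sh", true));   // false
        System.out.println(matches("mens white shirt", "blue sh", false));  // true
    }
}
```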
When querying with contexts in our example, the region string has to be converted into a Set of BytesRef values:
Set<BytesRef> contexts = new HashSet<>();
contexts.add(new BytesRef(region.getBytes("UTF8")));
List<Lookup.LookupResult> results = suggester.lookup(name, contexts, 2, true, false);
This completes the walkthrough of the Suggest component's basic flow.
The Solr Suggest component
First, define a searchComponent that provides the suggest functionality:
<searchComponent name="suggest" class="solr.SuggestComponent">
  <lst name="suggester">
    <str name="name">default</str>
    <str name="lookupImpl">FuzzyLookupFactory</str>
    <str name="dictionaryImpl">DocumentDictionaryFactory</str>
    <str name="field">suggest</str>
    <str name="weightField"></str>
    <str name="suggestAnalyzerFieldType">string</str>
    <str name="buildOnStartup">false</str>
  </lst>
</searchComponent>
A class diagram of the suggest components in use helps understand the overall process:
A LookupFactory creates a Lucene Suggester (a Lookup) from the current SolrCore and the configuration; the InputIterator we use is supplied by a Dictionary. Both classes have corresponding factories.
We can pick different Suggester classes, combined with a matching Dictionary, to provide suggestions as needed.
A declaration must also be added to a requestHandler to serve /suggest, responding to HTTP GET requests:
<requestHandler name="/suggest" class="org.apache.solr.handler.component.SearchHandler" startup="lazy">
  <lst name="defaults">
    <str name="suggest">true</str>
    <str name="suggest.count">10</str>
  </lst>
  <arr name="components">
    <str>suggest</str>
  </arr>
</requestHandler>
To verify the various Suggester types, we can add test cases locally. In AnalyzingInfixSuggester, the InputIterator is consumed exactly as in the build() method shown earlier: each entry returned by next() is added to the internal index together with its contexts, weight, and optional payload.
A FieldType can carry two Analyzers, one for index time and one for query time, both configured inside the fieldType element. The main difference between the string and text types is whether analysis is applied: string is not analyzed and is treated as a single token, while text is.
<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
    <!-- in this example, we will only use synonyms at query time
    <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
    -->
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
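The practical effect of the string-vs-text distinction can be sketched in plain Java; this is a simplified stdlib model of the two analysis behaviors (roughly StandardTokenizer plus LowerCaseFilter for text), not Solr's actual analyzer chain:

```java
import java.util.*;

// A "string" type keeps the value as one verbatim token; a "text" type
// tokenizes on whitespace and lowercases each token.
public class AnalyzeDemo {
    public static List<String> asString(String value) {
        return List.of(value); // one token, unchanged
    }

    public static List<String> asText(String value) {
        List<String> tokens = new ArrayList<>();
        for (String t : value.split("\\s+")) {
            if (!t.isEmpty()) tokens.add(t.toLowerCase());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(asString("White Shirt")); // [White Shirt]
        System.out.println(asText("White Shirt"));   // [white, shirt]
    }
}
```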
Example application scenario
Suppose we have a brand-keyword table and want to find brand names by their pinyin. We import the data into Solr with the following db-data-import entity:
<entity name="gt_brand" query="
  select brand_id, brand_name, brand_pinyin, brand_name_second, sort from gt_goods_brand
">
  <field column="brand_id" name="id"/>
  <field column="brand_name" name="brand_name"/>
  <field column="brand_pinyin" name="brand_pinyin"/>
  <field column="brand_name_second" name="brand_name_second"/>
  <field column="sort" name="sort"/>
</entity>
Here brand_pinyin serves as the key, sort as the weight, and brand_name is the text actually displayed after a search:
Directory indexDir = FSDirectory.open(Paths.get("/Users/xxx/develop/tools/solr-5.5.0/server/solr/suggest/data/index"));
StandardAnalyzer analyzer = new StandardAnalyzer();
AnalyzingInfixSuggester suggester = new AnalyzingInfixSuggester(indexDir, analyzer);
DirectoryReader directoryReader = DirectoryReader.open(indexDir);
DocumentDictionary documentDictionary = new DocumentDictionary(directoryReader, "brand_pinyin", "sort", "brand_name");
suggester.build(documentDictionary.getEntryIterator());
List<Lookup.LookupResult> results = suggester.lookup("nijiazhubao", 5, false, false);
for (Lookup.LookupResult lookupResult : results) {
  System.out.println(new String(lookupResult.payload.bytes, "UTF8"));
}
The matching suggester configuration then points at these fields:
<str name="field">brand_pinyin</str>
<str name="weightField">sort</str>
<str name="payloadField">brand_name</str>
<str name="suggestAnalyzerFieldType">string</str>
<str name="buildOnStartup">true</str>
Note that the field being suggested on must have appropriate analyzers (index and query), or no suggestions come back.
We then tried to define multiple searchComponents, since a searchHandler can list several searchComponent names, but this did not work:
<searchComponent name="suggest" class="solr.SuggestComponent">
  <lst name="suggester">
    <str name="name">default</str>
    <str name="lookupImpl">FuzzyLookupFactory</str> <!-- org.apache.solr.spelling.suggest.fst -->
    <str name="dictionaryImpl">DocumentDictionaryFactory</str> <!-- org.apache.solr.spelling.suggest.HighFrequencyDictionaryFactory -->
    <str name="field">category_name</str>
    <str name="weightField"></str>
    <str name="suggestAnalyzerFieldType">string</str>
  </lst>
</searchComponent>
<searchComponent name="suggest1" class="solr.SuggestComponent">
  <lst name="suggester">
    <str name="name">default</str>
    <str name="lookupImpl">FuzzyLookupFactory</str> <!-- org.apache.solr.spelling.suggest.fst -->
    <str name="dictionaryImpl">DocumentDictionaryFactory</str> <!-- org.apache.solr.spelling.suggest.HighFrequencyDictionaryFactory -->
    <str name="field">brand_name</str>
    <str name="weightField"></str>
    <str name="suggestAnalyzerFieldType">string</str>
  </lst>
</searchComponent>
<requestHandler name="/suggest" class="solr.SearchHandler" startup="lazy">
  <lst name="defaults">
    <str name="suggest">true</str>
    <str name="suggest.count">5</str>
  </lst>
  <arr name="components">
    <str>suggest</str>
    <str>suggest1</str>
  </arr>
</requestHandler>
This produced the following error:
suggest: org.apache.solr.common.SolrException:org.apache.solr.common.SolrException: org.apache.lucene.store.LockObtainFailedException: Lock held by this virtual machine: /Users/xxx/develop/tools/solr-5.5.0/server/solr/suggest/data/analyzingInfixSuggesterIndexDir/write.lock
This is again caused by indexPath: when multiple suggesters are configured, each one's index must live in its own directory (at least when using AnalyzingInfixLookupFactory). The source shows that the path can be given relative to the core's data directory:
String indexPath = params.get(INDEX_PATH) != null
    ? params.get(INDEX_PATH).toString()
    : DEFAULT_INDEX_PATH;
if (new File(indexPath).isAbsolute() == false) {
  indexPath = core.getDataDir() + File.separator + indexPath;
}
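One way to avoid the lock conflict, then, is to give each suggester its own indexPath. The snippet below is an illustrative sketch of such a configuration, assuming AnalyzingInfixLookupFactory (which reads the indexPath parameter); the directory names suggester_category and suggester_brand are made up for this example and are resolved relative to the core's data directory as shown above:

```xml
<searchComponent name="suggest" class="solr.SuggestComponent">
  <lst name="suggester">
    <str name="name">default</str>
    <str name="lookupImpl">AnalyzingInfixLookupFactory</str>
    <str name="dictionaryImpl">DocumentDictionaryFactory</str>
    <str name="field">category_name</str>
    <str name="indexPath">suggester_category</str> <!-- relative to the core data dir -->
    <str name="suggestAnalyzerFieldType">string</str>
  </lst>
</searchComponent>
<searchComponent name="suggest1" class="solr.SuggestComponent">
  <lst name="suggester">
    <str name="name">default</str>
    <str name="lookupImpl">AnalyzingInfixLookupFactory</str>
    <str name="dictionaryImpl">DocumentDictionaryFactory</str>
    <str name="field">brand_name</str>
    <str name="indexPath">suggester_brand</str> <!-- a different directory, so the write.lock files don't collide -->
    <str name="suggestAnalyzerFieldType">string</str>
  </lst>
</searchComponent>
```

With distinct directories, each suggester opens its own IndexWriter and the LockObtainFailedException no longer occurs.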