The main reason to use suggest rather than a full search is speed. In general, we need the autosuggest feature to satisfy two main requirements:
■ It must be fast; there are few things more annoying than a clunky type-ahead solution that cannot keep up with users as they type. The Suggester must be able to update the suggestions as the user types each character, so milliseconds matter.
■ It should return ranked suggestions ordered by term frequency, as there is little benefit to suggesting rare terms that occur in only a few documents in your index, especially when the user has typed only a few characters.
Lucene Suggest
This section walks through the source of the AnalyzingInfixSuggester class, with a test case built to help understand the overall process. The suggester builds its own index from the input it is given; in AnalyzingInfixSuggester, the main fields involved are:
- text: the search-keyword field that the user's input is matched against; a TextField, and it is stored;
- exacttext: identical to text except that it uses StringField and is not stored;
- contexts: a field also used for filtering, but as a secondary filter condition;
The index is built from an InputIterator; the example hand-writes one. The InputIterator interface determines the data source for the suggest index, and the value of each of its per-entry attributes must be supplied by the user. Building the index involves the following concepts:
- key: the search field; the terms produced by analyzing the user's input are matched against it;
- contexts: a set of Terms used for TermQuerys against the contexts field, adding a constraint on top of the keyword so the returned suggestions better fit the request, e.g. category or grouping information (restricting the scope: searching for shirts within menswear);
- weight: a numeric (int, long) field; results are sorted by it in descending order;
- payload: an extra piece of information stored as a BytesRef (essentially a byte[] written into the index); after a lookup, it comes back in the LookupResult's payload attribute and can be deserialized there;
- allTermsRequired: at lookup time, whether every term of the user's input must match.
A LookupResult carries the following information:
- key: the user's search keyword, echoed back to you;
- highlightKey: the highlighted form of the matched text, if you requested highlighting in the lookup;
- value: the return value of the InputIterator's weight() method, i.e. the weight of this suggestion; the sort order is based on it;
- payload: the payload supplied by the InputIterator's payload() method; it exists so you can stash whatever extra information you like;
- contexts: likewise, the return value of the InputIterator's contexts() method, echoed back to you.
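The relationship between the InputIterator's attributes and the ranked, context-filtered results can be sketched in plain Java. This is a conceptual model only; Entry and lookup below are illustrative names, not Lucene's actual classes.

```java
import java.util.*;
import java.util.stream.*;

// Conceptual model of the data an InputIterator supplies per suggestion
// and the weight-descending order LookupResults come back in.
public class SuggestModel {
    public record Entry(String key, long weight, String payload, Set<String> contexts) {}

    // Return up to n entries matching the prefix, highest weight first,
    // optionally restricted to a context (e.g. a category filter).
    public static List<Entry> lookup(List<Entry> index, String prefix, String context, int n) {
        return index.stream()
                .filter(e -> e.key().startsWith(prefix))
                .filter(e -> context == null || e.contexts().contains(context))
                .sorted(Comparator.comparingLong(Entry::weight).reversed())
                .limit(n)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Entry> index = List.of(
                new Entry("shirt", 100, "白襯衫", Set.of("menswear")),
                new Entry("shirt dress", 40, "襯衫裙", Set.of("womenswear")),
                new Entry("shoes", 80, "鞋", Set.of("menswear")));
        // Restrict to menswear, as in the "search for shirts within menswear" example.
        for (Entry e : lookup(index, "sh", "menswear", 2)) {
            System.out.println(e.key() + " " + e.weight()); // shirt 100, then shoes 80
        }
    }
}
```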
Building the Suggest index
As the Lucene suggester source shows, the suggester internally holds a SearcherManager and an IndexWriter. The index is built as follows:
@Override
public void build(InputIterator iter) throws IOException {
  if (searcherMgr != null) {
    searcherMgr.close();
    searcherMgr = null;
  }
  if (writer != null) {
    writer.close();
    writer = null;
  }
  boolean success = false;
  try {
    // First pass: build a temporary normal Lucene index,
    // just indexing the suggestions as they iterate:
    writer = new IndexWriter(dir,
        getIndexWriterConfig(getGramAnalyzer(), IndexWriterConfig.OpenMode.CREATE));
    // TODO: use threads?
    BytesRef text;
    while ((text = iter.next()) != null) {
      BytesRef payload;
      if (iter.hasPayloads()) {
        payload = iter.payload();
      } else {
        payload = null;
      }
      add(text, iter.contexts(), iter.weight(), payload);
    }
    // ... (remainder of build elided)
public void add(BytesRef text, Set<BytesRef> contexts, long weight, BytesRef payload) throws IOException {
  ensureOpen();
  writer.addDocument(buildDocument(text, contexts, weight, payload));
}
The key piece is buildDocument, which shows that each suggestion is stored as an internal Document:
private Document buildDocument(BytesRef text, Set<BytesRef> contexts, long weight, BytesRef payload) throws IOException {
  String textString = text.utf8ToString();
  Document doc = new Document();
  FieldType ft = getTextFieldType();
  doc.add(new Field(TEXT_FIELD_NAME, textString, ft));
  doc.add(new Field("textgrams", textString, ft));
  doc.add(new StringField(EXACT_TEXT_FIELD_NAME, textString, Field.Store.NO));
  doc.add(new BinaryDocValuesField(TEXT_FIELD_NAME, text));
  doc.add(new NumericDocValuesField("weight", weight));
  if (payload != null) {
    doc.add(new BinaryDocValuesField("payloads", payload));
  }
  if (contexts != null) {
    for (BytesRef context : contexts) {
      doc.add(new StringField(CONTEXTS_FIELD_NAME, context, Field.Store.NO));
      doc.add(new SortedSetDocValuesField(CONTEXTS_FIELD_NAME, context));
    }
  }
  return doc;
}
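The textString is indexed twice: once into the text field and once into "textgrams", which (as getGramAnalyzer suggests) is analyzed with edge n-grams so that very short prefixes can be answered with an exact TermQuery instead of a slower prefix scan. A stdlib sketch of the expansion, where maxPrefix stands in for a cap like AnalyzingInfixSuggester's minPrefixChars:

```java
import java.util.*;

// Sketch of edge n-gram expansion: a term is indexed once per prefix
// length up to a cap, so a 1-3 character query hits an indexed term directly.
public class EdgeNGrams {
    public static List<String> edgeNGrams(String term, int maxPrefix) {
        List<String> grams = new ArrayList<>();
        for (int len = 1; len <= Math.min(maxPrefix, term.length()); len++) {
            grams.add(term.substring(0, len));
        }
        return grams;
    }

    public static void main(String[] args) {
        System.out.println(edgeNGrams("shirt", 4)); // [s, sh, shi, shir]
    }
}
```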
Suggest lookup
Querying is done through the lookup method; the Sort used during the query is defined over the weight field:
private static final Sort SORT = new Sort(new SortField("weight", SortField.Type.LONG, true));
A fairly large BooleanQuery is then assembled; how its clauses are joined depends on the allTermsRequired property:
if (allTermsRequired) {
  occur = BooleanClause.Occur.MUST;
} else {
  occur = BooleanClause.Occur.SHOULD;
}
The queryAnalyzer tokenizes the input, and a TermQuery is added to the final query for each token; note that these Terms all target the text field:
try (TokenStream ts = queryAnalyzer.tokenStream("", new StringReader(key.toString()))) {
  ts.reset();
  final CharTermAttribute termAtt = ts.addAttribute(CharTermAttribute.class);
  final OffsetAttribute offsetAtt = ts.addAttribute(OffsetAttribute.class);
  String lastToken = null;
  query = new BooleanQuery.Builder();
  int maxEndOffset = -1;
  matchedTokens = new HashSet<>();
  while (ts.incrementToken()) {
    if (lastToken != null) {
      matchedTokens.add(lastToken);
      query.add(new TermQuery(new Term(TEXT_FIELD_NAME, lastToken)), occur);
    }
    lastToken = termAtt.toString();
    if (lastToken != null) {
      maxEndOffset = Math.max(maxEndOffset, offsetAtt.endOffset());
    }
  }
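Notice that the loop deliberately holds the last token back from the exact-match clauses: the trailing token is what the user may still be typing, and in the full Lucene source it is handled separately as a prefix match. The overall matching semantics the BooleanQuery expresses can be simulated in plain Java (a sketch, not Lucene's actual query classes):

```java
import java.util.*;

// Conceptual matching rule: every completed token must (or should) match
// exactly, while the trailing token is treated as a prefix.
public class InfixMatch {
    public static boolean matches(String suggestion, String query, boolean allTermsRequired) {
        List<String> sugTokens = Arrays.asList(suggestion.toLowerCase().split("\\s+"));
        String[] qTokens = query.toLowerCase().trim().split("\\s+");
        int matched = 0;
        for (int i = 0; i < qTokens.length; i++) {
            final String t = qTokens[i];
            boolean last = (i == qTokens.length - 1);
            boolean hit = last
                    ? sugTokens.stream().anyMatch(s -> s.startsWith(t)) // prefix for the trailing token
                    : sugTokens.contains(t);                            // exact for completed tokens
            if (hit) matched++;
        }
        // MUST: all clauses must hit; SHOULD: at least one hit suffices.
        return allTermsRequired ? matched == qTokens.length : matched > 0;
    }

    public static void main(String[] args) {
        System.out.println(matches("mens white shirt", "white sh", true));  // true
        System.out.println(matches("mens white shirt", "blue sh", true));   // false
        System.out.println(matches("mens white shirt", "blue sh", false));  // true
    }
}
```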
When querying with contexts in our example, the region string has to be converted into a Set of BytesRef values:
Set<BytesRef> contexts = new HashSet<>();
contexts.add(new BytesRef(region.getBytes("UTF8")));
List<Lookup.LookupResult> results = suggester.lookup(name, contexts, 2, true, false);
This completes the walkthrough of the Suggest component's basic flow.
The Solr Suggest component
First, define a searchComponent that provides the suggest functionality:
<searchComponent name="suggest" class="solr.SuggestComponent">
  <lst name="suggester">
    <str name="name">default</str>
    <str name="lookupImpl">FuzzyLookupFactory</str>
    <str name="dictionaryImpl">DocumentDictionaryFactory</str>
    <str name="field">suggest</str>
    <str name="weightField"></str>
    <str name="suggestAnalyzerFieldType">string</str>
    <str name="buildOnStartup">false</str>
  </lst>
</searchComponent>
A class diagram of the suggest components in use helps understand the overall process:
A LookupFactory creates a Lucene Suggester (a Lookup) from the current SolrCore and the configuration; the InputIterator we use is supplied by a Dictionary. Both classes have corresponding factories.
We can pick different Suggester classes, combined with a matching Dictionary, to provide suggestions as needed.
A declaration must also be added to a requestHandler to serve /suggest, responding to HTTP GET requests:
<requestHandler name="/suggest" class="org.apache.solr.handler.component.SearchHandler" startup="lazy">
  <lst name="defaults">
    <str name="suggest">true</str>
    <str name="suggest.count">10</str>
  </lst>
  <arr name="components">
    <str>suggest</str>
  </arr>
</requestHandler>
To verify the various Suggester types, we can add test cases locally. In AnalyzingInfixSuggester, the InputIterator is consumed exactly as in the build() method shown earlier: each entry returned by next() is added to the internal index together with its contexts, weight, and optional payload.
A FieldType can carry two Analyzers, one for index time and one for query time, both configured inside the fieldType element. The main difference between the string and text types is whether analysis is applied: string is not analyzed and is treated as a single token, while text is.
<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
    <!-- in this example, we will only use synonyms at query time
    <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
    -->
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
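The practical effect of the string-vs-text distinction can be sketched in plain Java; this is a simplified stdlib model of the two analysis behaviors (roughly StandardTokenizer plus LowerCaseFilter for text), not Solr's actual analyzer chain:

```java
import java.util.*;

// A "string" type keeps the value as one verbatim token; a "text" type
// tokenizes on whitespace and lowercases each token.
public class AnalyzeDemo {
    public static List<String> asString(String value) {
        return List.of(value); // one token, unchanged
    }

    public static List<String> asText(String value) {
        List<String> tokens = new ArrayList<>();
        for (String t : value.split("\\s+")) {
            if (!t.isEmpty()) tokens.add(t.toLowerCase());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(asString("White Shirt")); // [White Shirt]
        System.out.println(asText("White Shirt"));   // [white, shirt]
    }
}
```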
Example application scenario
Suppose we have a brand-keyword table and want to find brand names by their pinyin. We import the data into Solr with the following db-data-import entity:
<entity name="gt_brand" query="
  select brand_id, brand_name, brand_pinyin, brand_name_second, sort from gt_goods_brand
">
  <field column="brand_id" name="id"/>
  <field column="brand_name" name="brand_name"/>
  <field column="brand_pinyin" name="brand_pinyin"/>
  <field column="brand_name_second" name="brand_name_second"/>
  <field column="sort" name="sort"/>
</entity>
Here brand_pinyin serves as the key, sort as the weight, and brand_name is the text actually displayed after a search:
Directory indexDir = FSDirectory.open(Paths.get("/Users/xxx/develop/tools/solr-5.5.0/server/solr/suggest/data/index"));
StandardAnalyzer analyzer = new StandardAnalyzer();
AnalyzingInfixSuggester suggester = new AnalyzingInfixSuggester(indexDir, analyzer);
DirectoryReader directoryReader = DirectoryReader.open(indexDir);
DocumentDictionary documentDictionary = new DocumentDictionary(directoryReader, "brand_pinyin", "sort", "brand_name");
suggester.build(documentDictionary.getEntryIterator());
List<Lookup.LookupResult> results = suggester.lookup("nijiazhubao", 5, false, false);
for (Lookup.LookupResult lookupResult : results) {
  System.out.println(new String(lookupResult.payload.bytes, "UTF8"));
}
The matching suggester configuration then points at these fields:
<str name="field">brand_pinyin</str>
<str name="weightField">sort</str>
<str name="payloadField">brand_name</str>
<str name="suggestAnalyzerFieldType">string</str>
<str name="buildOnStartup">true</str>
Note that the field being suggested on must have appropriate analyzers (index and query), or no suggestions come back.
We then tried to define multiple searchComponents, since a searchHandler can list several searchComponent names, but this did not work:
<searchComponent name="suggest" class="solr.SuggestComponent">
  <lst name="suggester">
    <str name="name">default</str>
    <str name="lookupImpl">FuzzyLookupFactory</str> <!-- org.apache.solr.spelling.suggest.fst -->
    <str name="dictionaryImpl">DocumentDictionaryFactory</str> <!-- org.apache.solr.spelling.suggest.HighFrequencyDictionaryFactory -->
    <str name="field">category_name</str>
    <str name="weightField"></str>
    <str name="suggestAnalyzerFieldType">string</str>
  </lst>
</searchComponent>
<searchComponent name="suggest1" class="solr.SuggestComponent">
  <lst name="suggester">
    <str name="name">default</str>
    <str name="lookupImpl">FuzzyLookupFactory</str> <!-- org.apache.solr.spelling.suggest.fst -->
    <str name="dictionaryImpl">DocumentDictionaryFactory</str> <!-- org.apache.solr.spelling.suggest.HighFrequencyDictionaryFactory -->
    <str name="field">brand_name</str>
    <str name="weightField"></str>
    <str name="suggestAnalyzerFieldType">string</str>
  </lst>
</searchComponent>
<requestHandler name="/suggest" class="solr.SearchHandler" startup="lazy">
  <lst name="defaults">
    <str name="suggest">true</str>
    <str name="suggest.count">5</str>
  </lst>
  <arr name="components">
    <str>suggest</str>
    <str>suggest1</str>
  </arr>
</requestHandler>
This produced the following error:
suggest: org.apache.solr.common.SolrException:org.apache.solr.common.SolrException: org.apache.lucene.store.LockObtainFailedException: Lock held by this virtual machine: /Users/xxx/develop/tools/solr-5.5.0/server/solr/suggest/data/analyzingInfixSuggesterIndexDir/write.lock
This is again caused by indexPath: when multiple suggesters are configured, each one's index must live in its own directory (at least when using AnalyzingInfixLookupFactory). The source shows that the path can be given relative to the core's data directory:
String indexPath = params.get(INDEX_PATH) != null
    ? params.get(INDEX_PATH).toString()
    : DEFAULT_INDEX_PATH;
if (new File(indexPath).isAbsolute() == false) {
  indexPath = core.getDataDir() + File.separator + indexPath;
}
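One way to avoid the lock conflict, then, is to give each suggester its own indexPath. The snippet below is an illustrative sketch of such a configuration, assuming AnalyzingInfixLookupFactory (which reads the indexPath parameter); the directory names suggester_category and suggester_brand are made up for this example and are resolved relative to the core's data directory as shown above:

```xml
<searchComponent name="suggest" class="solr.SuggestComponent">
  <lst name="suggester">
    <str name="name">default</str>
    <str name="lookupImpl">AnalyzingInfixLookupFactory</str>
    <str name="dictionaryImpl">DocumentDictionaryFactory</str>
    <str name="field">category_name</str>
    <str name="indexPath">suggester_category</str> <!-- relative to the core data dir -->
    <str name="suggestAnalyzerFieldType">string</str>
  </lst>
</searchComponent>
<searchComponent name="suggest1" class="solr.SuggestComponent">
  <lst name="suggester">
    <str name="name">default</str>
    <str name="lookupImpl">AnalyzingInfixLookupFactory</str>
    <str name="dictionaryImpl">DocumentDictionaryFactory</str>
    <str name="field">brand_name</str>
    <str name="indexPath">suggester_brand</str> <!-- a different directory, so the write.lock files don't collide -->
    <str name="suggestAnalyzerFieldType">string</str>
  </lst>
</searchComponent>
```

With distinct directories, each suggester opens its own IndexWriter and the LockObtainFailedException no longer occurs.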