Merging posting lists for multiple MUST clauses

Variables and methods in ConjunctionDISI

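// lead1 wraps the iterator of the term with the smallest cost (the number of documents containing the term)
// lead2 wraps the iterator with the second-smallest cost: its cost is greater than or equal to lead1's,
// and less than or equal to the cost of every iterator in others
final DocIdSetIterator lead1, lead2;
// every iterator in the others array has a cost no smaller than the costs of lead1 and lead2
final DocIdSetIterator[] others;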

The constructor below sorts the DocIdSetIterator objects by cost, that is, by the number of documents containing each term, and then assigns lead1, lead2, and the others[] array.

private ConjunctionDISI(List<? extends DocIdSetIterator> iterators) {
  assert iterators.size() >= 2;
  // Sort the array the first time to allow the least frequent DocsEnum to
  // lead the matching.
  // TimSort is an optimized variant of merge sort
  CollectionUtil.timSort(iterators, new Comparator<DocIdSetIterator>() {
    @Override
    public int compare(DocIdSetIterator o1, DocIdSetIterator o2) {
      // cost is the number of documents containing the term; the smaller it is, the earlier the iterator sorts
      return Long.compare(o1.cost(), o2.cost());
    }
  });
  // the iterator with the smallest cost becomes lead1
  lead1 = iterators.get(0);
  // the iterator with the second-smallest cost becomes lead2
  lead2 = iterators.get(1);
  // all remaining iterators go into the others[] array
  others = iterators.subList(2, iterators.size()).toArray(new DocIdSetIterator[0]);
}

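A matching document must contain every query term. Once we have the document IDs for each term (query keyword), an efficient way to merge them is to iterate over the posting list of the term that appears in the fewest documents. For each document ID taken from that list (lead1), we check whether lead2 and every iterator in others contain the same document ID; if they all do, the document satisfies the query.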
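The merge depends on one primitive: moving to the first document ID that is not smaller than a given target. The following is a minimal sketch of that lookup over a sorted int array, assuming the posting list is held in memory. The class name `AdvanceSketch` and the stateless `advance` helper are hypothetical and only illustrate the idea; a real DocIdSetIterator keeps its own position and only moves forward.

```java
// Hypothetical helper for illustration only; not a Lucene class.
class AdvanceSketch {
    // Same sentinel value as DocIdSetIterator.NO_MORE_DOCS.
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;

    // Returns the first doc ID in the sorted array that is >= target,
    // or NO_MORE_DOCS when no such doc ID exists.
    static int advance(int[] sortedDocs, int target) {
        int lo = 0;
        int hi = sortedDocs.length;
        // Binary search for the leftmost position whose value is >= target.
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (sortedDocs[mid] < target) {
                lo = mid + 1;
            } else {
                hi = mid;
            }
        }
        return lo < sortedDocs.length ? sortedDocs[lo] : NO_MORE_DOCS;
    }

    public static void main(String[] args) {
        int[] docs = {2, 3, 5, 6, 8, 9};
        System.out.println(advance(docs, 4));  // 5
        System.out.println(advance(docs, 9));  // 9
        System.out.println(advance(docs, 10)); // 2147483647
    }
}
```

Because every lookup skips straight to the target, letting the shortest list choose the targets keeps the number of lookups small.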

The following method performs the merge:

private int doNext(int doc) throws IOException {
  advanceHead: for(;;) {
    // doc is the document ID lead1 is currently positioned on (IDs are visited in increasing order)
    assert doc == lead1.docID();
    // find agreement between the two iterators with the lower costs
    // we special case them because they do not need the
    // 'other.docID() < doc' check that the 'others' iterators need
    // advance lead2 to its first document ID that is not smaller than doc
    final int next2 = lead2.advance(doc);
    if (next2 != doc) {
      // advance lead1 to its first document ID that is not smaller than next2
      doc = lead1.advance(next2);
      if (next2 != doc) {
        // lead1 and lead2 are on different documents, so there is no point checking others[];
        // go back to the top and advance lead2 to the new doc
        continue;
      }
    }
    // reaching here means lead1 and lead2 are on the same document ID
    // (or iteration has ended, in which case doc is NO_MORE_DOCS, i.e. 2147483647)
    // then find agreement with other iterators
    // check whether every iterator in others is also on this document ID
    for (DocIdSetIterator other : others) {
      // other.doc may already be equal to doc if we "continued advanceHead"
      // on the previous iteration and the advance on the lead scorer exactly matched.
      // note: before an iterator has been advanced for the first time, other.docID() returns -1
      if (other.docID() < doc) {
        // advance to the first document ID that is not smaller than doc
        final int next = other.advance(doc);
        // next is either equal to doc or greater than doc
        if (next > doc) {
          // iterator beyond the current doc - advance lead and continue to the new highest doc.
          // the current doc does not match, so move lead1 to its first document ID not smaller than next
          doc = lead1.advance(next);
          // start the comparison over from lead2
          continue advanceHead;
        }
      }
    }
    // success - all iterators are on the same doc
    // found a document ID that contains every query term
    return doc;
  }
}

Example

  Document 0: a
  Document 1: b
  Document 2: c
  Document 3: a c
  Document 4: h
  Document 5: c e
  Document 6: c a
  Document 7: f
  Document 8: c d e c e
  Document 9: a c e a b c

The query is built as follows:

BooleanQuery.Builder query = new BooleanQuery.Builder();
query.add(new TermQuery(new Term("content", "a")), BooleanClause.Occur.MUST);
query.add(new TermQuery(new Term("content", "c")), BooleanClause.Occur.MUST);
query.add(new TermQuery(new Term("content", "e")), BooleanClause.Occur.MUST);
query.add(new TermQuery(new Term("content", "b")), BooleanClause.Occur.MUST);

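Documents containing the term "a":

0, 3, 6, 9

Documents containing the term "c":

2, 3, 5, 6, 8, 9

Documents containing the term "e":

5, 8, 9

Documents containing the term "b":

1, 9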

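After sorting, lead1 holds the document IDs containing "b", lead2 holds those containing "e", and the others[] array holds those containing "a" and those containing "c". To make the merge easier to follow, we treat each list of document IDs as a plain array (they are in fact stored as arrays, but delta-encoded; that is not explained here and will be covered later).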
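The ordering above can be reproduced with a short sketch. It assumes that the length of each posting list stands in for cost(); the class `CostOrderSketch` and the method `orderByCost` are hypothetical names used only for this illustration, and the sort here is `List.sort` rather than Lucene's `CollectionUtil.timSort`.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration class; not part of Lucene.
class CostOrderSketch {
    // Orders terms by posting-list length, which stands in for cost() here (an assumption).
    static List<String> orderByCost(Map<String, int[]> postings) {
        List<String> terms = new ArrayList<>(postings.keySet());
        // List.sort is stable, so terms with equal cost keep their original order.
        terms.sort(Comparator.comparingInt((String term) -> postings.get(term).length));
        return terms;
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps the clause order used when the query was built.
        Map<String, int[]> postings = new LinkedHashMap<>();
        postings.put("a", new int[] {0, 3, 6, 9});
        postings.put("c", new int[] {2, 3, 5, 6, 8, 9});
        postings.put("e", new int[] {5, 8, 9});
        postings.put("b", new int[] {1, 9});

        List<String> order = orderByCost(postings);
        System.out.println("lead1  = " + order.get(0));                   // b
        System.out.println("lead2  = " + order.get(1));                   // e
        System.out.println("others = " + order.subList(2, order.size())); // [a, c]
    }
}
```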

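Merge process

Take the first document ID from lead1's array, doc1 = 1, then take from lead2's array the first document ID that is not smaller than doc1 (1), which gives doc2 = 5.

Since doc1 (1) ≠ doc2 (5), there is no need to compare against the arrays in others[]. Next, take from lead1's array the first value that is not smaller than doc2 (5), which is 9, so doc1 becomes 9. Then take from lead2's array the first value that is not smaller than doc1 (9), so doc2 becomes 9.

Now doc1 (9) = doc2 (9), so we have to compare against the arrays in others[]. From the first array in others[], Array1 (the documents containing "a"), take the first document ID that is not smaller than doc1 (9), which gives doc3 = 9.

Because doc1 (9) = doc3 (9), we continue with the second array in others[], Array2 (the documents containing "c"), and take the first document ID that is not smaller than doc1 (9), which gives doc4 = 9.

At this point every array contains document ID 9, so this document satisfies the query.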
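The walk-through can be checked with a small simulation. This is a sketch under stated assumptions, not Lucene's code: `ArrayIterator` is a hypothetical array-backed stand-in for DocIdSetIterator, and `intersect` is a hypothetical driver that calls a `doNext` modelled on the method shown earlier.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical simulation of the merge over plain arrays; not a Lucene class.
class ConjunctionSketch {
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;

    // Array-backed stand-in for DocIdSetIterator (an assumption for illustration).
    static final class ArrayIterator {
        private final int[] docs;
        private int idx = -1; // -1 means "not positioned yet"

        ArrayIterator(int... docs) { this.docs = docs; }

        int docID() {
            if (idx < 0) return -1;
            return idx < docs.length ? docs[idx] : NO_MORE_DOCS;
        }

        int nextDoc() { idx++; return docID(); }

        // Moves forward to the first doc ID >= target beyond the current position.
        int advance(int target) {
            idx++;
            while (idx < docs.length && docs[idx] < target) idx++;
            return docID();
        }
    }

    // Same control flow as the doNext method shown earlier.
    static int doNext(int doc, ArrayIterator lead1, ArrayIterator lead2, ArrayIterator[] others) {
        advanceHead:
        for (;;) {
            int next2 = lead2.advance(doc);
            if (next2 != doc) {
                doc = lead1.advance(next2);
                if (next2 != doc) continue; // leads disagree, try again
            }
            for (ArrayIterator other : others) {
                if (other.docID() < doc) {
                    int next = other.advance(doc);
                    if (next > doc) {
                        doc = lead1.advance(next);
                        continue advanceHead;
                    }
                }
            }
            return doc; // all iterators agree, or doc is NO_MORE_DOCS
        }
    }

    // Collects every doc ID present in all lists; iterators must already be ordered by cost.
    static List<Integer> intersect(ArrayIterator lead1, ArrayIterator lead2, ArrayIterator... others) {
        List<Integer> hits = new ArrayList<>();
        int doc = doNext(lead1.nextDoc(), lead1, lead2, others);
        while (doc != NO_MORE_DOCS) {
            hits.add(doc);
            doc = doNext(lead1.nextDoc(), lead1, lead2, others);
        }
        return hits;
    }

    public static void main(String[] args) {
        List<Integer> hits = intersect(
            new ArrayIterator(1, 9),              // "b" -> lead1
            new ArrayIterator(5, 8, 9),           // "e" -> lead2
            new ArrayIterator(0, 3, 6, 9),        // "a" -> others[0]
            new ArrayIterator(2, 3, 5, 6, 8, 9)); // "c" -> others[1]
        System.out.println(hits); // [9]
    }
}
```

With the example data, the arrays for "a" and "c" are only consulted once, at document 9, because lead1 and lead2 never agree on any earlier document.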
