Dissecting CLucene, Part 10: Index Merging in Detail (1)

Original

聪明的狐狸

2020-06-28 04:55

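The previous post touched on index merging. SegmentMerger.merge() consists of the following main steps:

Merge fields: mergeFields()
Merge the term dictionary and postings lists: mergeTerms()
Merge norms (normalization factors): mergeNorms()
Merge term vectors: mergeVectors()
Each step is described in detail below.
- Merge fields: mergeFields()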

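This step has two parts: merging the .fnm information (field metadata), and merging the .fdt and .fdx information (stored field data).

Merging the .fnm information
• First create a new field metadata object: fieldInfos = new FieldInfos();
• Then use each reader in turn to read the field metadata of the segment being merged and add it to that object. The code is as follows: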

    fieldInfos = _CLNEW FieldInfos(); // merge field names
    SegmentReader* reader = NULL;
    int32_t docCount = 0;

    //Iterate through all readers
    for (uint32_t i = 0; i < readers.size(); i++){

        reader = readers[i];

        TCHAR** tmp = NULL;

        tmp = reader->getIndexedFieldNames(true);//indexed fields that store a term vector
        fieldInfos->add((const TCHAR**)tmp, true, true);

        tmp = reader->getIndexedFieldNames(false);//indexed fields that do not store a term vector
        fieldInfos->add((const TCHAR**)tmp, true, false);

        tmp = reader->getFieldNames(false);//fields that are not indexed
        fieldInfos->add((const TCHAR**)tmp, false, false);
    }
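To make the effect of the three add() calls concrete, here is a minimal, self-contained sketch of how field metadata from several segments might be combined. SimpleFieldInfo and SimpleFieldInfos are hypothetical stand-ins, not CLucene classes, and the sketch assumes that a field seen more than once keeps its first number and has its flags OR-ed together.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for one field's metadata (not CLucene's FieldInfo).
struct SimpleFieldInfo {
    std::string name;
    bool isIndexed;
    bool storeTermVector;
};

// Hypothetical stand-in for FieldInfos: numbers fields in first-seen order.
class SimpleFieldInfos {
public:
    // Register a batch of field names that share the same flags.
    void add(const std::vector<std::string>& names, bool isIndexed, bool storeTermVector) {
        for (const std::string& name : names) {
            auto it = byName_.find(name);
            if (it == byName_.end()) {
                // New field: its number is its position in fields_.
                byName_[name] = fields_.size();
                fields_.push_back({name, isIndexed, storeTermVector});
            } else {
                // Known field: assumed rule is that a flag set once stays set.
                SimpleFieldInfo& fi = fields_[it->second];
                fi.isIndexed = fi.isIndexed || isIndexed;
                fi.storeTermVector = fi.storeTermVector || storeTermVector;
            }
        }
    }

    std::size_t size() const { return fields_.size(); }

    const SimpleFieldInfo& fieldInfo(std::size_t i) const { return fields_[i]; }

    // Returns the field number, or -1 if the name is unknown.
    int fieldNumber(const std::string& name) const {
        auto it = byName_.find(name);
        return it == byName_.end() ? -1 : static_cast<int>(it->second);
    }

private:
    std::vector<SimpleFieldInfo> fields_;
    std::unordered_map<std::string, std::size_t> byName_;
};
```

Calling add() once per flag combination for every segment, as the excerpt above does, yields a single numbering that all merged segments share.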

Merging the stored field data (.fdt, .fdx)

Use an IndexReader to read every document to be merged and add it to a FieldsWriter. Pseudocode:

FieldsWriter* fieldsWriter = _CLNEW FieldsWriter(directory, segment, fieldInfos);

    try {  
        IndexReader* reader = NULL;
        int32_t maxDoc          = 0;

        //Iterate through all readers
        for (uint32_t i = 0; i < readers.size(); i++) {
            //get the i-th reader
            reader = (SegmentReader*)readers[i];
            //Total number of documents, including those marked as deleted
            maxDoc = reader->maxDoc();

            //Iterate through all the documents managed by the current reader
            for (int32_t j = 0; j < maxDoc; j++){
                //Check if the j-th document has been deleted, if so skip it
                if (!reader->isDeleted(j)){ 
                    //Get the document
                    Document* doc = reader->document(j);
                    //Add the document to the new FieldsWriter
                    fieldsWriter->addDocument( doc );
                    docCount++;
                    //doc is not used anymore so have it deleted
                    _CLDELETE(doc);
                }
            }
        }
    }_CLFINALLY(
        fieldsWriter->close();
        _CLDELETE( fieldsWriter );
    );

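Merging norms: mergeNorms()
Merging norms is fairly simple and resembles merging the stored field data: for each indexed field, read the norms through the reader of every segment being merged and write them to the newly created segment.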

void SegmentMerger::mergeNorms() 
{
    IndexReader* reader  = NULL;
    OutputStream*  output  = NULL;

    //iterate through all the Field Infos instances
    for (int32_t i = 0; i < fieldInfos->size(); i++) {

        FieldInfo* fi = fieldInfos->fieldInfo(i);

        if (fi->isIndexed){
            //Create an new filename for the norm file
            const char* buf = Misc::segmentname(segment,".f", i);
            output = directory->createFile( buf );
            _CLDELETE_CaARRAY( buf );


            //Iterate through all SegmentReaders
            for (uint32_t j = 0; j < readers.size(); j++) {
                //Get the j-th IndexReader
                reader = readers[j];


                //Get the norms of this field in this segment as a byte array (NULL if there are none)
                uint8_t* input = reader->norms(fi->name);

                //Get the total number of documents including the documents that have been marked deleted
                int32_t maxDoc = reader->maxDoc();
                //Iterate through all the documents
                for(int32_t k = 0; k < maxDoc; k++) {
                    uint8_t norm = input != NULL ? input[k] : 0;

                    //Check if document k is deleted
                    if (!reader->isDeleted(k)){
                        //write the new norm
                        output->writeByte(norm);
                    }
                }
            } 

            if (output != NULL){
                //Close the OutputStream output
                output->close();
                //destroy it
                _CLDELETE(output);
            }

        }
    }
}
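The inner loops of mergeNorms() can be tried out on in-memory data. The sketch below is a simplified model for one field, not CLucene code: SegmentNorms is a hypothetical type, the size of its deleted vector plays the role of maxDoc(), and an empty norms vector plays the role of a NULL norms array (it is assumed to be either empty or as long as deleted).

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical in-memory view of one segment's norms for a single field.
struct SegmentNorms {
    std::vector<std::uint8_t> norms; // one byte per document, or empty if the field has no norms
    std::vector<bool> deleted;       // deleted[k] is true if document k is marked deleted
};

// Concatenate the norms of all segments, skipping deleted documents.
std::vector<std::uint8_t> mergeNorms(const std::vector<SegmentNorms>& segments) {
    std::vector<std::uint8_t> output;
    for (const SegmentNorms& seg : segments) {
        const std::size_t maxDoc = seg.deleted.size();
        for (std::size_t k = 0; k < maxDoc; ++k) {
            // A segment without norms for this field contributes 0 for each document.
            const std::uint8_t norm = seg.norms.empty() ? 0 : seg.norms[k];
            if (!seg.deleted[k]) {
                output.push_back(norm);
            }
        }
    }
    return output;
}
```

Because deleted documents are skipped, the merged norms line up with the documents that the stored-field merge actually wrote.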

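Merging term vectors: mergeVectors()
Merging term vectors is very similar to merging norms, so it is not described again here.
Merging the term dictionary and postings lists
All of the steps above merge forward information, and the process is relatively straightforward. Merging the term dictionary and the postings lists is not as simple: CLucene requires the terms in the dictionary to be sorted in lexicographic order, and the document numbers in a postings list, which are internal IDs, to be sorted in ascending order, yet within each segment document numbering starts from zero.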

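Merging the inverted information therefore has two parts:

Merging the term dictionaries, which requires re-sorting the terms.
For each term that occurs in more than one segment, merging the lists of document numbers that contain it, which requires renumbering the documents.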

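The latter is relatively simple. Suppose the first segment holds N documents numbered 0 to N-1 and the second holds M documents numbered 0 to M-1. After merging, the first segment's documents keep the numbers 0 to N-1 and the second segment's become N to N+M-1; that is, each number is shifted by an offset equal to the number of documents in the preceding segments.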
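The offset rule can be written as a running sum over the segment sizes. The following is a minimal sketch with hypothetical helper names (segmentBases, remapDoc); it assumes no documents are deleted, so each segment's size equals its document count.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// For each segment, compute the global number of its first document:
// the total number of documents in all preceding segments.
std::vector<std::int32_t> segmentBases(const std::vector<std::int32_t>& docCounts) {
    std::vector<std::int32_t> bases;
    bases.reserve(docCounts.size());
    std::int32_t base = 0;
    for (std::int32_t count : docCounts) {
        bases.push_back(base);
        base += count;
    }
    return bases;
}

// Translate a segment-local document number into a number in the merged segment.
std::int32_t remapDoc(const std::vector<std::int32_t>& bases,
                      std::size_t segment,
                      std::int32_t localDoc) {
    return bases[segment] + localDoc;
}
```

With segments of 3, 2 and 4 documents the bases are 0, 3 and 5, so local document 1 of the third segment becomes document 6 after the merge.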

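Merging the dictionaries requires finding the terms that the segments share. CLucene does this with an array of SegmentMergeInfo objects and a SegmentMergeQueue called queue. SegmentMergeQueue derives from PriorityQueue and is a priority queue ordered lexicographically by term. Each SegmentMergeInfo holds the term dictionary and postings information of one segment being merged, and the key used to order it in the SegmentMergeQueue is the current term of the segment it represents.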

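Let's walk through an example of merging the dictionaries, which will make the code easier to follow later on:
Suppose five segments are to be merged, and the terms of each segment are already sorted lexicographically, as shown in the figure below.
First, all five segments are put into the priority queue, where they are ordered by the lexicographic order of their first term, as shown in the figure below.
1. Pop the segment at the head of the priority queue, whose current term is Term("a"), and put it into the match array.
2. Pop every other segment whose current term is Term("a") from the queue and put it into the match array as well. (Figure 2)
3. Merge the postings lists for Term("a") from these segments, and add the term together with its merged postings list to the new segment.
4. Put the segments in the match array that still have terms left back into the priority queue. The queue then looks like the one below. (Figure 3)
5. Go back to step 1 and repeat until the queue is empty.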
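The five steps above amount to a k-way merge driven by a priority queue. The sketch below illustrates the idea with the standard library and is not CLucene's implementation: Segment, Cursor (playing the role of SegmentMergeInfo) and mergeTerms are hypothetical names, postings are plain vectors of segment-local document numbers, and deleted documents are ignored. Ties between equal terms are broken by the segment's document offset, an assumption made here so that the merged postings come out in ascending order.

```cpp
#include <cstddef>
#include <cstdint>
#include <queue>
#include <string>
#include <utility>
#include <vector>

using Postings = std::vector<std::int32_t>;
using TermEntry = std::pair<std::string, Postings>;

// Hypothetical in-memory segment: terms sorted lexicographically.
struct Segment {
    std::vector<TermEntry> terms;
    std::int32_t docCount;
};

// Plays the role of SegmentMergeInfo: a position in one segment's dictionary.
struct Cursor {
    const Segment* segment;
    std::size_t pos;
    std::int32_t base; // document number offset of this segment
    const std::string& term() const { return segment->terms[pos].first; }
};

// Orders the queue so the smallest term (then the smallest base) is on top.
struct CursorGreater {
    bool operator()(const Cursor& a, const Cursor& b) const {
        if (a.term() != b.term()) return a.term() > b.term();
        return a.base > b.base;
    }
};

std::vector<TermEntry> mergeTerms(const std::vector<Segment>& segments) {
    std::priority_queue<Cursor, std::vector<Cursor>, CursorGreater> queue;
    std::int32_t base = 0;
    for (const Segment& seg : segments) {
        if (!seg.terms.empty()) queue.push({&seg, 0, base});
        base += seg.docCount;
    }

    std::vector<TermEntry> merged;
    while (!queue.empty()) {
        // Steps 1 and 2: pop every cursor whose current term equals the smallest term.
        const std::string term = queue.top().term();
        std::vector<Cursor> match;
        while (!queue.empty() && queue.top().term() == term) {
            match.push_back(queue.top());
            queue.pop();
        }
        // Step 3: concatenate the postings, shifting each by its segment's base.
        Postings postings;
        for (const Cursor& c : match) {
            for (std::int32_t doc : c.segment->terms[c.pos].second) {
                postings.push_back(c.base + doc);
            }
        }
        merged.emplace_back(term, std::move(postings));
        // Step 4: advance each cursor and re-queue it if terms remain.
        for (Cursor c : match) {
            if (++c.pos < c.segment->terms.size()) queue.push(c);
        }
        // Step 5: loop until the queue is empty.
    }
    return merged;
}
```

Each term is pushed and popped once, so the cost is proportional to the total number of terms times the logarithm of the number of segments, plus the size of the postings copied.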
