Lucene Study Notes, Part 4: Analysis of the Lucene Indexing Process (4)

6. Closing the IndexWriter object

Code:

writer.close();

--> IndexWriter.closeInternal(boolean)

      --> (1) Write the in-memory index information to disk: flush(waitForMerges, true, true); 
      --> (2) Merge segments: mergeScheduler.merge(this);

Segment merging will be discussed in a later chapter; here we only discuss the process of writing the index information from memory to disk.
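
For context, a minimal caller-side sketch of the code that ends up on this path; the directory, analyzer and field contents are illustrative only, assuming the Lucene 2.9/3.0-era API analyzed in this series.

import java.io.File; 
import org.apache.lucene.analysis.standard.StandardAnalyzer; 
import org.apache.lucene.document.Document; 
import org.apache.lucene.document.Field; 
import org.apache.lucene.index.IndexWriter; 
import org.apache.lucene.store.FSDirectory; 
import org.apache.lucene.util.Version; 

public class CloseFlushDemo { 
    public static void main(String[] args) throws Exception { 
        IndexWriter writer = new IndexWriter( 
                FSDirectory.open(new File("testindex")), 
                new StandardAnalyzer(Version.LUCENE_30), 
                true, IndexWriter.MaxFieldLength.LIMITED); 

        Document doc = new Document(); 
        doc.add(new Field("contents", "hello lucene", Field.Store.YES, Field.Index.ANALYZED)); 
        writer.addDocument(doc); 

        //close() goes through closeInternal(): it first flushes the in-memory index to disk 
        //(the subject of this section) and then asks the merge scheduler to merge segments. 
        writer.close(); 
    } 
}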

Code:

IndexWriter.flush(boolean triggerMerge, boolean flushDocStores, boolean flushDeletes)

--> IndexWriter.doFlush(boolean flushDocStores, boolean flushDeletes)

      --> IndexWriter.doFlushInternal(boolean flushDocStores, boolean flushDeletes)

Writing the index to disk involves the following steps:

  • Obtain the name of the segment to write: String segment = docWriter.getSegment();
  • DocumentsWriter writes the buffered information into the segment: docWriter.flush(flushDocStores);
  • Create the new segment info object: newSegment = new SegmentInfo(segment, flushedDocCount, directory, false, true, docStoreOffset, docStoreSegment, docStoreIsCompoundFile, docWriter.hasProx());
  • Prepare the deletions: docWriter.pushDeletes();
  • Create the compound (cfs) file: docWriter.createCompoundFile(segment);
  • Apply the deletions: applyDeletes();

6.1 Obtaining the name of the segment to write

Code:

SegmentInfo newSegment = null;

final int numDocs = docWriter.getNumDocsInRAM();//total number of documents

String docStoreSegment = docWriter.getDocStoreSegment();//segment name to which the stored fields and term vectors are written, "_0"   

int docStoreOffset = docWriter.getDocStoreOffset();//offset within that segment at which this flush's stored fields and term vectors start

String segment = docWriter.getSegment();//segment name, "_0"

As described in detail in the chapter on Lucene's index file formats, the stored fields and term vectors may be kept in a segment (the doc store segment) different from the one holding the inverted fields; docStoreSegment and docStoreOffset identify that segment and the offset at which this flush's documents start in it.

6.2 Writing the buffered contents into the segment

Code:

flushedDocCount = docWriter.flush(flushDocStores);

This process consists of the following two stages:

  • Close the stored fields and term vectors, following the basic indexing chain
  • Write the indexing results into the segment, following the structure of the basic indexing chain

6.2.1 Closing the stored fields and term vectors along the basic indexing chain

The code is:

closeDocStore();

flushState.numDocsInStore = 0;

This mainly walks the basic indexing chain and closes the stored field and term vector doc stores:

  • consumer(DocFieldProcessor).closeDocStore(flushState);
    • consumer(DocInverter).closeDocStore(state);
      • consumer(TermsHash).closeDocStore(state);
        • consumer(FreqProxTermsWriter).closeDocStore(state);
        • if (nextTermsHash != null) nextTermsHash.closeDocStore(state);
          • consumer(TermVectorsTermsWriter).closeDocStore(state);
      • endConsumer(NormsWriter).closeDocStore(state);
    • fieldsWriter(StoredFieldsWriter).closeDocStore(state);

The two closeDocStore implementations that do real work are the following:

  • Closing the term vectors: TermVectorsTermsWriter.closeDocStore(SegmentWriteState)

void closeDocStore(final SegmentWriteState state) throws IOException {

        if (tvx != null) { 
            //write fill entries into tvd for documents that store no term vectors; even such documents keep a slot in tvx and tvd 
            fill(state.numDocsInStore - docWriter.getDocStoreOffset()); 
            //close the output streams of the tvx, tvf and tvd files 
            tvx.close(); 
            tvf.close(); 
            tvd.close(); 
            tvx = null; 
            //record the written file names, so that they can later be combined into a single cfs file 
            state.flushedFiles.add(state.docStoreSegmentName + "." + IndexFileNames.VECTORS_INDEX_EXTENSION); 
            state.flushedFiles.add(state.docStoreSegmentName + "." + IndexFileNames.VECTORS_FIELDS_EXTENSION); 
            state.flushedFiles.add(state.docStoreSegmentName + "." + IndexFileNames.VECTORS_DOCUMENTS_EXTENSION); 
            //remove them from DocumentsWriter's openFiles member; they may later be deleted by IndexFileDeleter 
            docWriter.removeOpenFile(state.docStoreSegmentName + "." + IndexFileNames.VECTORS_INDEX_EXTENSION); 
            docWriter.removeOpenFile(state.docStoreSegmentName + "." + IndexFileNames.VECTORS_FIELDS_EXTENSION); 
            docWriter.removeOpenFile(state.docStoreSegmentName + "." + IndexFileNames.VECTORS_DOCUMENTS_EXTENSION); 
            lastDocID = 0; 
        }     
}
  • Closing the stored fields: StoredFieldsWriter.closeDocStore(SegmentWriteState)

public void closeDocStore(SegmentWriteState state) throws IOException {

    //close the fdx and fdt output streams

    fieldsWriter.close(); 
    --> fieldsStream.close(); 
    --> indexStream.close(); 
    fieldsWriter = null; 
    lastDocID = 0;

    //record the written file names 
    state.flushedFiles.add(state.docStoreSegmentName + "." + IndexFileNames.FIELDS_EXTENSION); 
    state.flushedFiles.add(state.docStoreSegmentName + "." + IndexFileNames.FIELDS_INDEX_EXTENSION); 
    state.docWriter.removeOpenFile(state.docStoreSegmentName + "." + IndexFileNames.FIELDS_EXTENSION); 
    state.docWriter.removeOpenFile(state.docStoreSegmentName + "." + IndexFileNames.FIELDS_INDEX_EXTENSION); 
}


6.2.2 Writing the indexing results into the segment along the basic indexing chain

The code is:

consumer(DocFieldProcessor).flush(threads, flushState);

    //reclaim fieldHash so it can be reused in the next round of indexing; for efficiency, objects on the indexing chain are reused

    Map<DocFieldConsumerPerThread, Collection<DocFieldConsumerPerField>> childThreadsAndFields = new HashMap<DocFieldConsumerPerThread, Collection<DocFieldConsumerPerField>>(); 
    for ( DocConsumerPerThread thread : threads) { 
        DocFieldProcessorPerThread perThread = (DocFieldProcessorPerThread) thread; 
        childThreadsAndFields.put(perThread.consumer, perThread.fields()); 
        perThread.trimFields(state); 
    }

    //write the stored fields

    --> fieldsWriter(StoredFieldsWriter).flush(state);

    //write the indexed (inverted) fields

    --> consumer(DocInverter).flush(childThreadsAndFields, state);

    //write the field metadata (fnm) and record the written file name, for later cfs generation

    --> final String fileName = state.segmentFileName(IndexFileNames.FIELD_INFOS_EXTENSION);

    --> fieldInfos.write(state.directory, fileName);

    --> state.flushedFiles.add(fileName);

This process also follows the basic indexing chain:

  • consumer(DocFieldProcessor).flush(…);
    • consumer(DocInverter).flush(…);
      • consumer(TermsHash).flush(…);
        • consumer(FreqProxTermsWriter).flush(…);
        • if (nextTermsHash != null) nextTermsHash.flush(…);
          • consumer(TermVectorsTermsWriter).flush(…);
      • endConsumer(NormsWriter).flush(…);
    • fieldsWriter(StoredFieldsWriter).flush(…);

6.2.2.1 Writing the stored fields

The code is:

StoredFieldsWriter.flush(SegmentWriteState state) { 
    if (state.numDocsInStore > 0) { 
      initFieldsWriter(); 
      fill(state.numDocsInStore - docWriter.getDocStoreOffset()); 
    } 
    if (fieldsWriter != null) 
      fieldsWriter.flush(); 
  }

As the code shows, this is meant to write the fdx and fdt files, but closeDocStore above has already written them, set state.numDocsInStore to zero and fieldsWriter to null, so in fact nothing happens here.

6.2.2.2 Writing the indexed fields

The code is:

DocInverter.flush(Map<DocFieldConsumerPerThread, Collection<DocFieldConsumerPerField>>, SegmentWriteState)

    //write the postings (inverted lists) and term vectors

    --> consumer(TermsHash).flush(childThreadsAndFields, state);

    //write the normalization factors (norms)

    --> endConsumer(NormsWriter).flush(endChildThreadsAndFields, state);

6.2.2.2.1 Writing the postings and term vectors

The code is:

TermsHash.flush(Map<InvertedDocConsumerPerThread, Collection<InvertedDocConsumerPerField>>, SegmentWriteState)

    //write the postings

    --> consumer(FreqProxTermsWriter).flush(childThreadsAndFields, state);

   //reclaim the RawPostingLists

    --> shrinkFreePostings(threadsAndFields, state);

    //write the term vectors

    --> if (nextTermsHash != null) nextTermsHash.flush(nextThreadsAndFields, state);

          --> consumer(TermVectorsTermsWriter).flush(childThreadsAndFields, state);

6.2.2.2.1.1 Writing the postings

The code is:

FreqProxTermsWriter.flush(Map<TermsHashConsumerPerThread, Collection<TermsHashConsumerPerField>>, SegmentWriteState)

    (a) Sort all fields by name so that same-named fields can be processed together

    Collections.sort(allFields);

    final int numAllFields = allFields.size();

    (b) Create the postings writer objects

    final FormatPostingsFieldsConsumer consumer = new FormatPostingsFieldsWriter(state, fieldInfos);

    int start = 0;

    (c) For each field

    while(start < numAllFields) {

        (c-1) Collect all fields with the same name

        final FieldInfo fieldInfo = allFields.get(start).fieldInfo;

        final String fieldName = fieldInfo.name;

        int end = start+1;

        while(end < numAllFields && allFields.get(end).fieldInfo.name.equals(fieldName))

            end++;

        FreqProxTermsWriterPerField[] fields = new FreqProxTermsWriterPerField[end-start];

        for(int i=start;i<end;i++) {

            fields[i-start] = allFields.get(i);

            fieldInfo.storePayloads |= fields[i-start].hasPayloads;

        }

        (c-2) Append the postings of the same-named fields to the file

        appendPostings(fields, consumer);

       (c-3) Release the space

        for(int i=0;i<fields.length;i++) {

            TermsHashPerField perField = fields[i].termsHashPerField;

            int numPostings = perField.numPostings;

            perField.reset();

            perField.shrinkHash(numPostings);

            fields[i].reset();

        }

        start = end;

    }

    (d) Close the postings writer objects

    consumer.finish();

(b) Creating the postings writer objects

The code is:

public FormatPostingsFieldsWriter(SegmentWriteState state, FieldInfos fieldInfos) throws IOException { 
    dir = state.directory; 
    segment = state.segmentName; 
    totalNumDocs = state.numDocs; 
    this.fieldInfos = fieldInfos; 
    //used to write tii and tis 
    termsOut = new TermInfosWriter(dir, segment, fieldInfos, state.termIndexInterval); 
    //used to write the skip lists in freq and prox  
    skipListWriter = new DefaultSkipListWriter(termsOut.skipInterval, termsOut.maxSkipLevels, totalNumDocs, null, null); 
    //record the written file names 
    state.flushedFiles.add(state.segmentFileName(IndexFileNames.TERMS_EXTENSION)); 
    state.flushedFiles.add(state.segmentFileName(IndexFileNames.TERMS_INDEX_EXTENSION));  
    //with the two writers above, write to the segment in the proper format 
    termsWriter = new FormatPostingsTermsWriter(state, this); 
}

The object structure is as follows:

consumer    FormatPostingsFieldsWriter  (id=119)  //used to process one field 
    dir    SimpleFSDirectory  (id=126)   //destination index directory 
    totalNumDocs    8   //total number of documents 
    fieldInfos    FieldInfos  (id=70)  //field metadata   
    segment    "_0"   //segment name 
    skipListWriter    DefaultSkipListWriter  (id=133)  //writer for the skip lists in freq and prox   
    termsOut    TermInfosWriter  (id=125)  //writer for the tii and tis files 
    termsWriter    FormatPostingsTermsWriter  (id=135)  //used to add terms (Term) 
        currentTerm    null    
        currentTermStart    0    
        fieldInfo    null    
        freqStart    0    
        proxStart    0    
        termBuffer    null    
        termsOut    TermInfosWriter  (id=125)    
        docsWriter    FormatPostingsDocsWriter  (id=139)  //used to write the docID and freq information of this term 
            df    0    
            fieldInfo    null    
            freqStart    0    
            lastDocID    0    
            omitTermFreqAndPositions    false    
            out    SimpleFSDirectory$SimpleFSIndexOutput  (id=144)    
            skipInterval    16    
            skipListWriter    DefaultSkipListWriter  (id=133)    
            storePayloads    false    
            termInfo    TermInfo  (id=151)    
            totalNumDocs    8     
            posWriter    FormatPostingsPositionsWriter  (id=146)  //used to write the positions of this term within each document   
                lastPayloadLength    -1    
                lastPosition    0    
                omitTermFreqAndPositions    false    
                out    SimpleFSDirectory$SimpleFSIndexOutput  (id=157)    
                parent    FormatPostingsDocsWriter  (id=139)    
                storePayloads    false   
  • FormatPostingsFieldsWriter.addField(FieldInfo field) adds an indexed field and returns a FormatPostingsTermsConsumer used to add its terms
  • FormatPostingsTermsConsumer.addTerm(char[] text, int start) adds a term and returns a FormatPostingsDocsConsumer used to add its docID/freq entries
  • FormatPostingsDocsConsumer.addDoc(int docID, int termDocFreq) adds a docID and term frequency and returns a FormatPostingsPositionsConsumer used to add prox information
  • FormatPostingsPositionsConsumer.addPosition(int position, byte[] payload, int payloadOffset, int payloadLength) adds the prox (position) information; a short usage sketch of the whole chain follows below
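
The following is a minimal sketch of how appendPostings (analyzed next) drives this consumer chain for one field, one term and one document. It is an illustration only, not Lucene source: the field, term text and postings are invented, and since these classes are package-private such code would have to live inside org.apache.lucene.index.

// 'consumer' is the FormatPostingsFieldsWriter created in (b); 'fieldInfo' describes the field being flushed. 
final FormatPostingsTermsConsumer termsConsumer = consumer.addField(fieldInfo); 

// term text is handed over as a char[] terminated by the 0xffff end marker used in the char block pool 
final char[] termText = "lucene\uffff".toCharArray(); 
final FormatPostingsDocsConsumer docConsumer = termsConsumer.addTerm(termText, 0); 

// document 3 contains the term twice, at positions 0 and 5, with no payloads 
final FormatPostingsPositionsConsumer posConsumer = docConsumer.addDoc(3, 2); 
posConsumer.addPosition(0, null, 0, 0); 
posConsumer.addPosition(5, null, 0, 0); 
posConsumer.finish();      //done with this document 

docConsumer.finish();      //done with this term: the skip data and the tii/tis entry are written here 
termsConsumer.finish();    //done with this field 
consumer.finish();         //closes the term dictionary writers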

(c-2) Appending the postings of the same-named fields to the file

The code is:

FreqProxTermsWriter.appendPostings(FreqProxTermsWriterPerField[], FormatPostingsFieldsConsumer) {

    int numFields = fields.length;

    final FreqProxFieldMergeState[] mergeStates = new FreqProxFieldMergeState[numFields];

    for(int i=0;i<numFields;i++) {

      FreqProxFieldMergeState fms = mergeStates[i] = new FreqProxFieldMergeState(fields[i]);

      boolean result = fms.nextTerm(); //for every field, advance to its first term (Term)

    }

    (1) Add the field. Although there are several per-thread fields, they all share the same name, so the first one's information suffices. The returned object is used to add the terms of this field.

    final FormatPostingsTermsConsumer termsConsumer = consumer.addField(fields[0].fieldInfo);

    FreqProxFieldMergeState[] termStates = new FreqProxFieldMergeState[numFields];

    final boolean currentFieldOmitTermFreqAndPositions = fields[0].fieldInfo.omitTermFreqAndPositions;

    (2) This while loop iterates over every field that still has unprocessed terms, handling their terms in dictionary order. Once all terms of a field have been processed, numFields is decremented and the field is removed from the mergeStates array. The loop only exits when all terms of all fields have been processed.

    while(numFields > 0) {

       (2-1) Find the next term across all fields in dictionary order. Several same-named fields may contain the same term, so all numFields fields are scanned for their next term; numToMerge is the number of fields that contain this term.

      termStates[0] = mergeStates[0];

      int numToMerge = 1;

      for(int i=1;i<numFields;i++) {

        final char[] text = mergeStates[i].text;

        final int textOffset = mergeStates[i].textOffset;

        final int cmp = compareText(text, textOffset, termStates[0].text, termStates[0].textOffset);

        if (cmp < 0) {

          termStates[0] = mergeStates[i];

          numToMerge = 1;

        } else if (cmp == 0)

          termStates[numToMerge++] = mergeStates[i];

      }

      (2-2) Add the term; the returned FormatPostingsDocsConsumer is used to add document numbers (doc IDs) and term frequencies (freq)

      final FormatPostingsDocsConsumer docConsumer = termsConsumer.addTerm(termStates[0].text, termStates[0].textOffset);

      (2-3) numToMerge fields contain this term, and in each of them the term has a list of documents containing it. This loop iterates over all fields containing the term and adds the doc IDs and term frequencies in ascending doc ID order. When all doc IDs for the term in one field have been processed, numToMerge is decremented and the field is removed from the termStates array. The loop ends when all doc IDs of all fields containing this term have been processed.

      while(numToMerge > 0) {

        (2-3-1) Find the smallest doc ID

        FreqProxFieldMergeState minState = termStates[0];

        for(int i=1;i<numToMerge;i++)

          if (termStates[i].docID < minState.docID)

            minState = termStates[i];

        final int termDocFreq = minState.termFreq;

        (2-3-2) Add the doc ID and term frequency, building the skip list along the way; the returned FormatPostingsPositionsConsumer is used to add position (prox) information

        final FormatPostingsPositionsConsumer posConsumer = docConsumer.addDoc(minState.docID, termDocFreq);

        //ByteSliceReader reads the prox information from the byte pool.

        final ByteSliceReader prox = minState.prox;

        if (!currentFieldOmitTermFreqAndPositions) {

          int position = 0;

          (2-3-3) This loop adds the position information of the term within the current document

          for(int j=0;j<termDocFreq;j++) {

            final int code = prox.readVInt();

            position += code >> 1;

            final int payloadLength;

            // if this position has a payload, read it from the byte pool; otherwise the payload length is zero.

            if ((code & 1) != 0) {

              payloadLength = prox.readVInt();

              if (payloadBuffer == null || payloadBuffer.length < payloadLength)

                payloadBuffer = new byte[payloadLength];

              prox.readBytes(payloadBuffer, 0, payloadLength);

            } else

              payloadLength = 0;

            //add the position (prox) information

            posConsumer.addPosition(position, payloadBuffer, 0, payloadLength);

          }

          posConsumer.finish();

        }

       (2-3-4) Check the exit condition: the field just selected advances to its next doc ID. If there is none, all documents containing this term in this field have been processed, so the field is removed from termStates and numToMerge is decremented. The field then advances to its next term, so that when the loop returns to (2) it participates with that next term. If there is no next term, all terms of this field are done: the field is removed from mergeStates and numFields is decremented; when numFields reaches 0, loop (2) ends.

        if (!minState.nextDoc()) {//fetch the next docID

          //no more docIDs for this term in this field: remove the field from termStates and decrement numToMerge.

          int upto = 0;

          for(int i=0;i<numToMerge;i++)

            if (termStates[i] != minState)

              termStates[upto++] = termStates[i];

          numToMerge--;

          //this field then advances to its next term, to take part in the merge of the next term back at loop (2)

          if (!minState.nextTerm()) {

            //if this field has no more terms, remove it from mergeStates and decrement numFields.

            upto = 0;

            for(int i=0;i<numFields;i++)

              if (mergeStates[i] != minState)

                mergeStates[upto++] = mergeStates[i];

            numFields--;

          }

        }

      }

      (2-4) After the steps above, the docID and freq information has been written to the segment files, but the skip list data has only been buffered in the skip buffer; here it is actually written to the file. The term dictionary entry (tii, tis) is also written at this point.

      docConsumer(FormatPostingsDocsWriter).finish();

    }

    termsConsumer.finish();

  }

(2-3-4) The code for fetching the next document number is as follows:

public boolean nextDoc() {//how the next docID is obtained

  if (freq.eof()) {//the freq information in the byte pool has been exhausted

    if (p.lastDocCode != -1) {//because of the buffering described earlier, the PostingList still holds the docID and freq of the last document, so return that last document

      docID = p.lastDocID;

      if (!field.omitTermFreqAndPositions)

        termFreq = p.docFreq;

      p.lastDocCode = -1;

      return true;

    } else

      return false;//no more documents

  }

  final int code = freq.readVInt();//the freq information in the byte pool is not yet exhausted

  if (field.omitTermFreqAndPositions)

    docID += code;

  else {

    //read the docID and term frequency.

    docID += code >>> 1;

    if ((code & 1) != 0)

      termFreq = 1;

    else

      termFreq = freq.readVInt();

  }

  return true;

}

(2-3-2) The code for adding the docID and term frequency is as follows:

FormatPostingsPositionsConsumer FormatPostingsDocsWriter.addDoc(int docID, int termDocFreq) {

    final int delta = docID - lastDocID;

    //whenever the document count reaches a multiple of skipInterval, buffer a skip entry.

    if ((++df % skipInterval) == 0) {

      skipListWriter.setSkipData(lastDocID, storePayloads, posWriter.lastPayloadLength);

      skipListWriter.bufferSkip(df);

   }

   lastDocID = docID;

   if (omitTermFreqAndPositions)

     out.writeVInt(delta);

   else if (1 == termDocFreq)

     out.writeVInt((delta<<1) | 1);

   else {

     //write the docID and term frequency.

     out.writeVInt(delta<<1);

     out.writeVInt(termDocFreq);

   }

   return posWriter;

}
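
To make the encoding rule above concrete, here is a small standalone sketch (not Lucene code; the postings are invented) that prints the integers addDoc would pass to writeVInt: the docID delta is shifted left by one, and the low bit is set when the term frequency is exactly 1 so that the freq VInt can be omitted.

public class DocFreqEncodingDemo { 
    public static void main(String[] args) { 
        int[][] postings = { {3, 1}, {7, 2}, {12, 1} };   //pairs of {docID, termDocFreq} for one term 
        int lastDocID = 0; 
        for (int[] p : postings) { 
            int delta = p[0] - lastDocID;                 //docIDs are stored as deltas 
            lastDocID = p[0]; 
            if (p[1] == 1) 
                System.out.println("writeVInt(" + ((delta << 1) | 1) + ")"); 
            else 
                System.out.println("writeVInt(" + (delta << 1) + "), writeVInt(" + p[1] + ")"); 
        } 
        //prints: writeVInt(7) / writeVInt(8), writeVInt(2) / writeVInt(11) 
    } 
}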

(2-3-3) Adding the position information:

FormatPostingsPositionsWriter.addPosition(int position, byte[] payload, int payloadOffset, int payloadLength) {

    final int delta = position - lastPosition;

    lastPosition = position;

    if (storePayloads) {

        //store the position and payload information

        if (payloadLength != lastPayloadLength) {

            lastPayloadLength = payloadLength;

            out.writeVInt((delta<<1)|1);

            out.writeVInt(payloadLength);

        } else

            out.writeVInt(delta << 1);

        if (payloadLength > 0)

            out.writeBytes(payload, payloadLength);

    } else

        out.writeVInt(delta);

}
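
Analogously, a standalone sketch (not Lucene code; the positions and payload lengths are invented) of the position encoding when payloads are stored: the position delta is shifted left by one, and the low bit flags a change of payload length, which is then written as an extra VInt.

public class PositionEncodingDemo { 
    public static void main(String[] args) { 
        int[] positions      = { 2, 5, 9 };               //positions of one term in one document 
        int[] payloadLengths = { 3, 3, 0 };               //payload length at each position 
        int lastPosition = 0, lastPayloadLength = -1; 
        for (int i = 0; i < positions.length; i++) { 
            int delta = positions[i] - lastPosition;      //positions are stored as deltas 
            lastPosition = positions[i]; 
            if (payloadLengths[i] != lastPayloadLength) { 
                lastPayloadLength = payloadLengths[i]; 
                System.out.println("writeVInt(" + ((delta << 1) | 1) + "), writeVInt(" + payloadLengths[i] + ")"); 
            } else 
                System.out.println("writeVInt(" + (delta << 1) + ")"); 
            //the payload bytes themselves would follow whenever payloadLengths[i] > 0 
        } 
        //prints: writeVInt(5), writeVInt(3) / writeVInt(6) / writeVInt(9), writeVInt(0) 
    } 
}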

(2-4) Writing the skip lists and the term dictionary (tii, tis) to the files

FormatPostingsDocsWriter.finish() {

    //write the buffered skip lists to the file

    long skipPointer = skipListWriter.writeSkip(out);

    if (df > 0) {

      //write the term dictionary entry (TermInfo) to the tii and tis files

      parent.termsOut(TermInfosWriter).add(fieldInfo.number, utf8.result, utf8.length, termInfo);

    }

  }

Writing the buffered skip lists to the file:

DefaultSkipListWriter(MultiLevelSkipListWriter).writeSkip(IndexOutput)  {

    long skipPointer = output.getFilePointer();

    if (skipBuffer == null || skipBuffer.length == 0) return skipPointer;

    //as analyzed in the index file format chapter, higher levels come first and lower levels after; every level except the lowest is prefixed with its length.

    for (int level = numberOfSkipLevels - 1; level > 0; level--) {

      long length = skipBuffer[level].getFilePointer();

      if (length > 0) {

        output.writeVLong(length);

        skipBuffer[level].writeTo(output);

      }

    }

    //write the lowest level

    skipBuffer[0].writeTo(output);

    return skipPointer;

  }

Writing the term dictionary (TermInfo) to the tii and tis files:

  • The tii file is a skip-list-like index over the tis file: every indexInterval-th term of the tis file is copied into the tii file so that a term can be located quickly (with the default indexInterval of 128, roughly every 128th term gets a tii entry).
  • TermInfosWriter therefore has a member variable other, itself of type TermInfosWriter, and a member variable isIndex indicating whether the object writes the tii file or the tis file.
  • If a TermInfosWriter object has isIndex=false, it writes the tis file, and its other points to the TermInfosWriter that writes the tii file.
  • If a TermInfosWriter object has isIndex=true, it writes the tii file, and its other points to the TermInfosWriter that writes the tis file.

TermInfosWriter.add (int fieldNumber, byte[] termBytes, int termBytesLength, TermInfo ti) {

    //if the term count is a multiple of indexInterval, the previous term is also written to the tii file

    if (!isIndex && size % indexInterval == 0)

      other.add(lastFieldNumber, lastTermBytes, lastTermBytesLength, lastTi);

    //write the term to the tis file

    writeTerm(fieldNumber, termBytes, termBytesLength);

    output.writeVInt(ti.docFreq);                       // write doc freq

    output.writeVLong(ti.freqPointer - lastTi.freqPointer); // write pointers

    output.writeVLong(ti.proxPointer - lastTi.proxPointer);

    if (ti.docFreq >= skipInterval) {

      output.writeVInt(ti.skipOffset);

    }

    if (isIndex) {

      output.writeVLong(other.output.getFilePointer() - lastIndexPointer);

      lastIndexPointer = other.output.getFilePointer(); // write pointer

    }

    lastFieldNumber = fieldNumber;

    lastTi.set(ti);

    size++;

  }

6.2.2.2.1.2 Writing the term vector information

The code is:

TermVectorsTermsWriter.flush (Map<TermsHashConsumerPerThread, Collection<TermsHashConsumerPerField>> 
                                              threadsAndFields, final SegmentWriteState state) {

    if (tvx != null) {

      if (state.numDocsInStore > 0)

        fill(state.numDocsInStore - docWriter.getDocStoreOffset());

      tvx.flush();

      tvd.flush();

      tvf.flush();

    }

    for (Map.Entry<TermsHashConsumerPerThread, Collection<TermsHashConsumerPerField>> entry : 
                                                                      threadsAndFields.entrySet()) {

      for (final TermsHashConsumerPerField field : entry.getValue() ) {

        TermVectorsTermsWriterPerField perField = (TermVectorsTermsWriterPerField) field;

        perField.termsHashPerField.reset();

        perField.shrinkHash();

      }

      TermVectorsTermsWriterPerThread perThread = (TermVectorsTermsWriterPerThread) entry.getKey();

      perThread.termsHashPerThread.reset(true);

    }

  }

As the code shows, this is meant to write the tvx, tvd and tvf files, but closeDocStore above has already written them and set tvx to null, so nothing is actually written here; it merely clears the postingsHash so that the objects can be reused in the next indexing round.

6.2.2.2.2 Writing the normalization factors (norms)

The code is:

NormsWriter.flush(Map<InvertedDocEndConsumerPerThread, Collection<InvertedDocEndConsumerPerField>> threadsAndFields,

                           SegmentWriteState state) {

    final Map<FieldInfo, List<NormsWriterPerField>> byField = new HashMap<FieldInfo, List<NormsWriterPerField>>();

    for (final Map.Entry<InvertedDocEndConsumerPerThread, Collection<InvertedDocEndConsumerPerField>> entry : 
                                                              threadsAndFields.entrySet()) {

     //iterate over all fields, grouping the NormsWriterPerField objects of same-named fields into the same list.

      final Collection<InvertedDocEndConsumerPerField> fields = entry.getValue();

      final Iterator<InvertedDocEndConsumerPerField> fieldsIt = fields.iterator();

      while (fieldsIt.hasNext()) {

        final NormsWriterPerField perField = (NormsWriterPerField) fieldsIt.next();

        List<NormsWriterPerField> l = byField.get(perField.fieldInfo);

        if (l == null) {

            l = new ArrayList<NormsWriterPerField>();

            byField.put(perField.fieldInfo, l);

        }

        l.add(perField);

    }

    //record the written file name so the cfs file can be generated later.

    final String normsFileName = state.segmentName + "." + IndexFileNames.NORMS_EXTENSION;

    state.flushedFiles.add(normsFileName);

    IndexOutput normsOut = state.directory.createOutput(normsFileName);

    try {

      //write the nrm file header

      normsOut.writeBytes(SegmentMerger.NORMS_HEADER, 0, SegmentMerger.NORMS_HEADER.length);

      final int numField = fieldInfos.size();

      int normCount = 0;

      //process each field

      for(int fieldNumber=0;fieldNumber<numField;fieldNumber++) {

        final FieldInfo fieldInfo = fieldInfos.fieldInfo(fieldNumber);

        //get the list of same-named fields

        List<NormsWriterPerField> toMerge = byField.get(fieldInfo);

        int upto = 0;

        if (toMerge != null) {

          final int numFields = toMerge.size();

          normCount++;

          final NormsWriterPerField[] fields = new NormsWriterPerField[numFields];

          int[] uptos = new int[numFields];

          for(int j=0;j<numFields;j++)

            fields[j] = toMerge.get(j);

          int numLeft = numFields;

          //process the multiple same-named fields

          while(numLeft > 0) {

            //find the smallest docID across all same-named fields

            int minLoc = 0;

            int minDocID = fields[0].docIDs[uptos[0]];

            for(int j=1;j<numLeft;j++) {

              final int docID = fields[j].docIDs[uptos[j]];

              if (docID < minDocID) {

                minDocID = docID;

                minLoc = j;

              }

            }

            // in the nrm file every document has a slot; documents with no norm set get the default value

            for (;upto<minDocID;upto++)

              normsOut.writeByte(defaultNorm);

            //write the current norm value

            normsOut.writeByte(fields[minLoc].norms[uptos[minLoc]]);

            (uptos[minLoc])++;

            upto++;

            //when the documents of the current per-field writer are exhausted, decrement numLeft; the loop exits when it reaches zero.

            if (uptos[minLoc] == fields[minLoc].upto) {

              fields[minLoc].reset();

              if (minLoc != numLeft-1) {

                fields[minLoc] = fields[numLeft-1];

                uptos[minLoc] = uptos[numLeft-1];

              }

              numLeft--;

            }

          }

          // write the default norm for all remaining documents that have no norm set.

          for(;upto<state.numDocs;upto++)

            normsOut.writeByte(defaultNorm);

        } else if (fieldInfo.isIndexed && !fieldInfo.omitNorms) {

          normCount++;

          // Fill entire field with default norm:

          for(;upto<state.numDocs;upto++)

            normsOut.writeByte(defaultNorm);

        }

      }

    } finally {

      normsOut.close();

    }

  }

6.2.2.3 Writing the field metadata

The code is:

FieldInfos.write(IndexOutput) {

    output.writeVInt(CURRENT_FORMAT);

    output.writeVInt(size());

    for (int i = 0; i < size(); i++) {

      FieldInfo fi = fieldInfo(i);

      byte bits = 0x0;

      if (fi.isIndexed) bits |= IS_INDEXED;

      if (fi.storeTermVector) bits |= STORE_TERMVECTOR;

      if (fi.storePositionWithTermVector) bits |= STORE_POSITIONS_WITH_TERMVECTOR;

      if (fi.storeOffsetWithTermVector) bits |= STORE_OFFSET_WITH_TERMVECTOR;

      if (fi.omitNorms) bits |= OMIT_NORMS;

      if (fi.storePayloads) bits |= STORE_PAYLOADS;

      if (fi.omitTermFreqAndPositions) bits |= OMIT_TERM_FREQ_AND_POSITIONS;

      output.writeString(fi.name);

      output.writeByte(bits);

    }

}

This essentially writes the field information out in the fnm file format described earlier.

6.3 Creating the new segment info object

Code:

newSegment = new SegmentInfo(segment, flushedDocCount, directory, false, true, docStoreOffset, docStoreSegment, docStoreIsCompoundFile, docWriter.hasProx());

segmentInfos.add(newSegment);

6.4 Preparing the deletions

Code:

docWriter.pushDeletes();

    --> deletesFlushed.update(deletesInRAM);

Here everything in deletesInRAM is added to deletesFlushed, and deletesInRAM is cleared. The reason for this was explained above.

6.5 Creating the compound (cfs) file

Code:

docWriter.createCompoundFile(segment);

newSegment.setUseCompoundFile(true);

The code is:

DocumentsWriter.createCompoundFile(String segment) {

    CompoundFileWriter cfsWriter = new CompoundFileWriter(directory, segment + "." + IndexFileNames.COMPOUND_FILE_EXTENSION);

    //add all the file names recorded above to the compound file writer.

    for (final String flushedFile : flushState.flushedFiles)

      cfsWriter.addFile(flushedFile);

    cfsWriter.close();

  }

6.6 Applying the deletions

Code:

applyDeletes();

The code is:

boolean applyDeletes(SegmentInfos infos) {

  if (!hasDeletes())

    return false;

  final int infosEnd = infos.size();

  int docStart = 0;

  boolean any = false;

  for (int i = 0; i < infosEnd; i++) {

    assert infos.info(i).dir == directory;

    SegmentReader reader = writer.readerPool.get(infos.info(i), false);

    try {

      any |= applyDeletes(reader, docStart);

      docStart += reader.maxDoc();

    } finally {

      writer.readerPool.release(reader);

    }

  }

  deletesFlushed.clear();

  return any;

}

  • In Lucene, documents can be deleted through a reader or through the writer, but in the end the deletion is always carried out by a reader.
  • A reader deletes in one of three ways (a usage sketch follows this list):
    • By term: delete all documents containing the term.
    • By document number.
    • By query: delete all documents matching the query.
  • All three ultimately come down to deletion by document number, i.e. to writing the .del file.
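
A minimal caller-side sketch of the three deletion styles (illustration only; the field names, term and query are invented, assuming the Lucene 2.9/3.0-era API):

import org.apache.lucene.index.IndexReader; 
import org.apache.lucene.index.IndexWriter; 
import org.apache.lucene.index.Term; 
import org.apache.lucene.search.TermQuery; 
import org.apache.lucene.store.Directory; 

public class DeleteStylesDemo { 
    //'dir' is an existing index directory, 'writer' an IndexWriter opened on it. 
    static void deleteExamples(Directory dir, IndexWriter writer) throws Exception { 
        //by term: buffered in deletesInRAM.terms, later applied through TermDocs as shown below 
        writer.deleteDocuments(new Term("id", "doc-42")); 

        //by query: buffered in deletesInRAM.queries, later applied through a Scorer over each SegmentReader 
        writer.deleteDocuments(new TermQuery(new Term("contents", "obsolete"))); 
        writer.close();   //flushes, and releases the write lock so the reader below may delete 

        //by document number: only possible through a reader, since doc IDs are internal to the index 
        IndexReader reader = IndexReader.open(dir, false);   //readOnly=false, so deletions are allowed 
        reader.deleteDocument(0); 
        reader.close(); 
    } 
}

The per-reader code that applies all three kinds of buffered deletions is: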

private final synchronized boolean applyDeletes(IndexReader reader, int docIDStart)

  throws CorruptIndexException, IOException {

  final int docEnd = docIDStart + reader.maxDoc();

  boolean any = false;

  //delete by term: delete all documents containing the term.

  TermDocs docs = reader.termDocs();

  try {

    for (Entry<Term, BufferedDeletes.Num> entry: deletesFlushed.terms.entrySet()) {

      Term term = entry.getKey();

      docs.seek(term);

      int limit = entry.getValue().getNum();

      while (docs.next()) {

        int docID = docs.doc();

        if (docIDStart+docID >= limit)

          break;

        reader.deleteDocument(docID);

        any = true;

      }

    }

  } finally {

    docs.close();

  }

  //delete by document number.

  for (Integer docIdInt : deletesFlushed.docIDs) {

    int docID = docIdInt.intValue();

    if (docID >= docIDStart && docID < docEnd) {

      reader.deleteDocument(docID-docIDStart);

      any = true;

    }

  }

  //delete by query: delete all documents matching the query.

  IndexSearcher searcher = new IndexSearcher(reader);

  for (Entry<Query, Integer> entry : deletesFlushed.queries.entrySet()) {

    Query query = entry.getKey();

    int limit = entry.getValue().intValue();

    Weight weight = query.weight(searcher);

    Scorer scorer = weight.scorer(reader, true, false);

    if (scorer != null) {

      while(true)  {

        int doc = scorer.nextDoc();

        if (((long) docIDStart) + doc >= limit)

          break;

        reader.deleteDocument(doc);

        any = true;

      }

    }

  }

  searcher.close();

  return any;

}

 

 

 
