6. Closing the IndexWriter Object
Code:
writer.close();
  --> IndexWriter.closeInternal(boolean)
    --> (1) Flush the in-memory index to disk: flush(waitForMerges, true, true);
Segment merging will be discussed in a later chapter; here we only cover how the index is flushed from memory to disk.
Code:
IndexWriter.flush(boolean triggerMerge, boolean flushDocStores, boolean flushDeletes)
  --> IndexWriter.doFlush(boolean flushDocStores, boolean flushDeletes)
    --> IndexWriter.doFlushInternal(boolean flushDocStores, boolean flushDeletes)
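For context, here is a minimal usage sketch that ends in exactly this call chain (the index path and field name are made up; the constructor shown is the Lucene 2.9/3.0-era one):

import java.io.File;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class CloseDemo {
  public static void main(String[] args) throws Exception {
    IndexWriter writer = new IndexWriter(
        FSDirectory.open(new File("testindex")),      // hypothetical index path
        new StandardAnalyzer(Version.LUCENE_CURRENT),
        IndexWriter.MaxFieldLength.LIMITED);
    Document doc = new Document();
    doc.add(new Field("contents", "hello lucene",
                      Field.Store.YES, Field.Index.ANALYZED)); // hypothetical field
    writer.addDocument(doc);
    writer.close(); // --> closeInternal --> flush(waitForMerges, true, true), traced below
  }
}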
Flushing the index to disk involves the following steps:
- Obtain the name of the segment to write: String segment = docWriter.getSegment();
- Have the DocumentsWriter write its buffered contents into the segment: docWriter.flush(flushDocStores);
- Create the new segment metadata object: newSegment = new SegmentInfo(segment, flushedDocCount, directory, false, true, docStoreOffset, docStoreSegment, docStoreIsCompoundFile, docWriter.hasProx());
- Prepare the document deletes: docWriter.pushDeletes();
- Create the cfs segment: docWriter.createCompoundFile(segment);
- Delete the documents: applyDeletes();
6.1 Obtaining the Segment Name to Write
Code:
SegmentInfo newSegment = null;
final int numDocs = docWriter.getNumDocsInRAM();          // total number of buffered documents
String docStoreSegment = docWriter.getDocStoreSegment();  // segment the stored fields and term vectors are written to, "_0"
int docStoreOffset = docWriter.getDocStoreOffset();       // offset within that doc-store segment at which the stored fields and term vectors are written
String segment = docWriter.getSegment();                  // segment name, "_0"
As described in detail in the chapter on Lucene's index file format, stored fields and term vectors can be stored in a segment different from the one holding the indexed fields.
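For intuition, a hedged illustration (the numbers are made up): several index segments can share a single doc-store segment, distinguished only by their offsets:

  segment "_0": 5 docs, docStoreSegment = "_0", docStoreOffset = 0
  segment "_1": 7 docs, docStoreSegment = "_0", docStoreOffset = 5
  segment "_2": 4 docs, docStoreSegment = "_0", docStoreOffset = 12

All three would then read their stored fields from _0.fdt/_0.fdx, each starting at its own offset.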
6.2 Writing the Buffered Contents to the Segment
Code:
flushedDocCount = docWriter.flush(flushDocStores);
This consists of the following two stages:
- Closing the stored-field and term-vector writers along the basic indexing chain
- Writing the indexing results to the segment following the structure of the basic indexing chain
6.2.1 Closing the Stored-Field and Term-Vector Writers Along the Basic Indexing Chain
The code is:
closeDocStore();
flushState.numDocsInStore = 0;
This walks the basic indexing chain and closes the stored-field and term-vector writers:
- consumer(DocFieldProcessor).closeDocStore(flushState);
  - consumer(DocInverter).closeDocStore(state);
    - consumer(TermsHash).closeDocStore(state);
      - consumer(FreqProxTermsWriter).closeDocStore(state);
      - if (nextTermsHash != null) nextTermsHash.closeDocStore(state);
        - consumer(TermVectorsTermsWriter).closeDocStore(state);
    - endConsumer(NormsWriter).closeDocStore(state);
  - fieldsWriter(StoredFieldsWriter).closeDocStore(state);
Only two of these closeDocStore implementations do real work:
- Closing the term vectors: TermVectorsTermsWriter.closeDocStore(SegmentWriteState)
void closeDocStore(final SegmentWriteState state) throws IOException {
  if (tvx != null) {
    // Write zero entries into tvd for documents that store no term vectors;
    // even such documents occupy a slot in tvx and tvd.
    fill(state.numDocsInStore - docWriter.getDocStoreOffset());
    // Close the tvx, tvf, tvd output streams.
    tvx.close();
    tvf.close();
    tvd.close();
    tvx = null;
    // Record the written file names, so that they can later be bundled into
    // a single cfs (compound) file.
    state.flushedFiles.add(state.docStoreSegmentName + "." + IndexFileNames.VECTORS_INDEX_EXTENSION);
    state.flushedFiles.add(state.docStoreSegmentName + "." + IndexFileNames.VECTORS_FIELDS_EXTENSION);
    state.flushedFiles.add(state.docStoreSegmentName + "." + IndexFileNames.VECTORS_DOCUMENTS_EXTENSION);
    // Remove them from DocumentsWriter's openFiles member; they may later be
    // deleted by IndexFileDeleter.
    docWriter.removeOpenFile(state.docStoreSegmentName + "." + IndexFileNames.VECTORS_INDEX_EXTENSION);
    docWriter.removeOpenFile(state.docStoreSegmentName + "." + IndexFileNames.VECTORS_FIELDS_EXTENSION);
    docWriter.removeOpenFile(state.docStoreSegmentName + "." + IndexFileNames.VECTORS_DOCUMENTS_EXTENSION);
    lastDocID = 0;
  }
}
- Closing the stored fields: StoredFieldsWriter.closeDocStore(SegmentWriteState)
public void closeDocStore(SegmentWriteState state) throws IOException {
  // Close the fdx, fdt output streams.
  fieldsWriter.close();
  fieldsWriter = null;
  lastDocID = 0;
  // Record the written file names (fdt, fdx) and drop them from openFiles,
  // mirroring the term-vector case above.
  state.flushedFiles.add(state.docStoreSegmentName + "." + IndexFileNames.FIELDS_EXTENSION);
  state.flushedFiles.add(state.docStoreSegmentName + "." + IndexFileNames.FIELDS_INDEX_EXTENSION);
  docWriter.removeOpenFile(state.docStoreSegmentName + "." + IndexFileNames.FIELDS_EXTENSION);
  docWriter.removeOpenFile(state.docStoreSegmentName + "." + IndexFileNames.FIELDS_INDEX_EXTENSION);
}
6.2.2 Writing the Indexing Results to the Segment Along the Basic Indexing Chain
The code is:
consumer(DocFieldProcessor).flush(threads, flushState);
  // Recycle fieldHash so it can be used in the next round of indexing; for
  // efficiency, the objects in the indexing chain are reused.
  Map<DocFieldConsumerPerThread, Collection<DocFieldConsumerPerField>> childThreadsAndFields = new HashMap<DocFieldConsumerPerThread, Collection<DocFieldConsumerPerField>>();
  // Write the stored fields.
  --> fieldsWriter(StoredFieldsWriter).flush(state);
  // Write the indexed fields.
  --> consumer(DocInverter).flush(childThreadsAndFields, state);
  // Write the field metadata, and record the written file name for the later cfs file.
  --> final String fileName = state.segmentFileName(IndexFileNames.FIELD_INFOS_EXTENSION);
  --> fieldInfos.write(state.directory, fileName);
  --> state.flushedFiles.add(fileName);
This process, too, follows the basic indexing chain:
- consumer(DocFieldProcessor).flush(…);
  - consumer(DocInverter).flush(…);
    - consumer(TermsHash).flush(…);
      - consumer(FreqProxTermsWriter).flush(…);
      - if (nextTermsHash != null) nextTermsHash.flush(…);
        - consumer(TermVectorsTermsWriter).flush(…);
    - endConsumer(NormsWriter).flush(…);
  - fieldsWriter(StoredFieldsWriter).flush(…);
6.2.2.1 Writing the Stored Fields
The code is:
StoredFieldsWriter.flush(SegmentWriteState state) {
  if (state.numDocsInStore > 0) {
    initFieldsWriter();
    fill(state.numDocsInStore - docWriter.getDocStoreOffset());
  }
  if (fieldsWriter != null)
    fieldsWriter.flush();
}
As the code shows, this is where the fdx and fdt files would be written; but closeDocStore above has already written them, set state.numDocsInStore to zero, and set fieldsWriter to null, so this call actually does nothing here.
6.2.2.2 Writing the Indexed Fields
The code is:
DocInverter.flush(Map<DocFieldConsumerPerThread, Collection<DocFieldConsumerPerField>>, SegmentWriteState)
  // Write the postings and term vectors.
  --> consumer(TermsHash).flush(childThreadsAndFields, state);
  // Write the normalization factors.
  --> endConsumer(NormsWriter).flush(endChildThreadsAndFields, state);
6.2.2.2.1 Writing the Postings and Term Vectors
The code is:
TermsHash.flush(Map<InvertedDocConsumerPerThread, Collection<InvertedDocConsumerPerField>>, SegmentWriteState)
  // Write the postings.
  --> consumer(FreqProxTermsWriter).flush(childThreadsAndFields, state);
  // Recycle the RawPostingLists.
  --> shrinkFreePostings(threadsAndFields, state);
  // Write the term vectors.
  --> if (nextTermsHash != null) nextTermsHash.flush(nextThreadsAndFields, state);
        --> consumer(TermVectorsTermsWriter).flush(childThreadsAndFields, state);
6.2.2.2.1.1 Writing the Postings
The code is:
FreqProxTermsWriter.flush(Map<TermsHashConsumerPerThread, Collection<TermsHashConsumerPerField>>, SegmentWriteState)

(a) Sort all fields by name, so that same-named fields can be processed together.
  Collections.sort(allFields);
  final int numAllFields = allFields.size();

(b) Create the postings writer.
  final FormatPostingsFieldsConsumer consumer = new FormatPostingsFieldsWriter(state, fieldInfos);
  int start = 0;

(c) For each field:
  while(start < numAllFields) {

    (c-1) Collect all fields with the same name.
    final FieldInfo fieldInfo = allFields.get(start).fieldInfo;
    final String fieldName = fieldInfo.name;
    int end = start+1;
    while(end < numAllFields && allFields.get(end).fieldInfo.name.equals(fieldName))
      end++;
    FreqProxTermsWriterPerField[] fields = new FreqProxTermsWriterPerField[end-start];
    for(int i=start;i<end;i++) {
      fields[i-start] = allFields.get(i);
      fieldInfo.storePayloads |= fields[i-start].hasPayloads;
    }

    (c-2) Append the postings of the same-named fields to the files.
    appendPostings(fields, consumer);

    (c-3) Release the space.
    for(int i=0;i<fields.length;i++) {
      TermsHashPerField perField = fields[i].termsHashPerField;
      int numPostings = perField.numPostings;
      perField.reset();
      perField.shrinkHash(numPostings);
      fields[i].reset();
    }

    start = end;
  }

(d) Close the postings writer.
  consumer.finish();
(b) Creating the postings writer
The code is:
public FormatPostingsFieldsWriter(SegmentWriteState state, FieldInfos fieldInfos) throws IOException {
  dir = state.directory;
  segment = state.segmentName;
  totalNumDocs = state.numDocs;
  this.fieldInfos = fieldInfos;
  // Writes the tii and tis files.
  termsOut = new TermInfosWriter(dir, segment, fieldInfos, state.termIndexInterval);
  // Writes the skip lists in the freq and prox files.
  skipListWriter = new DefaultSkipListWriter(termsOut.skipInterval, termsOut.maxSkipLevels, totalNumDocs, null, null);
  // Record the written file names.
  state.flushedFiles.add(state.segmentFileName(IndexFileNames.TERMS_EXTENSION));
  state.flushedFiles.add(state.segmentFileName(IndexFileNames.TERMS_INDEX_EXTENSION));
  // Using the two writers above, write the segment in the proper format.
  termsWriter = new FormatPostingsTermsWriter(state, this);
}
The resulting object structure is as follows:
consumer  FormatPostingsFieldsWriter  (id=119)    // processes one field
  dir             SimpleFSDirectory  (id=126)     // target index directory
  totalNumDocs    8                               // total number of documents
  fieldInfos      FieldInfos  (id=70)             // field metadata
  segment         "_0"                            // segment name
  skipListWriter  DefaultSkipListWriter  (id=133) // writes the skip lists in freq and prox
  termsOut        TermInfosWriter  (id=125)       // writes the tii and tis files
  termsWriter     FormatPostingsTermsWriter  (id=135)  // used to add terms
    currentTerm       null
    currentTermStart  0
    fieldInfo         null
    freqStart         0
    proxStart         0
    termBuffer        null
    termsOut          TermInfosWriter  (id=125)
    docsWriter        FormatPostingsDocsWriter  (id=139)  // writes the doc IDs and freqs of the current term
      df              0
      fieldInfo       null
      freqStart       0
      lastDocID       0
      omitTermFreqAndPositions  false
      out             SimpleFSDirectory$SimpleFSIndexOutput  (id=144)
      skipInterval    16
      skipListWriter  DefaultSkipListWriter  (id=133)
      storePayloads   false
      termInfo        TermInfo  (id=151)
      totalNumDocs    8
      posWriter       FormatPostingsPositionsWriter  (id=146)  // writes the positions of the current term within each document
        lastPayloadLength  -1
        lastPosition       0
        omitTermFreqAndPositions  false
        out            SimpleFSDirectory$SimpleFSIndexOutput  (id=157)
        parent         FormatPostingsDocsWriter  (id=139)
        storePayloads  false
- FormatPostingsFieldsWriter.addField(FieldInfo field) adds an indexed field and returns a FormatPostingsTermsConsumer used to add its terms.
- FormatPostingsTermsConsumer.addTerm(char[] text, int start) adds a term and returns a FormatPostingsDocsConsumer used to add its doc ID and freq entries.
- FormatPostingsDocsConsumer.addDoc(int docID, int termDocFreq) adds a doc ID and frequency and returns a FormatPostingsPositionsConsumer used to add prox entries.
- FormatPostingsPositionsConsumer.addPosition(int position, byte[] payload, int payloadOffset, int payloadLength) adds a prox (position) entry; see the sketch below.
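To make the nesting of these four consumers concrete, here is a minimal sketch (not taken from the Lucene source; the setup of consumer, fieldInfo, termText and the other variables is assumed):

FormatPostingsTermsConsumer termsConsumer = consumer.addField(fieldInfo);        // once per field
FormatPostingsDocsConsumer docsConsumer = termsConsumer.addTerm(termText, 0);    // once per term
FormatPostingsPositionsConsumer posConsumer = docsConsumer.addDoc(docID, freq);  // once per document
posConsumer.addPosition(position, payload, 0, payloadLength);                    // once per occurrence
posConsumer.finish();    // done with this document's positions
docsConsumer.finish();   // done with this term's postings
termsConsumer.finish();  // done with this field's terms
consumer.finish();       // done with all fields

This is exactly the calling pattern that appendPostings follows below.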
(c-2) Appending the postings of the same-named fields to the files
The code is:
FreqProxTermsWriter.appendPostings(FreqProxTermsWriterPerField[], FormatPostingsFieldsConsumer) {

  int numFields = fields.length;
  final FreqProxFieldMergeState[] mergeStates = new FreqProxFieldMergeState[numFields];
  for(int i=0;i<numFields;i++) {
    FreqProxFieldMergeState fms = mergeStates[i] = new FreqProxFieldMergeState(fields[i]);
    boolean result = fms.nextTerm(); // position every field at its first term
  }

  (1) Add the field. Although there are several fields here, they all share the same
      name, so the first one's information suffices. The returned object is used to
      add the terms of this field.
  final FormatPostingsTermsConsumer termsConsumer = consumer.addField(fields[0].fieldInfo);
  FreqProxFieldMergeState[] termStates = new FreqProxFieldMergeState[numFields];
  final boolean currentFieldOmitTermFreqAndPositions = fields[0].fieldInfo.omitTermFreqAndPositions;

  (2) This while loop iterates over the fields that still have unprocessed terms,
      handling their terms in dictionary order. When all terms of a field have been
      processed, numFields is decremented and the field is removed from mergeStates.
      The loop exits only once all terms of all fields have been processed.
  while(numFields > 0) {

    (2-1) Find the term that comes next in dictionary order across all fields. Several
          same-named fields may contain the same term, so all numFields fields are
          scanned; numToMerge counts how many fields contain that term.
    termStates[0] = mergeStates[0];
    int numToMerge = 1;
    for(int i=1;i<numFields;i++) {
      final char[] text = mergeStates[i].text;
      final int textOffset = mergeStates[i].textOffset;
      final int cmp = compareText(text, textOffset, termStates[0].text, termStates[0].textOffset);
      if (cmp < 0) {
        termStates[0] = mergeStates[i];
        numToMerge = 1;
      } else if (cmp == 0)
        termStates[numToMerge++] = mergeStates[i];
    }

    (2-2) Add the term; the returned FormatPostingsDocsConsumer is used to add the doc IDs and frequencies (freq).
    final FormatPostingsDocsConsumer docConsumer = termsConsumer.addTerm(termStates[0].text, termStates[0].textOffset);

    (2-3) numToMerge fields contain this term, and each keeps a list of the doc IDs of
          the documents containing it. This loop walks all fields containing the term
          and adds the doc IDs and frequencies in ascending doc ID order. Once all doc
          IDs of a field for this term have been consumed, numToMerge is decremented
          and the field is removed from termStates. The loop ends once every doc ID of
          every field containing this term has been processed.
    while(numToMerge > 0) {

      (2-3-1) Find the smallest doc ID.
      FreqProxFieldMergeState minState = termStates[0];
      for(int i=1;i<numToMerge;i++)
        if (termStates[i].docID < minState.docID)
          minState = termStates[i];
      final int termDocFreq = minState.termFreq;

      (2-3-2) Add the doc ID and frequency, building the skip list along the way. The
              returned FormatPostingsPositionsConsumer is used to add position (prox) information.
      final FormatPostingsPositionsConsumer posConsumer = docConsumer.addDoc(minState.docID, termDocFreq);
      // ByteSliceReader reads the prox information from the byte pool.
      final ByteSliceReader prox = minState.prox;

      if (!currentFieldOmitTermFreqAndPositions) {
        int position = 0;

        (2-3-3) This loop adds the position information for the current document.
        for(int j=0;j<termDocFreq;j++) {
          final int code = prox.readVInt();
          position += code >> 1;
          final int payloadLength;
          // If this position carries a payload, read it from the byte pool; otherwise its length is zero.
          if ((code & 1) != 0) {
            payloadLength = prox.readVInt();
            if (payloadBuffer == null || payloadBuffer.length < payloadLength)
              payloadBuffer = new byte[payloadLength];
            prox.readBytes(payloadBuffer, 0, payloadLength);
          } else
            payloadLength = 0;
          // Add the position (prox) information.
          posConsumer.addPosition(position, payloadBuffer, 0, payloadLength);
        }
        posConsumer.finish();
      }

      (2-3-4) Check the exit conditions: advance the field just consumed to its next doc
              ID. If there is none, the field is done with the current term, so remove it
              from termStates and decrement numToMerge. Then advance it to its next term,
              so that it takes part in the next round of loop (2). If it has no next term
              either, all its terms have been processed: remove it from mergeStates and
              decrement numFields. When numFields reaches 0, loop (2) ends as well.
      if (!minState.nextDoc()) { // fetch the next doc ID
        // This field has no further documents for the current term: remove it from termStates and decrement numToMerge.
        int upto = 0;
        for(int i=0;i<numToMerge;i++)
          if (termStates[i] != minState)
            termStates[upto++] = termStates[i];
        numToMerge--;
        // Advance this field to its next term, to be merged at loop (2).
        if (!minState.nextTerm()) {
          // No more terms in this field: remove it from mergeStates and decrement numFields.
          upto = 0;
          for(int i=0;i<numFields;i++)
            if (mergeStates[i] != minState)
              mergeStates[upto++] = mergeStates[i];
          numFields--;
        }
      }
    }

    (2-4) At this point the doc IDs and freqs have been written to the segment files,
          but the skip data is still sitting in the skip buffer; finish() writes it to
          the file, and the term dictionary entry (tii, tis) is written as well.
    docConsumer(FormatPostingsDocsWriter).finish();
  }

  termsConsumer.finish();
}
(2-3-4) The code for fetching the next doc ID is as follows:
public boolean nextDoc() { // how the next doc ID is fetched
  if (freq.eof()) { // the freq information in the byte pool has been fully read
    if (p.lastDocCode != -1) { // per the cache management described earlier, the PostingList still holds the last document's doc ID and freq, so return that last document
      docID = p.lastDocID;
      if (!field.omitTermFreqAndPositions)
        termFreq = p.docFreq;
      p.lastDocCode = -1;
      return true;
    } else
      return false; // no next document
  }
  final int code = freq.readVInt(); // the byte pool still holds freq information
  if (field.omitTermFreqAndPositions)
    docID += code;
  else {
    // Decode the doc ID and frequency.
    docID += code >>> 1;
    if ((code & 1) != 0)
      termFreq = 1;
    else
      termFreq = freq.readVInt();
  }
  return true;
}
(2-3-2) The code for adding the doc ID and frequency is as follows:
FormatPostingsPositionsConsumer FormatPostingsDocsWriter.addDoc(int docID, int termDocFreq) {
  final int delta = docID - lastDocID;
  // Whenever the document count reaches a multiple of skipInterval, add a skip entry.
  if ((++df % skipInterval) == 0) {
    skipListWriter.setSkipData(lastDocID, storePayloads, posWriter.lastPayloadLength);
    skipListWriter.bufferSkip(df);
  }
  lastDocID = docID;
  if (omitTermFreqAndPositions)
    out.writeVInt(delta);
  else if (1 == termDocFreq)
    out.writeVInt((delta<<1) | 1);
  else {
    // Write the doc ID delta and the frequency.
    out.writeVInt(delta<<1);
    out.writeVInt(termDocFreq);
  }
  return posWriter;
}
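As a worked example of this encoding (hypothetical postings, not from the source), consider one term occurring in document 3 with freq 1 and in document 7 with freq 2:

// doc 3: delta = 3 - 0 = 3, freq == 1 -> the freq is folded into the low bit:
//        out.writeVInt((3<<1) | 1)   i.e. VInt(7)
// doc 7: delta = 7 - 3 = 4, freq == 2 -> two values are written:
//        out.writeVInt(4<<1)         i.e. VInt(8)
//        out.writeVInt(2)

nextDoc() above reverses exactly this: it shifts the code right one bit to recover the delta and uses the low bit to decide whether a separate freq must be read. The position deltas written by addPosition below use the same trick, with the low bit flagging a payload-length change instead.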
(2-3-3) Adding the position information:
FormatPostingsPositionsWriter.addPosition(int position, byte[] payload, int payloadOffset, int payloadLength) {
  final int delta = position - lastPosition;
  lastPosition = position;
  if (storePayloads) {
    // Write the position together with its payload information.
    if (payloadLength != lastPayloadLength) {
      lastPayloadLength = payloadLength;
      out.writeVInt((delta<<1)|1);
      out.writeVInt(payloadLength);
    } else
      out.writeVInt(delta << 1);
    if (payloadLength > 0)
      out.writeBytes(payload, payloadLength);
  } else
    out.writeVInt(delta);
}
(2-4) Writing the skip data and the term dictionary (tii, tis) to the files
FormatPostingsDocsWriter.finish() {
  // Flush the buffered skip data to the file.
  long skipPointer = skipListWriter.writeSkip(out);
  if (df > 0) {
    // Write the term dictionary entry (TermInfo) to the tii and tis files.
    parent.termsOut(TermInfosWriter).add(fieldInfo.number, utf8.result, utf8.length, termInfo);
  }
}
Flushing the buffered skip data to the file:
DefaultSkipListWriter(MultiLevelSkipListWriter).writeSkip(IndexOutput) {
  long skipPointer = output.getFilePointer();
  if (skipBuffer == null || skipBuffer.length == 0) return skipPointer;
  // As analyzed in the chapter on the index file format, the higher levels come
  // first and the lower levels after; every level except the lowest is preceded
  // by its length.
  for (int level = numberOfSkipLevels - 1; level > 0; level--) {
    long length = skipBuffer[level].getFilePointer();
    if (length > 0) {
      output.writeVLong(length);
      skipBuffer[level].writeTo(output);
    }
  }
  // Write the lowest level.
  skipBuffer[0].writeTo(output);
  return skipPointer;
}
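Assuming, say, three skip levels, the bytes emitted by this method would be laid out as follows (a sketch of the loop above, not a dump from a real index):

  VLong(length of level 2) | level-2 skip data
  VLong(length of level 1) | level-1 skip data
  level-0 skip data          (lowest level, no length prefix; it extends to the end)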
Writing the term dictionary (TermInfo) to the tii and tis files:
- The tii file is a kind of skip structure over the tis file: every indexInterval-th term of tis is copied into tii, so that a term can be looked up quickly.
- The TermInfosWriter class therefore has a member variable other, itself of type TermInfosWriter, and a member variable isIndex indicating whether the object writes the tii file or the tis file.
- If a TermInfosWriter object has isIndex=false, it writes the tis file, and its other points to the TermInfosWriter that writes the tii file.
- If a TermInfosWriter object has isIndex=true, it writes the tii file, and its other points to the TermInfosWriter that writes the tis file.
TermInfosWriter.add(int fieldNumber, byte[] termBytes, int termBytesLength, TermInfo ti) {
  // If the number of terms so far is a multiple of indexInterval, the term also goes into the tii file.
  if (!isIndex && size % indexInterval == 0)
    other.add(lastFieldNumber, lastTermBytes, lastTermBytesLength, lastTi);
  // Write the term into the tis file.
  writeTerm(fieldNumber, termBytes, termBytesLength);
  output.writeVInt(ti.docFreq);                            // write doc freq
  output.writeVLong(ti.freqPointer - lastTi.freqPointer);  // write pointers
  output.writeVLong(ti.proxPointer - lastTi.proxPointer);
  if (ti.docFreq >= skipInterval) {
    output.writeVInt(ti.skipOffset);
  }
  if (isIndex) {
    output.writeVLong(other.output.getFilePointer() - lastIndexPointer);
    lastIndexPointer = other.output.getFilePointer();      // write pointer
  }
  lastFieldNumber = fieldNumber;
  lastTi.set(ti);
  size++;
}
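For intuition, here is a hedged sketch (hypothetical names; this is not Lucene's actual TermInfosReader) of how a reader exploits this layout: binary-search the sparse tii entries in memory, then scan at most indexInterval terms of tis sequentially:

class TermIndexSketch {
  String[] indexTerms;  // every indexInterval-th term, loaded from tii
  long[] indexPointers; // for each of them, its file position in tis

  // Returns the tis position from which a sequential scan of fewer than
  // indexInterval terms is guaranteed to reach `target` (or its insertion point).
  long seekPosition(String target) {
    int lo = 0, hi = indexTerms.length - 1, hit = 0;
    while (lo <= hi) {
      int mid = (lo + hi) >>> 1;
      if (indexTerms[mid].compareTo(target) <= 0) { hit = mid; lo = mid + 1; }
      else hi = mid - 1;
    }
    return indexPointers[hit];
  }
}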
6.2.2.2.1.2 Writing the Term Vectors
The code is:
TermVectorsTermsWriter.flush(Map<TermsHashConsumerPerThread, Collection<TermsHashConsumerPerField>> threadsAndFields, final SegmentWriteState state) {
  if (tvx != null) {
    if (state.numDocsInStore > 0)
      fill(state.numDocsInStore - docWriter.getDocStoreOffset());
    tvx.flush();
    tvd.flush();
    tvf.flush();
  }
  for (Map.Entry<TermsHashConsumerPerThread, Collection<TermsHashConsumerPerField>> entry : threadsAndFields.entrySet()) {
    for (final TermsHashConsumerPerField field : entry.getValue()) {
      TermVectorsTermsWriterPerField perField = (TermVectorsTermsWriterPerField) field;
      perField.termsHashPerField.reset();
      perField.shrinkHash();
    }
    TermVectorsTermsWriterPerThread perThread = (TermVectorsTermsWriterPerThread) entry.getKey();
    perThread.termsHashPerThread.reset(true);
  }
}
As the code shows, this is where the tvx, tvd and tvf files would be written; but closeDocStore above has already written them and set tvx to null, so nothing is written here. The method only clears postingsHash so that the objects can be reused in the next round of indexing.
6.2.2.2.2 Writing the Normalization Factors
The code is:
NormsWriter.flush(Map<InvertedDocEndConsumerPerThread, Collection<InvertedDocEndConsumerPerField>> threadsAndFields, SegmentWriteState state) {

  final Map<FieldInfo, List<NormsWriterPerField>> byField = new HashMap<FieldInfo, List<NormsWriterPerField>>();

  // Walk all fields, collecting the NormsWriterPerField objects of same-named fields into the same list.
  for (final Map.Entry<InvertedDocEndConsumerPerThread, Collection<InvertedDocEndConsumerPerField>> entry : threadsAndFields.entrySet()) {
    final Collection<InvertedDocEndConsumerPerField> fields = entry.getValue();
    final Iterator<InvertedDocEndConsumerPerField> fieldsIt = fields.iterator();
    while (fieldsIt.hasNext()) {
      final NormsWriterPerField perField = (NormsWriterPerField) fieldsIt.next();
      List<NormsWriterPerField> l = byField.get(perField.fieldInfo);
      if (l == null) {
        l = new ArrayList<NormsWriterPerField>();
        byField.put(perField.fieldInfo, l);
      }
      l.add(perField);
    }
  }

  // Record the written file name, for the later cfs file.
  final String normsFileName = state.segmentName + "." + IndexFileNames.NORMS_EXTENSION;
  state.flushedFiles.add(normsFileName);
  IndexOutput normsOut = state.directory.createOutput(normsFileName);

  try {
    // Write the nrm file header.
    normsOut.writeBytes(SegmentMerger.NORMS_HEADER, 0, SegmentMerger.NORMS_HEADER.length);

    final int numField = fieldInfos.size();
    int normCount = 0;

    // Process every field.
    for(int fieldNumber=0;fieldNumber<numField;fieldNumber++) {
      final FieldInfo fieldInfo = fieldInfos.fieldInfo(fieldNumber);
      // Get the list of same-named fields.
      List<NormsWriterPerField> toMerge = byField.get(fieldInfo);
      int upto = 0;
      if (toMerge != null) {
        final int numFields = toMerge.size();
        normCount++;
        final NormsWriterPerField[] fields = new NormsWriterPerField[numFields];
        int[] uptos = new int[numFields];
        for(int j=0;j<numFields;j++)
          fields[j] = toMerge.get(j);
        int numLeft = numFields;
        // Merge the same-named fields.
        while(numLeft > 0) {
          // Find the smallest doc ID among all the same-named fields.
          int minLoc = 0;
          int minDocID = fields[0].docIDs[uptos[0]];
          for(int j=1;j<numLeft;j++) {
            final int docID = fields[j].docIDs[uptos[j]];
            if (docID < minDocID) {
              minDocID = docID;
              minLoc = j;
            }
          }
          // In the nrm file every document has a slot; documents without a value get the default norm.
          for (;upto<minDocID;upto++)
            normsOut.writeByte(defaultNorm);
          // Write the current norm value.
          normsOut.writeByte(fields[minLoc].norms[uptos[minLoc]]);
          (uptos[minLoc])++;
          upto++;
          // Once a field's documents are exhausted, decrement numLeft; the loop exits when it reaches zero.
          if (uptos[minLoc] == fields[minLoc].upto) {
            fields[minLoc].reset();
            if (minLoc != numLeft-1) {
              fields[minLoc] = fields[numLeft-1];
              uptos[minLoc] = uptos[numLeft-1];
            }
            numLeft--;
          }
        }
        // Write the default norm for all remaining documents that have no value set.
        for(;upto<state.numDocs;upto++)
          normsOut.writeByte(defaultNorm);
      } else if (fieldInfo.isIndexed && !fieldInfo.omitNorms) {
        normCount++;
        // Fill the entire field with the default norm:
        for(;upto<state.numDocs;upto++)
          normsOut.writeByte(defaultNorm);
      }
    }
  } finally {
    normsOut.close();
  }
}
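As a hedged worked example (made-up numbers): with state.numDocs = 8 and a field that appears only in documents 2 and 5, the loops above would emit eight bytes for that field:

  defaultNorm, defaultNorm, norms[doc2], defaultNorm, defaultNorm, norms[doc5], defaultNorm, defaultNorm

Every document thus occupies exactly one byte per normed field, which is what lets a reader index into the nrm file directly by doc ID.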
6.2.2.3 Writing the Field Metadata
The code is:
FieldInfos.write(IndexOutput) {
  output.writeVInt(CURRENT_FORMAT);
  output.writeVInt(size());
  for (int i = 0; i < size(); i++) {
    FieldInfo fi = fieldInfo(i);
    byte bits = 0x0;
    if (fi.isIndexed) bits |= IS_INDEXED;
    if (fi.storeTermVector) bits |= STORE_TERMVECTOR;
    if (fi.storePositionWithTermVector) bits |= STORE_POSITIONS_WITH_TERMVECTOR;
    if (fi.storeOffsetWithTermVector) bits |= STORE_OFFSET_WITH_TERMVECTOR;
    if (fi.omitNorms) bits |= OMIT_NORMS;
    if (fi.storePayloads) bits |= STORE_PAYLOADS;
    if (fi.omitTermFreqAndPositions) bits |= OMIT_TERM_FREQ_AND_POSITIONS;
    output.writeString(fi.name);
    output.writeByte(bits);
  }
}
This follows the fnm file format almost verbatim: for each field, its name is written, followed by a single flag byte (for example, a field that is indexed with norms omitted would get the byte IS_INDEXED | OMIT_NORMS).
6.3 Creating the New Segment Metadata Object
Code:
newSegment = new SegmentInfo(segment, flushedDocCount, directory, false, true, docStoreOffset, docStoreSegment, docStoreIsCompoundFile, docWriter.hasProx());
segmentInfos.add(newSegment);
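The meaning of the constructor arguments, annotated against the Lucene 2.9-era SegmentInfo signature (a hedged reading; the argument names follow the source):

newSegment = new SegmentInfo(
    segment,                 // segment name, e.g. "_0"
    flushedDocCount,         // number of documents flushed into this segment
    directory,               // directory the segment lives in
    false,                   // isCompoundFile: not packed into a cfs file yet
    true,                    // hasSingleNormFile: all norms share one .nrm file
    docStoreOffset,          // offset of this segment within the shared doc store
    docStoreSegment,         // doc-store segment holding fdt/fdx and tvx/tvd/tvf
    docStoreIsCompoundFile,  // whether that doc store is itself a compound file
    docWriter.hasProx());    // whether any field indexes positions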
6.4 Preparing the Document Deletes
Code:
docWriter.pushDeletes();
  --> deletesFlushed.update(deletesInRAM);
Here everything in deletesInRAM is merged into deletesFlushed, and deletesInRAM is cleared; the reason was explained above.
6.5 Creating the cfs Segment
Code:
docWriter.createCompoundFile(segment);
newSegment.setUseCompoundFile(true);
The code is:
DocumentsWriter.createCompoundFile(String segment) {
  CompoundFileWriter cfsWriter = new CompoundFileWriter(directory, segment + "." + IndexFileNames.COMPOUND_FILE_EXTENSION);
  // Add every file name recorded above to the cfs writer.
  for (final String flushedFile : flushState.flushedFiles)
    cfsWriter.addFile(flushedFile);
  cfsWriter.close();
}
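For the example segment "_0", flushedFiles would typically contain entries like the following (a hedged list; the exact set depends on what was indexed and on whether the doc store was closed by this flush):

  _0.fnm                   // field metadata
  _0.frq, _0.prx           // doc/freq postings and positions
  _0.tis, _0.tii           // term dictionary and its index
  _0.nrm                   // norms
  _0.fdt, _0.fdx           // stored fields (if this flush closed the doc store)
  _0.tvx, _0.tvd, _0.tvf   // term vectors (likewise)

all of which CompoundFileWriter copies into the single file _0.cfs.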
6.6 Deleting the Documents
Code:
applyDeletes();
The code is:
boolean applyDeletes(SegmentInfos infos) {
  if (!hasDeletes())
    return false;
  final int infosEnd = infos.size();
  int docStart = 0;
  boolean any = false;
  for (int i = 0; i < infosEnd; i++) {
    assert infos.info(i).dir == directory;
    SegmentReader reader = writer.readerPool.get(infos.info(i), false);
    try {
      any |= applyDeletes(reader, docStart);
      docStart += reader.maxDoc();
    } finally {
      writer.readerPool.release(reader);
    }
  }
  deletesFlushed.clear();
  return any;
}
- Lucene documents can be deleted through a reader or through a writer, but ultimately the deletion is always carried out by a reader.
- A reader supports three ways of deleting:
  - By term: delete every document containing the term.
  - By document number.
  - By query: delete every document matching the query.
- All three ultimately reduce to deletion by document number, i.e. to writing the .del file, as the code below shows.
private final synchronized boolean applyDeletes(IndexReader reader, int docIDStart) throws CorruptIndexException, IOException {

  final int docEnd = docIDStart + reader.maxDoc();
  boolean any = false;

  // Delete by term: delete every document containing the term.
  TermDocs docs = reader.termDocs();
  try {
    for (Entry<Term, BufferedDeletes.Num> entry : deletesFlushed.terms.entrySet()) {
      Term term = entry.getKey();
      docs.seek(term);
      int limit = entry.getValue().getNum();
      while (docs.next()) {
        int docID = docs.doc();
        if (docIDStart+docID >= limit)
          break;
        reader.deleteDocument(docID);
        any = true;
      }
    }
  } finally {
    docs.close();
  }

  // Delete by document number.
  for (Integer docIdInt : deletesFlushed.docIDs) {
    int docID = docIdInt.intValue();
    if (docID >= docIDStart && docID < docEnd) {
      reader.deleteDocument(docID-docIDStart);
      any = true;
    }
  }

  // Delete by query: delete every document matching the query.
  IndexSearcher searcher = new IndexSearcher(reader);
  for (Entry<Query, Integer> entry : deletesFlushed.queries.entrySet()) {
    Query query = entry.getKey();
    int limit = entry.getValue().intValue();
    Weight weight = query.weight(searcher);
    Scorer scorer = weight.scorer(reader, true, false);
    if (scorer != null) {
      while(true) {
        int doc = scorer.nextDoc();
        if (((long) docIDStart) + doc >= limit)
          break;
        reader.deleteDocument(doc);
        any = true;
      }
    }
  }
  searcher.close();
  return any;
}
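For completeness, a hedged sketch of the three user-facing deletion routes whose buffered results (deletesFlushed.terms / .docIDs / .queries) are consumed above; the field names are made up:

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.TermQuery;

class DeleteRoutes {
  static void demo(IndexWriter writer, IndexReader reader) throws Exception {
    writer.deleteDocuments(new Term("id", "doc-42"));                 // by term: every doc containing it
    writer.deleteDocuments(new TermQuery(new Term("type", "draft"))); // by query: every doc matching it
    reader.deleteDocument(5);                                         // by doc number: only possible through an IndexReader
  }
}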