The previous article traced the indexing path from IndexWriter to DefaultIndexChain and outlined the basic flow of DefaultIndexChain, which is essentially:

each DWPT processes the documents it receives one by one --> within a document, each Field is processed in turn --> each Field is handled differently depending on whether it is tokenized, whether it is stored, and whether it has doc values.

So DWPTs work in parallel with each other, while the work inside a single DWPT is serial.

Processing a field concretely means writing information into a TermsHashPerField, and all of these are managed by a single TermsHash. You can think of TermsHash as a buffer shared within one DWPT: it is used to build the index in memory, and it is flushed to disk when a flush is needed. Each DWPT, once flushed, effectively becomes one segment; depending on the configured segment size, segment merges may follow. That is a story for later, and we will dedicate a future article to flushing. In this article we continue from the previous one and analyze pf.invert(), whose code is shown below:
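As a rough, plain-Java illustration of this model (the class and method names here are invented for the example, not Lucene's), each "DWPT" can be pictured as a worker that owns a private in-memory inverted map, indexes its batch of documents serially, and would emit that map as its own segment on flush:

```java
import java.util.*;
import java.util.concurrent.*;

public class DwptSketch {
    // Each "DWPT" owns a private term -> docIDs map; no locking is needed
    // while indexing, because nothing is shared between writer threads.
    static Map<String, List<Integer>> indexDocs(List<String> docs, int baseDocId) {
        Map<String, List<Integer>> inverted = new TreeMap<>();
        for (int i = 0; i < docs.size(); i++) {               // serial within the DWPT
            for (String term : docs.get(i).split("\\s+")) {   // per-field/token processing
                inverted.computeIfAbsent(term, t -> new ArrayList<>()).add(baseDocId + i);
            }
        }
        return inverted; // on flush, this buffer would be written out as one segment
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        // Two "DWPTs" index disjoint document batches in parallel.
        Future<Map<String, List<Integer>>> seg1 =
            pool.submit(() -> indexDocs(List.of("a b", "b c"), 0));
        Future<Map<String, List<Integer>>> seg2 =
            pool.submit(() -> indexDocs(List.of("c d"), 2));
        System.out.println("segment1=" + seg1.get()); // segment1={a=[0], b=[0, 1], c=[1]}
        System.out.println("segment2=" + seg2.get()); // segment2={c=[2], d=[2]}
        pool.shutdown();
    }
}
```

The real TermsHash buffers are far more compact than a map of lists, but the concurrency shape is the same: parallelism comes only from having multiple independent buffers, never from sharing one.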
public void invert(IndexableField field, boolean first) throws IOException, AbortingException {
  if (first) {
    // First time we're seeing this field (indexed) in
    // this document:
    invertState.reset(); // reset the per-field state on first use
  }

  IndexableFieldType fieldType = field.fieldType();
  IndexOptions indexOptions = fieldType.indexOptions();
  fieldInfo.setIndexOptions(indexOptions);

  if (fieldType.omitNorms()) {
    fieldInfo.setOmitsNorms();
  }

  final boolean analyzed = fieldType.tokenized() && docState.analyzer != null;

  // only bother checking offsets if something will consume them.
  // TODO: after we fix analyzers, also check if termVectorOffsets will be indexed.
  final boolean checkOffsets = indexOptions == IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS;

  /*
   * To assist people in tracking down problems in analysis components, we wish to write the field name to the
   * infostream when we fail. We expect some caller to eventually deal with the real exception, so we don't want any
   * 'catch' clauses, but rather a finally that takes note of the problem.
   */
  boolean succeededInProcessingField = false;
  try (TokenStream stream = tokenStream = field.tokenStream(docState.analyzer, tokenStream)) { // get the token stream of the field's content
    // reset the TokenStream to the first token
    stream.reset();
    invertState.setAttributeSource(stream);
    termsHashPerField.start(field, first);

    while (stream.incrementToken()) { // process every term the field's content is split into
      // If we hit an exception in stream.next below
      // (which is fairly common, e.g. if analyzer
      // chokes on a given document), then it's
      // non-aborting and (above) this one document
      // will be marked as deleted, but still
      // consume a docID

      int posIncr = invertState.posIncrAttribute.getPositionIncrement();
      invertState.position += posIncr;
      if (invertState.position < invertState.lastPosition) {
        if (posIncr == 0) {
          throw new IllegalArgumentException("first position increment must be > 0 (got 0) for field '"
              + field.name() + "'");
        } else {
          throw new IllegalArgumentException("position increments (and gaps) must be >= 0 (got " + posIncr
              + ") for field '" + field.name() + "'");
        }
      } else if (invertState.position > IndexWriter.MAX_POSITION) {
        throw new IllegalArgumentException("position " + invertState.position + " is too large for field '"
            + field.name() + "': max allowed position is " + IndexWriter.MAX_POSITION);
      }
      invertState.lastPosition = invertState.position;
      if (posIncr == 0) {
        invertState.numOverlap++;
      }

      if (checkOffsets) {
        int startOffset = invertState.offset + invertState.offsetAttribute.startOffset();
        int endOffset = invertState.offset + invertState.offsetAttribute.endOffset();
        if (startOffset < invertState.lastStartOffset || endOffset < startOffset) {
          throw new IllegalArgumentException(
              "startOffset must be non-negative, and endOffset must be >= startOffset, and offsets must not go backwards "
                  + "startOffset=" + startOffset + ",endOffset=" + endOffset + ",lastStartOffset="
                  + invertState.lastStartOffset + " for field '" + field.name() + "'");
        }
        invertState.lastStartOffset = startOffset;
      }

      invertState.length++;
      if (invertState.length < 0) {
        throw new IllegalArgumentException("too many tokens in field '" + field.name() + "'");
      }
      // System.out.println("  term=" + invertState.termAttribute);

      // If we hit an exception in here, we abort
      // all buffered documents since the last
      // flush, on the likelihood that the
      // internal state of the terms hash is now
      // corrupt and should not be flushed to a
      // new segment:
      try {
        // add the term to the in-memory index; there are two consumers in the chain:
        // FreqProxTermsWriterPerField and TermVectorsConsumerPerField
        termsHashPerField.add();
      } catch (MaxBytesLengthExceededException e) {
        byte[] prefix = new byte[30];
        BytesRef bigTerm = invertState.termAttribute.getBytesRef();
        System.arraycopy(bigTerm.bytes, bigTerm.offset, prefix, 0, 30);
        String msg = "Document contains at least one immense term in field=\""
            + fieldInfo.name
            + "\" (whose UTF8 encoding is longer than the max length "
            + DocumentsWriterPerThread.MAX_TERM_LENGTH_UTF8
            + "), all of which were skipped. Please correct the analyzer to not produce such terms. The prefix of the first immense term is: '"
            + Arrays.toString(prefix) + "...', original message: " + e.getMessage();
        if (docState.infoStream.isEnabled("IW")) {
          docState.infoStream.message("IW", "ERROR: " + msg);
        }
        // Document will be deleted above:
        throw new IllegalArgumentException(msg, e);
      } catch (Throwable th) {
        throw AbortingException.wrap(th);
      }
    }

    // trigger streams to perform end-of-stream operations
    stream.end(); // the token stream is exhausted

    // TODO: maybe add some safety? then again, it's already checked
    // when we come back around to the field...
    invertState.position += invertState.posIncrAttribute.getPositionIncrement();
    invertState.offset += invertState.offsetAttribute.endOffset();

    /* if there is an exception coming through, we won't set this to true here: */
    succeededInProcessingField = true;
  } finally {
    if (!succeededInProcessingField && docState.infoStream.isEnabled("DW")) {
      docState.infoStream.message("DW", "An exception was thrown while processing field " + fieldInfo.name);
    }
  }

  if (analyzed) {
    invertState.position += docState.analyzer.getPositionIncrementGap(fieldInfo.name);
    invertState.offset += docState.analyzer.getOffsetGap(fieldInfo.name);
  }

  invertState.boost *= field.boost();
}
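The position bookkeeping in the loop above can be exercised on its own. The following standalone sketch (a hypothetical class, not Lucene code) mirrors the posIncr accumulation and overlap counting, assuming the position starts at -1 the way FieldInvertState.reset() leaves it:

```java
public class PositionDemo {
    int position = -1;   // mirrors invertState.position before the first token
    int numOverlap = 0;  // tokens whose positionIncrement == 0 (e.g. injected synonyms)

    // Mirrors the bookkeeping in invert(): each token advances the position by
    // its increment; an increment of 0 stacks the token on the previous one.
    void accept(int posIncr) {
        if (posIncr < 0) {
            throw new IllegalArgumentException("position increments must be >= 0");
        }
        position += posIncr;
        if (posIncr == 0) {
            numOverlap++;
        }
    }

    public static void main(String[] args) {
        PositionDemo d = new PositionDemo();
        // "quick" (+1), synonym "fast" (+0, same position), "fox" (+1)
        d.accept(1);
        d.accept(0);
        d.accept(1);
        System.out.println("lastPosition=" + d.position + " overlaps=" + d.numOverlap);
        // prints: lastPosition=1 overlaps=1
    }
}
```

This is why a synonym token does not consume a position: "fast" and "quick" both sit at position 0, and phrase queries treat them as interchangeable at that slot.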
The termsHashPerField.add() method is the important one: it takes each token and fills the in-memory buffers. The code is as follows:
void add() throws IOException {
  // We are first in the chain so we must "intern" the
  // term text into textStart address

  // Get the text & hash of this term; returns termID >= 0 if the term is new,
  // or a negative value, -(termID + 1), if it has been seen before
  int termID = bytesHash.add(termAtt.getBytesRef());
  // System.out.println("add term=" + termBytesRef.utf8ToString() + " doc=" + docState.docID + " termID=" + termID);

  if (termID >= 0) { // New posting
    bytesHash.byteStart(termID);
    // Init stream slices
    if (numPostingInt + intPool.intUpto > IntBlockPool.INT_BLOCK_SIZE) {
      intPool.nextBuffer(); // allocate a new buffer in intPool
    }

    if (ByteBlockPool.BYTE_BLOCK_SIZE - bytePool.byteUpto < numPostingInt * ByteBlockPool.FIRST_LEVEL_SIZE) {
      bytePool.nextBuffer(); // allocate a new buffer in bytePool
    }

    intUptos = intPool.buffer;
    intUptoStart = intPool.intUpto;
    intPool.intUpto += streamCount;

    postingsArray.intStarts[termID] = intUptoStart + intPool.intOffset; // record intStart

    for (int i = 0; i < streamCount; i++) {
      final int upto = bytePool.newSlice(ByteBlockPool.FIRST_LEVEL_SIZE);
      intUptos[intUptoStart + i] = upto + bytePool.byteOffset;
    }
    postingsArray.byteStarts[termID] = intUptos[intUptoStart]; // record byteStart

    newTerm(termID); // brand-new term: ends up in FreqProxTermsWriterPerField.newTerm()
  } else {
    termID = (-termID) - 1;
    int intStart = postingsArray.intStarts[termID];
    intUptos = intPool.buffers[intStart >> IntBlockPool.INT_BLOCK_SHIFT];
    intUptoStart = intStart & IntBlockPool.INT_BLOCK_MASK;
    addTerm(termID);
  }

  if (doNextCall) {
    // pass the term on to the next consumer in the chain (TermVectorsConsumerPerField),
    // which runs its own add() method
    nextPerField.add(postingsArray.textStarts[termID]);
  }
}
This involves postingsArray, intPool, and bytePool; the relationship between the three can be shown in one picture.

To summarize it as briefly as possible: postingsArray is a set of parallel arrays indexed by termID, one row per term. For each term, textStarts records the offset in bytePool where the term's text is stored, byteStarts records where the term's posting byte slices begin in bytePool, and intStarts records the term's offset into intPool. intPool and bytePool are shared by all fields within a DWPT, while each field has its own postingsArray. The data that is ultimately written to a segment comes from bytePool and intPool, which is why access to these two buffers must be serial, and Lucene's execution flow guarantees exactly that.

For a detailed explanation, see this article: http://blog.csdn.net/liweisnake/article/details/11364597

The relationships here are hard to grasp at first; a diagram borrowed from that article:
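To make textStart concrete, here is a deliberately miniature stand-in (all names are hypothetical, and a plain HashMap replaces the real open-addressing bytesHash) that interns length-prefixed term bytes into one shared pool and records, per termID, where each term starts:

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

public class MiniTermPool {
    byte[] pool = new byte[1 << 16];   // stands in for the shared bytePool
    int poolUpto = 0;                  // next free position in the pool
    int[] textStarts = new int[16];    // stands in for postingsArray.textStarts
    int count = 0;                     // next termID to hand out
    Map<String, Integer> ids = new HashMap<>(); // stands in for bytesHash

    // Intern a term: store "length, bytes" in the pool once, record where it
    // starts, and return the new termID; if the term was already seen,
    // return -(termID + 1), the same convention BytesRefHash uses.
    int add(String term) {
        Integer existing = ids.get(term);
        if (existing != null) return -(existing + 1);
        byte[] bytes = term.getBytes(StandardCharsets.UTF_8);
        int termID = count++;
        textStarts[termID] = poolUpto;
        pool[poolUpto++] = (byte) bytes.length; // 1-byte length prefix (assumes terms < 128 bytes)
        System.arraycopy(bytes, 0, pool, poolUpto, bytes.length);
        poolUpto += bytes.length;
        ids.put(term, termID);
        return termID;
    }

    // Recover a term from its textStart, the way a reader of the pool would.
    String termAt(int termID) {
        int start = textStarts[termID];
        int len = pool[start];
        return new String(pool, start + 1, len, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        MiniTermPool p = new MiniTermPool();
        System.out.println(p.add("lucene")); // 0  (new term)
        System.out.println(p.add("index"));  // 1  (new term)
        System.out.println(p.add("lucene")); // -1, i.e. -(0 + 1): already present
        System.out.println(p.termAt(1));     // index
    }
}
```

The real structure differs in the parts this sketch glosses over (growable buffers, the hash living directly over pool bytes, the 2-byte prefix for long terms), but the core idea is the same: term bytes live once in a shared pool, and per-field arrays hold nothing but offsets into it.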
Continuing with the code above: addTerm() and newTerm() do the actual writing into bytePool and intPool, and newTerm() dispatches in turn to the newTerm() methods of FreqProxTermsWriterPerField and TermVectorsConsumerPerField. Let's first look at FreqProxTermsWriterPerField.newTerm():
void newTerm(final int termID) {
  // First time we're seeing this term since the last
  // flush
  final FreqProxPostingsArray postings = freqProxPostingsArray;

  postings.lastDocIDs[termID] = docState.docID;
  if (!hasFreq) {
    assert postings.termFreqs == null;
    postings.lastDocCodes[termID] = docState.docID;
  } else {
    postings.lastDocCodes[termID] = docState.docID << 1;
    postings.termFreqs[termID] = 1;
    if (hasProx) {
      writeProx(termID, fieldState.position); // write the position information into bytePool
      if (hasOffsets) {
        writeOffsets(termID, fieldState.offset); // write the offsets into bytePool
      }
    } else {
      assert !hasOffsets;
    }
  }
  fieldState.maxTermFrequency = Math.max(1, fieldState.maxTermFrequency);
  fieldState.uniqueTermCount++;
}
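The line postings.lastDocCodes[termID] = docState.docID << 1 prepares the doc-delta encoding used when the postings are later serialized: the docID delta is shifted left one bit, and the low bit, when set, means "the frequency is 1 and no separate freq value follows". Here is a small standalone sketch of that convention (the class and method names are made up, and int lists stand in for the real variable-length byte encoding):

```java
import java.util.Arrays;
import java.util.List;

public class DocCodeDemo {
    // Encode one posting entry: the delta to the previous docID, shifted left
    // one bit. If freq == 1 the low bit is set and no frequency value is
    // written; otherwise the low bit is clear and freq follows as a second value.
    static List<Integer> encode(int lastDocID, int docID, int freq) {
        int delta = docID - lastDocID;
        if (freq == 1) {
            return List.of((delta << 1) | 1);
        }
        return List.of(delta << 1, freq);
    }

    // Reverse the encoding: recover the absolute docID and the frequency.
    static int[] decode(List<Integer> code, int lastDocID) {
        int docCode = code.get(0);
        int docID = lastDocID + (docCode >>> 1);
        int freq = (docCode & 1) != 0 ? 1 : code.get(1);
        return new int[] { docID, freq };
    }

    public static void main(String[] args) {
        List<Integer> c = encode(3, 7, 1);                  // freq 1: single value with low bit set
        System.out.println(c);                              // [9]
        System.out.println(Arrays.toString(decode(c, 3)));  // [7, 1]

        c = encode(7, 12, 4);                               // freq > 1: two values
        System.out.println(c);                              // [10, 4]
        System.out.println(Arrays.toString(decode(c, 7)));  // [12, 4]
    }
}
```

Since most terms occur once per document, the low-bit trick saves one value per posting in the common case, which adds up across a large index.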
TermVectorsConsumerPerField.newTerm() follows a similar process and also writes into intPool and bytePool, but the two newTerm() methods write different content: the former writes term frequencies and position information, the latter writes the term's length and the term bytes. Going back to the add() function above, it calls int termID = bytesHash.add(termAtt.getBytesRef()): if the term already exists this returns a negative number, namely -(termID + 1); if it does not exist, the term's length and bytes are written into bytePool. The code is as follows:
public int add(BytesRef bytes) {
  assert bytesStart != null : "Bytesstart is null - not initialized";
  final int length = bytes.length;
  // final position
  final int hashPos = findHash(bytes);
  int e = ids[hashPos];

  if (e == -1) {
    // new entry
    final int len2 = 2 + bytes.length;
    if (len2 + pool.byteUpto > BYTE_BLOCK_SIZE) {
      if (len2 > BYTE_BLOCK_SIZE) {
        throw new MaxBytesLengthExceededException("bytes can be at most "
            + (BYTE_BLOCK_SIZE - 2) + " in length; got " + bytes.length);
      }
      pool.nextBuffer();
    }
    final byte[] buffer = pool.buffer;
    final int bufferUpto = pool.byteUpto;
    if (count >= bytesStart.length) {
      bytesStart = bytesStartArray.grow();
      assert count < bytesStart.length + 1 : "count: " + count + " len: "
          + bytesStart.length;
    }
    e = count++;

    bytesStart[e] = bufferUpto + pool.byteOffset; // record where this term's bytes start (the term's textStart)

    // We first encode the length, followed by the
    // bytes. Length is encoded as vInt, but will consume
    // 1 or 2 bytes at most (we reject too-long terms,
    // above).
    if (length < 128) {
      // 1 byte to store length
      buffer[bufferUpto] = (byte) length; // the term's length
      pool.byteUpto += length + 1;
      assert length >= 0 : "Length must be positive: " + length;
      System.arraycopy(bytes.bytes, bytes.offset, buffer, bufferUpto + 1,
          length); // the term's bytes
    } else {
      // 2 byte to store length
      buffer[bufferUpto] = (byte) (0x80 | (length & 0x7f));
      buffer[bufferUpto + 1] = (byte) ((length >> 7) & 0xff);
      pool.byteUpto += length + 2;
      System.arraycopy(bytes.bytes, bytes.offset, buffer, bufferUpto + 2,
          length);
    }
    assert ids[hashPos] == -1;
    ids[hashPos] = e; // store the termID in the hash table

    if (count == hashHalfSize) {
      rehash(2 * hashSize, true);
    }
    return e; // return the new termID
  }
  return -(e + 1); // term already exists: return -(termID + 1)
}
At this point we can see how the inverted information gets written. How it maps onto the actual index files on disk will be analyzed in the next article.

If anything here is inaccurate, corrections are welcome.