The previous article traced the indexing path from IndexWriter to DefaultIndexChain and outlined the basic flow of DefaultIndexChain, which is essentially:

each DWPT processes the documents it receives one by one --> within a document, each Field is processed in turn --> each Field is handled differently depending on whether it is tokenized, whether it is stored, and whether it has doc values.

So DWPTs work in parallel with each other, while the work inside a single DWPT is serial.

Processing a field concretely means writing information into a TermsHashPerField, and all of these are managed by a single TermsHash. You can think of TermsHash as a buffer shared within one DWPT: it is used to build the index in memory, and it is flushed to disk when a flush is needed. Each DWPT, once flushed, effectively becomes one segment; depending on the configured segment size, segment merges may follow. That is a story for later, and we will dedicate a future article to flushing. In this article we continue from the previous one and analyze pf.invert(), whose code is shown below:
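As a rough, plain-Java illustration of this model (the class and method names here are invented for the example, not Lucene's), each "DWPT" can be pictured as a worker that owns a private in-memory inverted map, indexes its batch of documents serially, and would emit that map as its own segment on flush:

```java
import java.util.*;
import java.util.concurrent.*;

public class DwptSketch {
    // Each "DWPT" owns a private term -> docIDs map; no locking is needed
    // while indexing, because nothing is shared between writer threads.
    static Map<String, List<Integer>> indexDocs(List<String> docs, int baseDocId) {
        Map<String, List<Integer>> inverted = new TreeMap<>();
        for (int i = 0; i < docs.size(); i++) {               // serial within the DWPT
            for (String term : docs.get(i).split("\\s+")) {   // per-field/token processing
                inverted.computeIfAbsent(term, t -> new ArrayList<>()).add(baseDocId + i);
            }
        }
        return inverted; // on flush, this buffer would be written out as one segment
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        // Two "DWPTs" index disjoint document batches in parallel.
        Future<Map<String, List<Integer>>> seg1 =
            pool.submit(() -> indexDocs(List.of("a b", "b c"), 0));
        Future<Map<String, List<Integer>>> seg2 =
            pool.submit(() -> indexDocs(List.of("c d"), 2));
        System.out.println("segment1=" + seg1.get()); // segment1={a=[0], b=[0, 1], c=[1]}
        System.out.println("segment2=" + seg2.get()); // segment2={c=[2], d=[2]}
        pool.shutdown();
    }
}
```

The real TermsHash buffers are far more compact than a map of lists, but the concurrency shape is the same: parallelism comes only from having multiple independent buffers, never from sharing one.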
public void invert(IndexableField field, boolean first) throws IOException, AbortingException {
  if (first) {
    // First time we're seeing this field (indexed) in
    // this document:
    invertState.reset(); // reset the per-field state on first use
  }

  IndexableFieldType fieldType = field.fieldType();
  IndexOptions indexOptions = fieldType.indexOptions();
  fieldInfo.setIndexOptions(indexOptions);

  if (fieldType.omitNorms()) {
    fieldInfo.setOmitsNorms();
  }

  final boolean analyzed = fieldType.tokenized() && docState.analyzer != null;

  // only bother checking offsets if something will consume them.
  // TODO: after we fix analyzers, also check if termVectorOffsets will be indexed.
  final boolean checkOffsets = indexOptions == IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS;

  /*
   * To assist people in tracking down problems in analysis components, we wish to write the field name to the
   * infostream when we fail. We expect some caller to eventually deal with the real exception, so we don't want any
   * 'catch' clauses, but rather a finally that takes note of the problem.
   */
  boolean succeededInProcessingField = false;
  try (TokenStream stream = tokenStream = field.tokenStream(docState.analyzer, tokenStream)) { // get the token stream of the field's content
    // reset the TokenStream to the first token
    stream.reset();
    invertState.setAttributeSource(stream);
    termsHashPerField.start(field, first);

    while (stream.incrementToken()) { // process every term the field's content is split into
      // If we hit an exception in stream.next below
      // (which is fairly common, e.g. if analyzer
      // chokes on a given document), then it's
      // non-aborting and (above) this one document
      // will be marked as deleted, but still
      // consume a docID

      int posIncr = invertState.posIncrAttribute.getPositionIncrement();
      invertState.position += posIncr;
      if (invertState.position < invertState.lastPosition) {
        if (posIncr == 0) {
          throw new IllegalArgumentException("first position increment must be > 0 (got 0) for field '"
              + field.name() + "'");
        } else {
          throw new IllegalArgumentException("position increments (and gaps) must be >= 0 (got " + posIncr
              + ") for field '" + field.name() + "'");
        }
      } else if (invertState.position > IndexWriter.MAX_POSITION) {
        throw new IllegalArgumentException("position " + invertState.position + " is too large for field '"
            + field.name() + "': max allowed position is " + IndexWriter.MAX_POSITION);
      }
      invertState.lastPosition = invertState.position;
      if (posIncr == 0) {
        invertState.numOverlap++;
      }

      if (checkOffsets) {
        int startOffset = invertState.offset + invertState.offsetAttribute.startOffset();
        int endOffset = invertState.offset + invertState.offsetAttribute.endOffset();
        if (startOffset < invertState.lastStartOffset || endOffset < startOffset) {
          throw new IllegalArgumentException(
              "startOffset must be non-negative, and endOffset must be >= startOffset, and offsets must not go backwards "
                  + "startOffset=" + startOffset + ",endOffset=" + endOffset + ",lastStartOffset="
                  + invertState.lastStartOffset + " for field '" + field.name() + "'");
        }
        invertState.lastStartOffset = startOffset;
      }

      invertState.length++;
      if (invertState.length < 0) {
        throw new IllegalArgumentException("too many tokens in field '" + field.name() + "'");
      }
      // System.out.println("  term=" + invertState.termAttribute);

      // If we hit an exception in here, we abort
      // all buffered documents since the last
      // flush, on the likelihood that the
      // internal state of the terms hash is now
      // corrupt and should not be flushed to a
      // new segment:
      try {
        // add the term to the in-memory index; there are two consumers in the chain:
        // FreqProxTermsWriterPerField and TermVectorsConsumerPerField
        termsHashPerField.add();
      } catch (MaxBytesLengthExceededException e) {
        byte[] prefix = new byte[30];
        BytesRef bigTerm = invertState.termAttribute.getBytesRef();
        System.arraycopy(bigTerm.bytes, bigTerm.offset, prefix, 0, 30);
        String msg = "Document contains at least one immense term in field=\""
            + fieldInfo.name
            + "\" (whose UTF8 encoding is longer than the max length "
            + DocumentsWriterPerThread.MAX_TERM_LENGTH_UTF8
            + "), all of which were skipped. Please correct the analyzer to not produce such terms. The prefix of the first immense term is: '"
            + Arrays.toString(prefix) + "...', original message: " + e.getMessage();
        if (docState.infoStream.isEnabled("IW")) {
          docState.infoStream.message("IW", "ERROR: " + msg);
        }
        // Document will be deleted above:
        throw new IllegalArgumentException(msg, e);
      } catch (Throwable th) {
        throw AbortingException.wrap(th);
      }
    }

    // trigger streams to perform end-of-stream operations
    stream.end(); // the token stream is exhausted

    // TODO: maybe add some safety? then again, it's already checked
    // when we come back around to the field...
    invertState.position += invertState.posIncrAttribute.getPositionIncrement();
    invertState.offset += invertState.offsetAttribute.endOffset();

    /* if there is an exception coming through, we won't set this to true here: */
    succeededInProcessingField = true;
  } finally {
    if (!succeededInProcessingField && docState.infoStream.isEnabled("DW")) {
      docState.infoStream.message("DW", "An exception was thrown while processing field " + fieldInfo.name);
    }
  }

  if (analyzed) {
    invertState.position += docState.analyzer.getPositionIncrementGap(fieldInfo.name);
    invertState.offset += docState.analyzer.getOffsetGap(fieldInfo.name);
  }

  invertState.boost *= field.boost();
}
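The position bookkeeping in the loop above can be exercised on its own. The following standalone sketch (a hypothetical class, not Lucene code) mirrors the posIncr accumulation and overlap counting, assuming the position starts at -1 the way FieldInvertState.reset() leaves it:

```java
public class PositionDemo {
    int position = -1;   // mirrors invertState.position before the first token
    int numOverlap = 0;  // tokens whose positionIncrement == 0 (e.g. injected synonyms)

    // Mirrors the bookkeeping in invert(): each token advances the position by
    // its increment; an increment of 0 stacks the token on the previous one.
    void accept(int posIncr) {
        if (posIncr < 0) {
            throw new IllegalArgumentException("position increments must be >= 0");
        }
        position += posIncr;
        if (posIncr == 0) {
            numOverlap++;
        }
    }

    public static void main(String[] args) {
        PositionDemo d = new PositionDemo();
        // "quick" (+1), synonym "fast" (+0, same position), "fox" (+1)
        d.accept(1);
        d.accept(0);
        d.accept(1);
        System.out.println("lastPosition=" + d.position + " overlaps=" + d.numOverlap);
        // prints: lastPosition=1 overlaps=1
    }
}
```

This is why a synonym token does not consume a position: "fast" and "quick" both sit at position 0, and phrase queries treat them as interchangeable at that slot.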
The termsHashPerField.add() method is the important one: it takes each token and fills the in-memory buffers. The code is as follows:
void add() throws IOException {
  // We are first in the chain so we must "intern" the
  // term text into textStart address

  // Get the text & hash of this term; returns termID >= 0 if the term is new,
  // or a negative value, -(termID + 1), if it has been seen before
  int termID = bytesHash.add(termAtt.getBytesRef());
  // System.out.println("add term=" + termBytesRef.utf8ToString() + " doc=" + docState.docID + " termID=" + termID);

  if (termID >= 0) { // New posting
    bytesHash.byteStart(termID);
    // Init stream slices
    if (numPostingInt + intPool.intUpto > IntBlockPool.INT_BLOCK_SIZE) {
      intPool.nextBuffer(); // allocate a new buffer in intPool
    }

    if (ByteBlockPool.BYTE_BLOCK_SIZE - bytePool.byteUpto < numPostingInt * ByteBlockPool.FIRST_LEVEL_SIZE) {
      bytePool.nextBuffer(); // allocate a new buffer in bytePool
    }

    intUptos = intPool.buffer;
    intUptoStart = intPool.intUpto;
    intPool.intUpto += streamCount;

    postingsArray.intStarts[termID] = intUptoStart + intPool.intOffset; // record intStart

    for (int i = 0; i < streamCount; i++) {
      final int upto = bytePool.newSlice(ByteBlockPool.FIRST_LEVEL_SIZE);
      intUptos[intUptoStart + i] = upto + bytePool.byteOffset;
    }
    postingsArray.byteStarts[termID] = intUptos[intUptoStart]; // record byteStart

    newTerm(termID); // brand-new term: ends up in FreqProxTermsWriterPerField.newTerm()
  } else {
    termID = (-termID) - 1;
    int intStart = postingsArray.intStarts[termID];
    intUptos = intPool.buffers[intStart >> IntBlockPool.INT_BLOCK_SHIFT];
    intUptoStart = intStart & IntBlockPool.INT_BLOCK_MASK;
    addTerm(termID);
  }

  if (doNextCall) {
    // pass the term on to the next consumer in the chain (TermVectorsConsumerPerField),
    // which runs its own add() method
    nextPerField.add(postingsArray.textStarts[termID]);
  }
}
This involves postingsArray, intPool, and bytePool; the relationship between the three can be shown in one picture.

To summarize it as briefly as possible: postingsArray is a set of parallel arrays indexed by termID, one row per term. For each term, textStarts records the offset in bytePool where the term's text is stored, byteStarts records where the term's posting byte slices begin in bytePool, and intStarts records the term's offset into intPool. intPool and bytePool are shared by all fields within a DWPT, while each field has its own postingsArray. The data that is ultimately written to a segment comes from bytePool and intPool, which is why access to these two buffers must be serial, and Lucene's execution flow guarantees exactly that.

For a detailed explanation, see this article: http://blog.csdn.net/liweisnake/article/details/11364597

The relationships here are hard to grasp at first; a diagram borrowed from that article:
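To make textStart concrete, here is a deliberately miniature stand-in (all names are hypothetical, and a plain HashMap replaces the real open-addressing bytesHash) that interns length-prefixed term bytes into one shared pool and records, per termID, where each term starts:

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

public class MiniTermPool {
    byte[] pool = new byte[1 << 16];   // stands in for the shared bytePool
    int poolUpto = 0;                  // next free position in the pool
    int[] textStarts = new int[16];    // stands in for postingsArray.textStarts
    int count = 0;                     // next termID to hand out
    Map<String, Integer> ids = new HashMap<>(); // stands in for bytesHash

    // Intern a term: store "length, bytes" in the pool once, record where it
    // starts, and return the new termID; if the term was already seen,
    // return -(termID + 1), the same convention BytesRefHash uses.
    int add(String term) {
        Integer existing = ids.get(term);
        if (existing != null) return -(existing + 1);
        byte[] bytes = term.getBytes(StandardCharsets.UTF_8);
        int termID = count++;
        textStarts[termID] = poolUpto;
        pool[poolUpto++] = (byte) bytes.length; // 1-byte length prefix (assumes terms < 128 bytes)
        System.arraycopy(bytes, 0, pool, poolUpto, bytes.length);
        poolUpto += bytes.length;
        ids.put(term, termID);
        return termID;
    }

    // Recover a term from its textStart, the way a reader of the pool would.
    String termAt(int termID) {
        int start = textStarts[termID];
        int len = pool[start];
        return new String(pool, start + 1, len, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        MiniTermPool p = new MiniTermPool();
        System.out.println(p.add("lucene")); // 0  (new term)
        System.out.println(p.add("index"));  // 1  (new term)
        System.out.println(p.add("lucene")); // -1, i.e. -(0 + 1): already present
        System.out.println(p.termAt(1));     // index
    }
}
```

The real structure differs in the parts this sketch glosses over (growable buffers, the hash living directly over pool bytes, the 2-byte prefix for long terms), but the core idea is the same: term bytes live once in a shared pool, and per-field arrays hold nothing but offsets into it.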
Continuing with the code above: addTerm() and newTerm() do the actual writing into bytePool and intPool, and newTerm() dispatches in turn to the newTerm() methods of FreqProxTermsWriterPerField and TermVectorsConsumerPerField. Let's first look at FreqProxTermsWriterPerField.newTerm():
void newTerm(final int termID) {
  // First time we're seeing this term since the last
  // flush
  final FreqProxPostingsArray postings = freqProxPostingsArray;

  postings.lastDocIDs[termID] = docState.docID;
  if (!hasFreq) {
    assert postings.termFreqs == null;
    postings.lastDocCodes[termID] = docState.docID;
  } else {
    postings.lastDocCodes[termID] = docState.docID << 1;
    postings.termFreqs[termID] = 1;
    if (hasProx) {
      writeProx(termID, fieldState.position); // write the position information into bytePool
      if (hasOffsets) {
        writeOffsets(termID, fieldState.offset); // write the offsets into bytePool
      }
    } else {
      assert !hasOffsets;
    }
  }
  fieldState.maxTermFrequency = Math.max(1, fieldState.maxTermFrequency);
  fieldState.uniqueTermCount++;
}
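The line postings.lastDocCodes[termID] = docState.docID << 1 prepares the doc-delta encoding used when the postings are later serialized: the docID delta is shifted left one bit, and the low bit, when set, means "the frequency is 1 and no separate freq value follows". Here is a small standalone sketch of that convention (the class and method names are made up, and int lists stand in for the real variable-length byte encoding):

```java
import java.util.Arrays;
import java.util.List;

public class DocCodeDemo {
    // Encode one posting entry: the delta to the previous docID, shifted left
    // one bit. If freq == 1 the low bit is set and no frequency value is
    // written; otherwise the low bit is clear and freq follows as a second value.
    static List<Integer> encode(int lastDocID, int docID, int freq) {
        int delta = docID - lastDocID;
        if (freq == 1) {
            return List.of((delta << 1) | 1);
        }
        return List.of(delta << 1, freq);
    }

    // Reverse the encoding: recover the absolute docID and the frequency.
    static int[] decode(List<Integer> code, int lastDocID) {
        int docCode = code.get(0);
        int docID = lastDocID + (docCode >>> 1);
        int freq = (docCode & 1) != 0 ? 1 : code.get(1);
        return new int[] { docID, freq };
    }

    public static void main(String[] args) {
        List<Integer> c = encode(3, 7, 1);                  // freq 1: single value with low bit set
        System.out.println(c);                              // [9]
        System.out.println(Arrays.toString(decode(c, 3)));  // [7, 1]

        c = encode(7, 12, 4);                               // freq > 1: two values
        System.out.println(c);                              // [10, 4]
        System.out.println(Arrays.toString(decode(c, 7)));  // [12, 4]
    }
}
```

Since most terms occur once per document, the low-bit trick saves one value per posting in the common case, which adds up across a large index.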
TermVectorsConsumerPerField.newTerm() follows a similar process and also writes into intPool and bytePool, but the two newTerm() methods write different content: the former writes term frequencies and position information, the latter writes the term's length and the term bytes. Going back to the add() function above, it calls int termID = bytesHash.add(termAtt.getBytesRef()): if the term already exists this returns a negative number, namely -(termID + 1); if it does not exist, the term's length and bytes are written into bytePool. The code is as follows:
public int add(BytesRef bytes) {
  assert bytesStart != null : "Bytesstart is null - not initialized";
  final int length = bytes.length;
  // final position
  final int hashPos = findHash(bytes);
  int e = ids[hashPos];

  if (e == -1) {
    // new entry
    final int len2 = 2 + bytes.length;
    if (len2 + pool.byteUpto > BYTE_BLOCK_SIZE) {
      if (len2 > BYTE_BLOCK_SIZE) {
        throw new MaxBytesLengthExceededException("bytes can be at most "
            + (BYTE_BLOCK_SIZE - 2) + " in length; got " + bytes.length);
      }
      pool.nextBuffer();
    }
    final byte[] buffer = pool.buffer;
    final int bufferUpto = pool.byteUpto;
    if (count >= bytesStart.length) {
      bytesStart = bytesStartArray.grow();
      assert count < bytesStart.length + 1 : "count: " + count + " len: "
          + bytesStart.length;
    }
    e = count++;

    bytesStart[e] = bufferUpto + pool.byteOffset; // record where this term's bytes start (the term's textStart)

    // We first encode the length, followed by the
    // bytes. Length is encoded as vInt, but will consume
    // 1 or 2 bytes at most (we reject too-long terms,
    // above).
    if (length < 128) {
      // 1 byte to store length
      buffer[bufferUpto] = (byte) length; // the term's length
      pool.byteUpto += length + 1;
      assert length >= 0 : "Length must be positive: " + length;
      System.arraycopy(bytes.bytes, bytes.offset, buffer, bufferUpto + 1,
          length); // the term's bytes
    } else {
      // 2 byte to store length
      buffer[bufferUpto] = (byte) (0x80 | (length & 0x7f));
      buffer[bufferUpto + 1] = (byte) ((length >> 7) & 0xff);
      pool.byteUpto += length + 2;
      System.arraycopy(bytes.bytes, bytes.offset, buffer, bufferUpto + 2,
          length);
    }
    assert ids[hashPos] == -1;
    ids[hashPos] = e; // store the termID in the hash table

    if (count == hashHalfSize) {
      rehash(2 * hashSize, true);
    }
    return e; // return the new termID
  }
  return -(e + 1); // term already exists: return -(termID + 1)
}
At this point we can see how the inverted information gets written. How it maps onto the actual index files on disk will be analyzed in the next article.

If anything here is inaccurate, corrections are welcome.