Lucene Study Notes, Part 4: Analysis of the Lucene Indexing Process (1)

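Besides writing terms into the inverted lists and finally into Lucene's index files, the Lucene indexing process also includes tokenization (Analyzer) and merging segments. This article does not cover those two parts; they will be analyzed in later articles.

Many blogs and articles describe the Lucene indexing process. I recommend searching online for "Annotated Lucene" (its Chinese title is, I believe, 《Lucene源碼剖析》), which is very good.

The best way to really understand how Lucene writes its index files is to step through the code in a debugger while reading an article alongside it. That gives you the most detailed and accurate picture of the indexing process (descriptions always drift from reality, whereas the code does not lie), and it also lets you study some of Lucene's well-crafted implementations and reuse those ideas in your own work. Lucene is, after all, one of the better open-source projects.

Since Lucene has been upgraded to 3.0.0, the indexing process described here is that of Lucene 3.0.0.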

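I. Architecture of the Indexing Process

Indexing in Lucene 3.0 is a rather complex process. The various pieces of information are analyzed, processed, and written by different objects. To support multithreading, each thread creates its own set of similarly structured objects; to improve efficiency, some of these object sets are reused, which makes the indexing process more complex still.

In essence, indexing means passing through the indexing chain shown in the figure below. Each node of the chain is responsible for a different part of the information in the indexed document, and once every node of the chain has been visited, the document has been fully processed. We call this initial chain the basic indexing chain.

To support multithreading, so that several threads can process documents concurrently, each thread builds its own indexing chain and can therefore work independently. We call these per-thread chains, built on top of the basic indexing chain, thread indexing chains. Each node of a thread indexing chain is created by calling addThread on the corresponding node of the basic indexing chain.

To improve efficiency, and given that processing the same field follows a similar course and needs roughly the same caches, there is no need for a thread to recreate a whole series of objects for every document; the objects are reused instead. So each field also gets its own indexing chain, which we call the field indexing chain. Each node of a field indexing chain is created by calling addField on the corresponding node of the thread indexing chain.

Once documents have been processed, all the information has to be written to the index files. Writing to the index files is synchronized rather than multithreaded, and it too proceeds along the basic indexing chain, writing each part of the information to the index files in turn.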
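
The three levels described above can be pictured with a small sketch. This is only an illustration under simplifying assumptions, not Lucene's source: the classes `BasicNode`, `ThreadNode`, and `FieldNode` are hypothetical, the per-thread lookup by name is invented for the example, and a simple counter stands in for the real per-field state.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical, simplified model of ONE node of the indexing chain at its three levels.
// None of these classes exist in Lucene.

/** Basic chain node: shared by the whole writer. */
class BasicNode {
    private final Map<String, ThreadNode> threadNodes = new HashMap<String, ThreadNode>();

    /** Returns the per-thread node for the given thread, creating it on first use. */
    synchronized ThreadNode addThread(String threadName) {
        ThreadNode node = threadNodes.get(threadName);
        if (node == null) {
            node = new ThreadNode(threadName);
            threadNodes.put(threadName, node);
        }
        return node;
    }
}

/** Thread chain node: used by a single indexing thread. */
class ThreadNode {
    final String threadName;
    private final Map<String, FieldNode> fieldNodes = new HashMap<String, FieldNode>();

    ThreadNode(String threadName) { this.threadName = threadName; }

    /** Returns the per-field node for the given field, reusing it across documents. */
    FieldNode addField(String fieldName) {
        FieldNode node = fieldNodes.get(fieldName);
        if (node == null) {
            node = new FieldNode(fieldName);
            fieldNodes.put(fieldName, node);
        }
        return node;
    }
}

/** Field chain node: keeps per-field state (here just a counter) between documents. */
class FieldNode {
    final String fieldName;
    int valuesProcessed;

    FieldNode(String fieldName) { this.fieldName = fieldName; }

    void process(String value) { valuesProcessed++; }
}

class IndexChainSketch {
    public static void main(String[] args) {
        BasicNode basic = new BasicNode();
        ThreadNode t1 = basic.addThread("thread-1");
        // Two documents handled by the same thread share one "contents" field node.
        t1.addField("contents").process("first document");
        t1.addField("contents").process("second document");
        System.out.println(t1.addField("contents").valuesProcessed); // prints 2
    }
}
```

The point of the sketch is the reuse: the field node is created once per thread and field, so whatever buffers it holds survive from one document to the next.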

Let us now analyze this process in detail.

II. The Indexing Process in Detail

1. Creating the IndexWriter object

Code:

IndexWriter writer = new IndexWriter(FSDirectory.open(INDEX_DIR), new StandardAnalyzer(Version.LUCENE_CURRENT), true, IndexWriter.MaxFieldLength.LIMITED);

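An IndexWriter object mainly holds the following information:

Used for indexing documents
- Directory directory; points to the index folder.
- Analyzer analyzer; the analyzer (tokenizer).
- Similarity similarity = Similarity.getDefault(); affects the normalization factor part of scoring. A document's score has two parts: one is computed at index time and is independent of the query, the other is computed at search time and depends on the query.
- SegmentInfos segmentInfos = new SegmentInfos(); holds the segment information. You will notice that it corresponds almost one-to-one to the information in segments_N.
- IndexFileDeleter deleter; this object is not used to delete documents, but to manage the index files.
- Lock writeLock; only one IndexWriter may be open on an index folder at a time, hence the lock.
- Set segmentsToOptimize = new HashSet(); holds the segments that are being optimized. When optimize is called, all current segments are added to this Set; segments created afterwards do not take part in that optimization.
Used for merging segments (described in detail in the article on segment merging)
- SegmentInfos localRollbackSegmentInfos;
- HashSet mergingSegments = new HashSet();
- MergePolicy mergePolicy = new LogByteSizeMergePolicy(this);
- MergeScheduler mergeScheduler = new ConcurrentMergeScheduler();
- LinkedList pendingMerges = new LinkedList();
- Set runningMerges = new HashSet();
- List mergeExceptions = new ArrayList();
- long mergeGen;
Used to keep the index complete, consistent, and transactional
- SegmentInfos rollbackSegmentInfos; after the IndexWriter has added or deleted documents, you can call commit to write the changes to the files, or call rollback to discard every change made since the last commit.
- SegmentInfos localRollbackSegmentInfos; used mainly when other index folders are merged into this one. It holds the original segment information so that the operation can be rolled back if the merge fails halfway.
Some configuration
- long writeLockTimeout; the timeout for obtaining the write lock. If it expires, the index folder is already open in another IndexWriter.
- int termIndexInterval; the same as the indexInterval in the .tii and .tis files.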
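
The commit/rollback behaviour described for rollbackSegmentInfos above can be pictured as keeping a snapshot of the segment list taken at the last commit. The following is a minimal sketch of that idea only; the class `SegmentListSketch` and its methods are hypothetical and are not part of Lucene. As far as I know, the real IndexWriter.rollback() also closes the writer and cleans up files, none of which is modelled here.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: a live segment list plus a snapshot taken at the last commit.
class SegmentListSketch {
    private List<String> segments = new ArrayList<String>();         // live, uncommitted view
    private List<String> rollbackSegments = new ArrayList<String>(); // state at the last commit

    void addSegment(String name) { segments.add(name); }

    /** Makes the current state the new rollback point. */
    void commit() { rollbackSegments = new ArrayList<String>(segments); }

    /** Discards everything added since the last commit. */
    void rollback() { segments = new ArrayList<String>(rollbackSegments); }

    List<String> current() { return new ArrayList<String>(segments); }
}

class SegmentListDemo {
    public static void main(String[] args) {
        SegmentListSketch infos = new SegmentListSketch();
        infos.addSegment("_0");
        infos.commit();
        infos.addSegment("_1");   // not yet committed
        infos.rollback();         // back to the state of the last commit
        System.out.println(infos.current()); // prints [_0]
    }
}
```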

About the information held by the SegmentInfos object:

For an index folder like the one below, the SegmentInfos object looks as follows:

segmentInfos    SegmentInfos (id=37)
    capacityIncrement    0
    counter    3
    elementCount    3
    elementData    Object[10] (id=68)
        [0]    SegmentInfo (id=166)
            delCount    0
            delGen    -1
            diagnostics    HashMap (id=170)
            dir    SimpleFSDirectory (id=171)
            docCount    2
            docStoreIsCompoundFile    false
            docStoreOffset    -1
            docStoreSegment    null
            files    ArrayList (id=173)
            hasProx    true
            hasSingleNormFile    true
            isCompoundFile    1
            name    "_0"
            normGen    null
            preLockless    false
            sizeInBytes    635
        [1]    SegmentInfo (id=168)
            delCount    0
            delGen    -1
            diagnostics    HashMap (id=177)
            dir    SimpleFSDirectory (id=171)
            docCount    2
            docStoreIsCompoundFile    false
            docStoreOffset    -1
            docStoreSegment    null
            files    ArrayList (id=178)
            hasProx    true
            hasSingleNormFile    true
            isCompoundFile    1
            name    "_1"
            normGen    null
            preLockless    false
            sizeInBytes    635
        [2]    SegmentInfo (id=169)
            delCount    0
            delGen    -1
            diagnostics    HashMap (id=180)
            dir    SimpleFSDirectory (id=171)
            docCount    2
            docStoreIsCompoundFile    false
            docStoreOffset    -1
            docStoreSegment    null
            files    ArrayList (id=214)
            hasProx    true
            hasSingleNormFile    true
            isCompoundFile    1
            name    "_2"
            normGen    null
            preLockless    false
            sizeInBytes    635
    generation    4
    lastGeneration    4
    modCount    3
    pendingSegnOutput    null
    userData    HashMap (id=146)
    version    1263044890832

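About IndexFileDeleter:

It is not used to delete documents; it manages the index files.
Adding and deleting documents and merging segments produce many new files and require old files to be deleted, so the files need to be managed.
However, a file that is due for deletion may still be in use, so a reference count is kept for each file, and the file is deleted only when its count drops to zero.
The following example shows how IndexFileDeleter reference-counts files as they are added and deleted.

(1) When the IndexWriter is created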

IndexWriter writer = new IndexWriter(FSDirectory.open(indexDir), new StandardAnalyzer(Version.LUCENE_CURRENT), true, IndexWriter.MaxFieldLength.LIMITED);
writer.setMergeFactor(3);

The index folder is as follows:

The reference counts are as follows:

refCounts    HashMap (id=101)
    size    1
    table    HashMap$Entry[16] (id=105)
        [8]    HashMap$Entry (id=110)
            key    "segments_1"
            value    IndexFileDeleter$RefCount (id=38)
                count    1

(2) When the first segment is added

indexDocs(writer, docDir);
writer.commit();

What is generated first is not a compound file,

so the reference counts are as follows:

refCounts    HashMap (id=101)
    size    9
    table    HashMap$Entry[16] (id=105)
        [1]    HashMap$Entry (id=129)
            key    "_0.tis"
            value    IndexFileDeleter$RefCount (id=138)
                count    1
        [3]    HashMap$Entry (id=130)
            key    "_0.fnm"
            value    IndexFileDeleter$RefCount (id=141)
                count    1
        [4]    HashMap$Entry (id=134)
            key    "_0.tii"
            value    IndexFileDeleter$RefCount (id=142)
                count    1
        [8]    HashMap$Entry (id=135)
            key    "_0.frq"
            value    IndexFileDeleter$RefCount (id=143)
                count    1
        [10]    HashMap$Entry (id=136)
            key    "_0.fdx"
            value    IndexFileDeleter$RefCount (id=144)
                count    1
        [13]    HashMap$Entry (id=139)
            key    "_0.prx"
            value    IndexFileDeleter$RefCount (id=145)
                count    1
        [14]    HashMap$Entry (id=140)
            key    "_0.fdt"
            value    IndexFileDeleter$RefCount (id=146)
                count    1

The files are then combined into a compound file, which is added to the reference counts:

refCounts    HashMap (id=101)
    size    10
    table    HashMap$Entry[16] (id=105)
        [1]    HashMap$Entry (id=129)
            key    "_0.tis"
            value    IndexFileDeleter$RefCount (id=138)
                count    1
        [2]    HashMap$Entry (id=154)
            key    "_0.cfs"
            value    IndexFileDeleter$RefCount (id=155)
                count    1
        [3]    HashMap$Entry (id=130)
            key    "_0.fnm"
            value    IndexFileDeleter$RefCount (id=141)
                count    1
        [4]    HashMap$Entry (id=134)
            key    "_0.tii"
            value    IndexFileDeleter$RefCount (id=142)
                count    1
        [8]    HashMap$Entry (id=135)
            key    "_0.frq"
            value    IndexFileDeleter$RefCount (id=143)
                count    1
        [10]    HashMap$Entry (id=136)
            key    "_0.fdx"
            value    IndexFileDeleter$RefCount (id=144)
                count    1
        [13]    HashMap$Entry (id=139)
            key    "_0.prx"
            value    IndexFileDeleter$RefCount (id=145)
                count    1
        [14]    HashMap$Entry (id=140)
            key    "_0.fdt"
            value    IndexFileDeleter$RefCount (id=146)
                count    1

IndexFileDeleter.decRef() is then used to delete the files [_0.nrm, _0.tis, _0.fnm, _0.tii, _0.frq, _0.fdx, _0.prx, _0.fdt]:

refCounts    HashMap (id=101)
    size    2
    table    HashMap$Entry[16] (id=105)
        [2]    HashMap$Entry (id=154)
            key    "_0.cfs"
            value    IndexFileDeleter$RefCount (id=155)
                count    1
        [8]    HashMap$Entry (id=110)
            key    "segments_1"
            value    IndexFileDeleter$RefCount (id=38)
                count    1

Next, a new segments_2 is created:

refCounts    HashMap (id=77)
    size    3
    table    HashMap$Entry[16] (id=84)
        [2]    HashMap$Entry (id=87)
            key    "_0.cfs"
            value    IndexFileDeleter$RefCount (id=91)
                count    3
        [8]    HashMap$Entry (id=89)
            key    "segments_1"
            value    IndexFileDeleter$RefCount (id=62)
                count    0
        [9]    HashMap$Entry (id=90)
            key    "segments_2"
            next    null
            value    IndexFileDeleter$RefCount (id=93)
                count    1

Then IndexFileDeleter.decRef() deletes the segments_1 file:

refCounts    HashMap (id=77)
    size    2
    table    HashMap$Entry[16] (id=84)
        [2]    HashMap$Entry (id=87)
            key    "_0.cfs"
            value    IndexFileDeleter$RefCount (id=91)
                count    2
        [9]    HashMap$Entry (id=90)
            key    "segments_2"
            value    IndexFileDeleter$RefCount (id=93)
                count    1

(3) Adding the second segment

indexDocs(writer, docDir);
writer.commit();

(4) Adding the third segment. Since the MergeFactor is 3, this triggers a segment merge.

indexDocs(writer, docDir);
writer.commit();

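First, just as for the other segments, _2.cfs and segments_4 are generated.

At the same time, a thread is created to merge the segments in the background (ConcurrentMergeScheduler$MergeThread.run()).

The reference counts at this point are as follows: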

refCounts    HashMap (id=84)
    size    5
    table    HashMap$Entry[16] (id=98)
        [2]    HashMap$Entry (id=112)
            key    "_0.cfs"
            value    IndexFileDeleter$RefCount (id=117)
                count    1
        [4]    HashMap$Entry (id=113)
            key    "_3.cfs"
            value    IndexFileDeleter$RefCount (id=118)
                count    1
        [12]    HashMap$Entry (id=114)
            key    "_1.cfs"
            value    IndexFileDeleter$RefCount (id=119)
                count    1
        [13]    HashMap$Entry (id=115)
            key    "_2.cfs"
            value    IndexFileDeleter$RefCount (id=120)
                count    1
        [15]    HashMap$Entry (id=116)
            key    "segments_4"
            value    IndexFileDeleter$RefCount (id=121)
                count    1

(5) Closing the writer

writer.close();

The segments that were merged are deleted through IndexFileDeleter.decRef().
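
The reference counting walked through above can be sketched in a few lines. This is a deliberately simplified model, not Lucene's IndexFileDeleter: the class `RefCountingFileTracker` and its methods are hypothetical, "deleting" a file only records its name, and the scenario (two commits sharing `_0.cfs`) is made up for illustration rather than copied from the debugger output.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of reference-counted file management; not Lucene's IndexFileDeleter.
class RefCountingFileTracker {
    private final Map<String, Integer> refCounts = new HashMap<String, Integer>();
    private final List<String> deleted = new ArrayList<String>();

    /** Registers one more user of the file. */
    void incRef(String fileName) {
        Integer count = refCounts.get(fileName);
        refCounts.put(fileName, count == null ? 1 : count + 1);
    }

    /** Drops one user; the file is "deleted" only when nobody uses it any more. */
    void decRef(String fileName) {
        Integer count = refCounts.get(fileName);
        if (count == null) {
            throw new IllegalStateException("not tracked: " + fileName);
        }
        if (count == 1) {
            refCounts.remove(fileName);
            deleted.add(fileName); // a real implementation would delete the file here
        } else {
            refCounts.put(fileName, count - 1);
        }
    }

    int refCount(String fileName) {
        Integer count = refCounts.get(fileName);
        return count == null ? 0 : count;
    }

    List<String> deletedFiles() { return new ArrayList<String>(deleted); }
}

class RefCountDemo {
    public static void main(String[] args) {
        RefCountingFileTracker tracker = new RefCountingFileTracker();
        // Commit 1 consists of segments_1 and _0.cfs.
        tracker.incRef("segments_1");
        tracker.incRef("_0.cfs");
        // Commit 2 consists of segments_2 and still uses _0.cfs.
        tracker.incRef("segments_2");
        tracker.incRef("_0.cfs");
        // Dropping commit 1 releases its references.
        tracker.decRef("segments_1");
        tracker.decRef("_0.cfs");
        System.out.println(tracker.deletedFiles());     // prints [segments_1]
        System.out.println(tracker.refCount("_0.cfs")); // prints 1
    }
}
```

Because `_0.cfs` is still referenced by the newer commit, only `segments_1` goes away, which is the behaviour the reference counts are there to guarantee.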

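About SimpleFSLock and synchronization between JVMs:

Sometimes, when writing Java programs, different JVMs also need to synchronize with one another to protect a resource that is unique across the whole system.
If the resource lives within a single process, thread synchronization mechanisms suffice.
If it has to be accessed by several processes, however, an inter-process synchronization mechanism is needed. Both Windows and Linux offer many such mechanisms at the operating-system level.
Inter-process synchronization is not Java's strong suit, though, and Lucene's SimpleFSLock offers one way to do it.

The abstract class Lock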

public abstract class Lock {

  // How long obtain(long) waits, in milliseconds, between attempts to acquire the lock.
  public static long LOCK_POLL_INTERVAL = 1000;

  // Pass this value to obtain(long) to keep retrying forever.
  public static final long LOCK_OBTAIN_WAIT_FOREVER = -1;

  // Makes a single attempt to obtain the lock and returns immediately.
  public abstract boolean obtain() throws IOException;

  // Polls obtain() every LOCK_POLL_INTERVAL ms until it succeeds or the timeout is used up.
  public boolean obtain(long lockWaitTimeout) throws LockObtainFailedException, IOException {
    boolean locked = obtain();
    if (lockWaitTimeout < 0 && lockWaitTimeout != LOCK_OBTAIN_WAIT_FOREVER)
      throw new IllegalArgumentException("...");

    long maxSleepCount = lockWaitTimeout / LOCK_POLL_INTERVAL;
    long sleepCount = 0;
    while (!locked) {
      if (lockWaitTimeout != LOCK_OBTAIN_WAIT_FOREVER && sleepCount++ >= maxSleepCount) {
        throw new LockObtainFailedException("Lock obtain timed out.");
      }
      try {
        Thread.sleep(LOCK_POLL_INTERVAL);
      } catch (InterruptedException ie) {
        throw new ThreadInterruptedException(ie);
      }
      locked = obtain();
    }
    return locked;
  }

  public abstract void release() throws IOException;

  public abstract boolean isLocked() throws IOException;

}

The abstract class LockFactory

public abstract class LockFactory {

  public abstract Lock makeLock(String lockName);

  abstract public void clearLock(String lockName) throws IOException;
}

The implementation class SimpleFSLock

class SimpleFSLock extends Lock {

  File lockFile;
  File lockDir;

  public SimpleFSLock(File lockDir, String lockFileName) {
    this.lockDir = lockDir;
    lockFile = new File(lockDir, lockFileName);
  }

  @Override
  public boolean obtain() throws IOException {
    // Make sure the lock directory exists and really is a directory.
    if (!lockDir.exists()) {
      if (!lockDir.mkdirs())
        throw new IOException("Cannot create directory: " + lockDir.getAbsolutePath());
    } else if (!lockDir.isDirectory()) {
      throw new IOException("Found regular file where directory expected: " + lockDir.getAbsolutePath());
    }
    // createNewFile() returns true only if the file did not exist yet, so only one caller wins.
    return lockFile.createNewFile();
  }

  @Override
  public void release() throws LockReleaseFailedException {
    if (lockFile.exists() && !lockFile.delete())
      throw new LockReleaseFailedException("failed to delete " + lockFile);
  }

  @Override
  public boolean isLocked() {
    return lockFile.exists();
  }

}

The implementation class SimpleFSLockFactory

public class SimpleFSLockFactory extends FSLockFactory {

  public SimpleFSLockFactory(String lockDirName) throws IOException {
    setLockDir(new File(lockDirName));
  }

  @Override
  public Lock makeLock(String lockName) {
    if (lockPrefix != null) {
      lockName = lockPrefix + "-" + lockName;
    }
    return new SimpleFSLock(lockDir, lockName);
  }

  @Override
  public void clearLock(String lockName) throws IOException {
    if (lockDir.exists()) {
      if (lockPrefix != null) {
        lockName = lockPrefix + "-" + lockName;
      }
      File lockFile = new File(lockDir, lockName);
      if (lockFile.exists() && !lockFile.delete()) {
        throw new IOException("Cannot delete " + lockFile);
      }
    }
  }

}
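
To see the lock-file idea in isolation, here is a small stand-alone sketch that uses only the JDK. It is an illustration under assumptions, not Lucene code: the class `LockFileSketch`, the file name `write.lock` in a temporary directory, and the 30 ms / 10 ms timings are all chosen for the example. It copies the polling arithmetic of obtain(long) above but returns false on timeout instead of throwing. Note that the Javadoc of File.createNewFile() itself advises against relying on it for file locking and points to java.nio's FileLock instead.

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;

// Hypothetical stand-alone sketch of lock-file mutual exclusion; not Lucene's classes.
class LockFileSketch {

    /** One attempt: succeeds only if the lock file did not exist yet. */
    static boolean obtain(File lockFile) throws IOException {
        return lockFile.createNewFile();
    }

    /** Polls obtain() every pollMillis until it succeeds or timeoutMillis is used up. */
    static boolean obtain(File lockFile, long timeoutMillis, long pollMillis)
            throws IOException, InterruptedException {
        long maxSleepCount = timeoutMillis / pollMillis;
        long sleepCount = 0;
        boolean locked = obtain(lockFile);
        while (!locked) {
            if (sleepCount++ >= maxSleepCount) {
                return false; // timed out: someone else still holds the lock
            }
            Thread.sleep(pollMillis);
            locked = obtain(lockFile);
        }
        return true;
    }

    /** Releases the lock by deleting the lock file. */
    static void release(File lockFile) throws IOException {
        if (lockFile.exists() && !lockFile.delete()) {
            throw new IOException("failed to delete " + lockFile);
        }
    }

    /** Runs the scenario and returns the three outcomes as a comma-separated string. */
    static String demo() {
        try {
            File dir = Files.createTempDirectory("lock-sketch").toFile();
            File lockFile = new File(dir, "write.lock");
            boolean first = obtain(lockFile);          // the lock is free
            boolean second = obtain(lockFile, 30, 10); // already held, so this times out
            release(lockFile);
            boolean third = obtain(lockFile);          // free again after release
            release(lockFile);
            dir.delete();
            return first + "," + second + "," + third;
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints true,false,true
    }
}
```

Because the lock is just a file on disk, a second JVM running the same code against the same directory would see the same outcome as the second attempt here.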

2. Creating the Document object and adding fields (Field)

Code:

Document doc = new Document();

doc.add(new Field("path", f.getPath(), Field.Store.YES, Field.Index.NOT_ANALYZED));

doc.add(new Field("modified",DateTools.timeToString(f.lastModified(), DateTools.Resolution.MINUTE), Field.Store.YES, Field.Index.NOT_ANALYZED));

doc.add(new Field("contents", new FileReader(f)));

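A Document object mainly consists of the following:

- The document's boost. The default is 1; a value greater than 1 marks the document as more important than an ordinary one, a value less than 1 as less important.
- An ArrayList that holds all of the document's fields.
- Each field comprises a field name, a field value, and several flags, corresponding to what is described for the .fnm, .fdx, and .fdt files.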

doc    Document (id=42)
    boost    1.0
    fields    ArrayList (id=44)
        elementData    Object[10] (id=46)
            [0]    Field (id=48)
                binaryLength    0
                binaryOffset    0
                boost    1.0
                fieldsData    "exampledocs//file01.txt"
                isBinary    false
                isIndexed    true
                isStored    true
                isTokenized    false
                lazy    false
                name    "path"
                omitNorms    false
                omitTermFreqAndPositions    false
                storeOffsetWithTermVector    false
                storePositionWithTermVector    false
                storeTermVector    false
                tokenStream    null
            [1]    Field (id=50)
                binaryLength    0
                binaryOffset    0
                boost    1.0
                fieldsData    "200910240957"
                isBinary    false
                isIndexed    true
                isStored    true
                isTokenized    false
                lazy    false
                name    "modified"
                omitNorms    false
                omitTermFreqAndPositions    false
                storeOffsetWithTermVector    false
                storePositionWithTermVector    false
                storeTermVector    false
                tokenStream    null
            [2]    Field (id=52)
                binaryLength    0
                binaryOffset    0
                boost    1.0
                fieldsData    FileReader (id=58)
                isBinary    false
                isIndexed    true
                isStored    false
                isTokenized    true
                lazy    false
                name    "contents"
                omitNorms    false
                omitTermFreqAndPositions    false
                storeOffsetWithTermVector    false
                storePositionWithTermVector    false
                storeTermVector    false
                tokenStream    null
        modCount    3
        size    3

