Lucene and Search Engine Technology (the index Package in Detail)


An Analysis of the index Package

Original article by windshow, TjuAILab

A Lucene index is built on a few fundamental concepts: the index, the document, the field, and the term.

An index is a sequence of documents;

     a document is a sequence of fields;

     a field is a named sequence of terms;

     a term is simply a string.

The same string occurring in two different fields is considered two different terms. A term is therefore represented as a pair of strings: the first names the field, and the second is the text within that field. Since Term is so central, let's get acquainted with it first.

Getting to Know Term

The best way is to read its source.

public final class Term implements Comparable, java.io.Serializable {

  String field;

  String text;

  public Term(String fld, String txt) { this(fld, txt, true); }

  Term(String fld, String txt, boolean intern) {
    field = intern ? fld.intern() : fld;   // field names are interned
    text = txt;
  }

  public final String field() { return field; }

  public final String text() { return text; }

  // overrides equals(): two Terms are equal iff field and text both match
  // (field strings are interned, so == suffices for the field part)
  public final boolean equals(Object o) {
    if (o == null) return false;
    Term other = (Term)o;
    return field == other.field && text.equals(other.text);
  }

  // overrides hashCode()
  public final int hashCode() { return field.hashCode() + text.hashCode(); }

  public int compareTo(Object other) { return compareTo((Term)other); }

  // orders first by field name, then by the text within the field
  public final int compareTo(Term other) {
    if (field == other.field)
      return text.compareTo(other.text);
    else
      return field.compareTo(other.field);
  }

  final void set(String fld, String txt) { field = fld; text = txt; }

  public final String toString() { return field + ":" + text; }

  private void readObject(java.io.ObjectInputStream in)
      throws java.io.IOException, ClassNotFoundException {
    in.defaultReadObject();
    field = field.intern();   // restore the intern invariant after deserialization
  }
}

From the code we can see that a Term is essentially a pair <fieldName, text>.
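As a quick illustration of the ordering that compareTo() defines (the field names and texts below are invented for the example):

// Terms sort first by field name, then by text within the field.
Term a = new Term("contents", "apple");
Term b = new Term("contents", "boy");
Term c = new Term("title", "apple");
System.out.println(a.compareTo(b) < 0);   // true: same field, "apple" < "boy"
System.out.println(b.compareTo(c) < 0);   // true: "contents" < "title"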

The Inverted Index
To make term-based search efficient, terms are stored statically in the index. Lucene's index is an inverted index: given a term, it can list the documents that contain it. This is exactly the inverse of the natural relationship, in which a document lists its terms.

Field Types
In Lucene, a field's text may be stored verbatim in the index, in non-inverted form. A field that has been inverted is called indexed. A field may be both stored and indexed. The field's text may be broken into many terms for indexing, or treated as a single term. Most fields are tokenized, but it is sometimes useful to index an identifier field as a single term, as the sketch below shows.
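Here is how those choices map onto Lucene 1.x's Field factory methods (a minimal sketch; the field names and values are invented for the example):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class FieldTypesDemo {
  public static Document makeDoc() {
    Document doc = new Document();
    doc.add(Field.Keyword("isbn", "7-5083-1234-5"));  // stored, indexed, NOT tokenized: one Term
    doc.add(Field.Text("title", "Lucene in Action")); // stored, indexed, tokenized
    doc.add(Field.UnStored("contents", "body text")); // indexed, tokenized, not stored
    doc.add(Field.UnIndexed("price", "29.99"));       // stored only, never indexed
    return doc;
  }
}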

The index Package, Class by Class

CompoundFileReader

       Provides methods for reading .cfs files.

CompoundFileWriter

       Builds .cfs files. Starting with Lucene 1.4, the per-segment files described below (.tii, .tis, and so on) can be merged into a single .cfs file.

       Its structure is as follows (a trailing ^Count marks a part that repeats Count times, as in the Lucene file-format documentation):

Compound (.cfs) --> FileCount, <DataOffset, FileName>^FileCount, FileData^FileCount

FileCount --> VInt

DataOffset --> Long

FileName --> String

FileData --> raw file data
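A hedged sketch of walking that directory structure by hand with the Lucene 1.4 store API (Directory.openFile returns an org.apache.lucene.store.InputStream offering readVInt/readLong/readString; the compound-file name here is hypothetical):

import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.InputStream;

public class CfsDirectoryDump {
  public static void main(String[] args) throws Exception {
    Directory dir = FSDirectory.getDirectory("testIndexPackage", false);
    InputStream in = dir.openFile("_1.cfs");   // hypothetical compound file
    try {
      int fileCount = in.readVInt();           // FileCount
      for (int i = 0; i < fileCount; i++) {
        long dataOffset = in.readLong();       // DataOffset of one sub-file
        String fileName = in.readString();     // its FileName
        System.out.println(fileName + " @ " + dataOffset);
      }
    } finally {
      in.close();
      dir.close();
    }
  }
}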

DocumentWriter

     Builds the .frq, .prx, and .f (norm) files.

1. FreqFile (.frq) --> <TermFreqs, SkipData>^TermCount

TermFreqs --> <TermFreq>^DocFreq

TermFreq --> DocDelta, Freq?

SkipData --> <SkipDatum>^(DocFreq/SkipInterval)

SkipDatum --> DocSkip, FreqSkip, ProxSkip

DocDelta, Freq, DocSkip, FreqSkip, ProxSkip --> VInt
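The DocDelta encoding packs the document gap and a frequency flag into one VInt: DocDelta/2 is the gap to the previous document number, and an odd DocDelta means the frequency is one; otherwise Freq follows as its own VInt. A minimal decoding sketch (the input stream and docFreq count are assumed to be given):

static void readTermFreqs(org.apache.lucene.store.InputStream in, int docFreq)
    throws java.io.IOException {
  int doc = 0;
  for (int i = 0; i < docFreq; i++) {
    int docDelta = in.readVInt();  // gap*2, with the low bit flagging freq == 1
    doc += docDelta >>> 1;
    int freq = ((docDelta & 1) != 0) ? 1 : in.readVInt();
    System.out.println("doc=" + doc + " freq=" + freq);
  }
}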









2. The .prx file contains the lists of positions that each term occurs at within documents.

ProxFile (.prx) --> <TermPositions>^TermCount

TermPositions --> <Positions>^DocFreq

Positions --> <PositionDelta>^Freq

PositionDelta --> VInt
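Positions are likewise delta-coded within each document: each PositionDelta is the difference from the previous occurrence's position. A companion sketch to the .frq reader above:

static int[] readPositions(org.apache.lucene.store.InputStream in, int freq)
    throws java.io.IOException {
  int[] positions = new int[freq];
  int position = 0;
  for (int i = 0; i < freq; i++) {
    position += in.readVInt();   // PositionDelta
    positions[i] = position;
  }
  return positions;
}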









3. There's a norm file for each indexed field, with a byte for each document. The .f[0-9]* file contains, for each document, a byte that encodes a value that is multiplied into the score for hits on that field:

Norms (.f[0-9]*) --> <Byte>^SegSize

Each byte encodes a floating point value. Bits 0-2 contain the 3-bit mantissa, and bits 3-7 contain the 5-bit exponent.

These are converted to an IEEE single float value as follows:

1.    If the byte is zero, use a zero float.

2.    Otherwise, set the sign bit of the float to zero;

3.    add 48 to the exponent and use this as the float's exponent;

4.    map the mantissa to the high-order 3 bits of the float's mantissa; and

5.    set the low-order 21 bits of the float's mantissa to zero.
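Following those steps literally gives a decoder like the sketch below (Lucene's Similarity class precomputes all 256 values into a lookup table; this just spells the rule out):

// A minimal sketch of the norm-byte decoding rule above.
// Byte layout: low 3 bits = mantissa, next 5 bits = exponent.
static float byteToFloat(byte b) {
  if (b == 0)                        // rule 1: a zero byte decodes to 0.0f
    return 0.0f;
  int mantissa = b & 7;              // bits 0-2
  int exponent = (b >> 3) & 31;      // bits 3-7
  int bits = ((exponent + 48) << 24) | (mantissa << 21);  // rules 2-5
  return Float.intBitsToFloat(bits);
}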









FieldInfo

      Holds part of a field's metadata; it is a quadruple <name, isIndexed, number, storeTermVector>.

FieldInfos

     Describes whether the fields of a document are indexed. Each segment has its own FieldInfos file. The class is documented as thread-safe for multiple threads, although only one thread may add documents at any moment, and no other readers or writers may enter meanwhile. Yet it maintains two containers, an ArrayList and a HashMap, neither of which is synchronized, so where the claimed thread safety comes from is not obvious. (An open question from the author.)

Looking at its write function, the .fnm file is laid out as:

     FieldInfos (.fnm) --> FieldsCount, <FieldName, FieldBits>^FieldsCount

                        FieldsCount --> VInt

                        FieldName --> String

                        FieldBits --> Byte

FieldsReader

    Reads the .fdx and .fdt files.

FieldsWriter

     Creates two files, .fdx and .fdt.

     The field index (.fdx) holds, for each document, a pointer (really just an integer offset) into the field data file:

<FieldValuesPosition>^SegSize

FieldValuesPosition --> UInt64

             Since each entry is a fixed eight bytes, the field pointer of document n sits at offset n*8.

    The field data file (.fdt) holds the stored fields of every document, laid out as follows:

<DocFieldData>^SegSize

DocFieldData --> FieldCount, <FieldNum, Bits, Value>^FieldCount

FieldCount --> VInt

FieldNum --> VInt

In Lucene <= 1.4:

Bits --> Byte

Value --> String

Only the low-order bit of Bits is used: it is one for tokenized fields, and zero for non-tokenized fields.
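A hedged sketch of fetching document n's stored fields by hand, under the same store-API assumptions as the .cfs sketch above (Lucene's FieldsReader does this bookkeeping for real):

static void dumpStoredFields(org.apache.lucene.store.Directory dir,
                             String segment, int n) throws java.io.IOException {
  org.apache.lucene.store.InputStream fdx = dir.openFile(segment + ".fdx");
  org.apache.lucene.store.InputStream fdt = dir.openFile(segment + ".fdt");
  try {
    fdx.seek(n * 8L);                  // fixed-width entries: doc n lives at n*8
    fdt.seek(fdx.readLong());          // FieldValuesPosition
    int fieldCount = fdt.readVInt();   // FieldCount
    for (int i = 0; i < fieldCount; i++) {
      int fieldNum = fdt.readVInt();   // index into the .fnm field names
      byte bits = fdt.readByte();      // low bit: tokenized or not
      String value = fdt.readString(); // the stored value
      System.out.println(fieldNum + " -> " + value);
    }
  } finally {
    fdx.close();
    fdt.close();
  }
}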

FilterIndexReader

     Extends IndexReader. It wraps another IndexReader as its data source and delegates to it, so subclasses can override just the methods whose behavior they want to change.

IndexReader

     An abstract class. It reads a Directory in which an index has been built and can return all sorts of information from it: terms, term positions, and so on.

IndexWriter

    IndexWriter creates and maintains an index.

    The third argument of its constructor determines whether a brand-new index is created, or an existing one is opened for adding new documents.

    New documents are added with addDocument(); once you have finished adding, call close().

    If no more documents will be added and query performance matters, call optimize() before close() to optimize the index, as in the sketch below.
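A minimal usage sketch (the directory name and field value are invented for the example):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

public class IndexWriterDemo {
  public static void main(String[] args) throws Exception {
    // third argument true => create a fresh index under "demoIndex"
    IndexWriter writer = new IndexWriter("demoIndex", new StandardAnalyzer(), true);
    Document doc = new Document();
    doc.add(Field.Text("contents", "i love tianjin"));
    writer.addDocument(doc);
    writer.optimize();   // collapse segments for faster searching
    writer.close();
  }
}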

    Structure of the deletable file:

    A file named "deletable" contains the names of files that are no longer used by the index, but which could not be deleted. This is only used on Win32, where a file may not be deleted while it is still open. On other platforms the file contains only null bytes.

Deletable --> DeletableCount, <DeletableName>^DeletableCount

DeletableCount --> UInt32

DeletableName --> String

MultipleTermPositions

Used exclusively by PhrasePrefixQuery in the search package.

MultiReader

Extends IndexReader; reads several indexes at once and presents their combined contents.

SegmentInfo

     Basic information about a segment: a triple <segmentName, docCount, dir>.

SegmentInfos

     Extends Vector, so it is simply a vector whose elements are SegmentInfo objects. It builds the segments file, of which every index has exactly one; the class provides read and write methods for it.

     Its contents are as follows:

     Segments --> Format, Version, NameCounter, SegCount, <SegName, SegSize>^SegCount

Format, NameCounter, SegCount, SegSize --> UInt32

Version --> UInt64

SegName --> String

Format is -1 in Lucene 1.4.

Version counts how often the index has been changed by adding or deleting documents.

NameCounter is used to generate names for new segment files.

SegName is the name of the segment, and is used as the file name prefix for all of the files that compose the segment's index.

SegSize is the number of documents contained in the segment index.
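A hedged sketch of reading the segments file by hand, assuming the Lucene 1.4 layout above where the file opens with Format = -1 (SegmentInfos.read is the real implementation):

static void dumpSegments(org.apache.lucene.store.Directory dir)
    throws java.io.IOException {
  org.apache.lucene.store.InputStream in = dir.openFile("segments");
  try {
    int format = in.readInt();       // -1 in Lucene 1.4
    long version = in.readLong();    // bumped on every add/delete
    int nameCounter = in.readInt();  // used to name new segments
    int segCount = in.readInt();
    for (int i = 0; i < segCount; i++) {
      String segName = in.readString();
      int segSize = in.readInt();    // documents in this segment
      System.out.println(segName + ": " + segSize + " docs");
    }
  } finally {
    in.close();
  }
}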









SegmentMergeInfo

    Records bookkeeping information used while merging segments.

SegmentMergeQueue

    Extends PriorityQueue (ordered ascending).

SegmentMerger

Merges several segments into a single segment; instances are created by IndexWriter.addIndexes(). If compoundFile is true, once the merge completes it creates a .cfs file and folds nearly all of the other files into it.

SegmentReader

Extends IndexReader and provides many methods for reading an index.

SegmentTermDocs

Implements TermDocs.

SegmentTermEnum

   Extends TermEnum.

SegmentTermPositions

   Implements TermPositions.

SegmentTermVector

  Implements TermFreqVector.

Term

     A Term is a <fieldName, text> pair. Fields come in several kinds, but every kind carries at least a <fieldName, fieldValue> pair, which is how the two are linked. A Term is the unit of search; its text can be any string: a word, a date, an email address, a URL, and so on.

TermDocs

     TermDocs is an interface. It enumerates <document, frequency> pairs for a given term.

     In each <document, frequency> pair, the document part names a document containing the term; documents are identified by document number. The frequency part gives how many times the term occurs in that document. The pairs are sorted by document number.

TermEnum

     An abstract class used to enumerate terms. The enumeration is ordered by Term.compareTo(), so every term in it is greater than all the terms that came before it.

TermFreqVector

     An interface for accessing the term vector of one field of a document.

TermInfo

     Stores the bookkeeping information of a term; effectively a five-tuple <Term, docFreq, freqPointer, proxPointer, skipOffset>.

TermInfosReader

     Not yet studied in detail; to be revisited after SegmentTermEnum.

TermInfosWriter

     Builds the .tis and .tii files. Together these make up the term dictionary.

1.     The term infos, or tis file.

TermInfoFile (.tis)--> TIVersion, TermCount, IndexInterval, SkipInterval, TermInfos

TIVersion --> UInt32

TermCount --> UInt64

IndexInterval --> UInt32

SkipInterval --> UInt32

TermInfos --> <TermInfo>^TermCount

TermInfo --> <Term, DocFreq, FreqDelta, ProxDelta, SkipDelta>

Term --> <PrefixLength, Suffix, FieldNum>

Suffix --> String

PrefixLength, DocFreq, FreqDelta, ProxDelta, SkipDelta --> VInt

This file is sorted by Term. Terms are ordered first lexicographically by the term's field name, and within that lexicographically by the term's text.

TIVersion names the version of the format of this file and is -2 in Lucene 1.4.

Term text prefixes are shared. The PrefixLength is the number of initial characters from the previous term which must be pre-pended to a term's suffix in order to form the term's text. Thus, if the previous term's text was "bone" and the term is "boy", the PrefixLength is two and the suffix is "y".
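In code, rebuilding a term's text from one record is just (values taken from the example above):

// Reconstructing term text under prefix sharing.
String prevText = "bone";                                     // previous term
int prefixLength = 2;                                         // read as a VInt
String suffix = "y";                                          // read as a String
String text = prevText.substring(0, prefixLength) + suffix;   // "boy"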

FieldNum determines the term's field, whose name is stored in the .fnm file.

DocFreq is the count of documents which contain the term.

FreqDelta determines the position of this term's TermFreqs within the .frq file. In particular, it is the difference between the position of this term's data in that file and the position of the previous term's data (or zero, for the first term in the file).

ProxDelta determines the position of this term's TermPositions within the .prx file. In particular, it is the difference between the position of this term's data in that file and the position of the previous term's data (or zero, for the first term in the file).

SkipDelta determines the position of this term's SkipData within the .frq file. In particular, it is the number of bytes after TermFreqs that the SkipData starts. In other words, it is the length of the TermFreq data.

2.     The term info index, or .tii file.

This contains every IndexIntervalth entry from the .tis file, along with its location in the "tis" file. This is designed to be read entirely into memory and used to provide random access to the "tis" file.

The structure of this file is very similar to the .tis file, with the addition of one item per record, the IndexDelta.

TermInfoIndex (.tii)--> TIVersion, IndexTermCount, IndexInterval, SkipInterval, TermIndices

TIVersion --> UInt32

IndexTermCount --> UInt64

IndexInterval --> UInt32

SkipInterval --> UInt32

TermIndices --> <TermInfo, IndexDelta>^IndexTermCount

IndexDelta --> VLong

IndexDelta determines the position of this term's TermInfo within the .tis file. In particular, it is the difference between the position of this term's entry in that file and the position of the previous term's entry.

TODO: document skipInterval information

             IndexDelta is the one element the .tii file adds beyond the .tis format.

TermPositions

      An interface extending TermDocs. It enumerates <document, frequency, <position>*> triples for a term. The document and frequency parts are the same as in TermDocs; the positions part lists, in order, the position of each occurrence of the term within the document. The triple is the postings-entry representation of an inverted file.
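A minimal sketch of walking a term's postings with positions (the index path, field, and text are made-up values):

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermPositions;

public class TermPositionsDemo {
  public static void main(String[] args) throws Exception {
    IndexReader reader = IndexReader.open("demoIndex");
    TermPositions tp = reader.termPositions(new Term("contents", "love"));
    while (tp.next()) {                 // one <doc, freq, positions> triple
      System.out.print("doc=" + tp.doc() + " freq=" + tp.freq() + " pos=");
      for (int i = 0; i < tp.freq(); i++)
        System.out.print(tp.nextPosition() + " ");
      System.out.println();
    }
    tp.close();
    reader.close();
  }
}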

TermPositionVector

      Extends TermFreqVector, adding the ability to report the positions at which each term occurs.

TermVectorsReader

      Reads the .tvd, .tvf, and .tvx files.

TermVectorsWriter

      Builds the .tvd, .tvf, and .tvx files, which together make up the term vectors.

1.    The Document Index or .tvx file.

This contains, for each document, a pointer to the document data in the Document (.tvd) file.

DocumentIndex (.tvx) --> TVXVersion, <DocumentPosition>^NumDocs

TVXVersion --> Int

DocumentPosition --> UInt64

This is used to find the position of the Document in the .tvd file.

2.    The Document or .tvd file.

This contains, for each document, the number of fields, a list of the fields with term vector info and finally a list of pointers to the field information in the .tvf (Term Vector Fields) file.

Document (.tvd) --> TVDVersion, <NumFields, FieldNums, FieldPositions>^NumDocs

TVDVersion --> Int

NumFields --> VInt

FieldNums --> <FieldNumDelta>^NumFields

FieldNumDelta --> VInt

FieldPositions --> <FieldPosition>^NumFields

FieldPosition --> VLong

The .tvd file is used to map out the fields that have term vectors stored and where the field information is in the .tvf file.

3.    The Field or .tvf file.

This file contains, for each field that has a term vector stored, a list of the terms and their frequencies.

Field (.tvf) --> TVFVersion, <NumTerms, NumDistinct, TermFreqs>^NumFields

TVFVersion --> Int

NumTerms --> VInt

NumDistinct --> VInt -- Future Use

TermFreqs --> <TermText, TermFreq>^NumTerms

TermText --> <PrefixLength, Suffix>

PrefixLength --> VInt

Suffix --> String

TermFreq --> VInt

Term text prefixes are shared. The PrefixLength is the number of initial characters from the previous term which must be pre-pended to a term's suffix in order to form the term's text. Thus, if the previous term's text was "bone" and the term is "boy", the PrefixLength is two and the suffix is "y".

That covers every class in the index package; now let's revisit it through code.

The following program closes out this chapter's discussion.

package org.apache.lucene.index;

import org.apache.lucene.analysis.*;

import org.apache.lucene.analysis.standard.*;

import org.apache.lucene.store.*;

import org.apache.lucene.document.*;

import org.apache.lucene.demo.*;

import org.apache.lucene.search.*;

import java.io.*;

/**This program tries to exercise as many classes of the Lucene index

*package as possible. The index-package classes used include:

*DocumentWriter (the user-facing equivalent is IndexWriter)

*FieldInfo (and FieldInfos)

*SegmentTermDocs (implements TermDocs)

*SegmentReader (extends IndexReader; the user-facing class is IndexReader)

*SegmentMerger

*SegmentTermEnum (extends TermEnum)

*SegmentTermPositions (implements TermPositions)

*SegmentTermVector (implements TermFreqVector)

*/









public class TestIndexPackage

{

  //adds a Document to the index

  public static void indexDocument(String segment,String fileName) throws Exception

  {

    //the second argument controls whether the directory is created when it cannot be opened

    Directory directory = FSDirectory.getDirectory("testIndexPackage",false);

    Analyzer analyzer = new SimpleAnalyzer();

    //the last argument (1000) caps how many tokens each Field may contribute

    DocumentWriter writer = new DocumentWriter(directory,analyzer,Similarity.getDefault(),1000);

    File file = new File(fileName);

    //FileDocument wraps the file into a Document with three fields (path, modified, contents)

    Document doc = FileDocument.Document(file);

    writer.addDocument(segment,doc);

    directory.close();

  }

  //merges two segments into one

  public static void merge(String segment1,String segment2,String segmentMerged)throws Exception

  {

    Directory directory = FSDirectory.getDirectory("testIndexPackage",false);

    SegmentReader segmentReader1=new SegmentReader(new SegmentInfo(segment1,1,directory));

    SegmentReader segmentReader2=new SegmentReader(new SegmentInfo(segment2,1,directory));

    //the third argument controls whether a .cfs compound file is created

    SegmentMerger segmentMerger =new SegmentMerger(directory,segmentMerged,false);

    segmentMerger.add(segmentReader1);

    segmentMerger.add(segmentReader2);

    segmentMerger.merge();

    segmentMerger.closeReaders();

    directory.close();

  }

  //prints everything inside one segment (a sub-index) of the index

  public static void printSegment(String segment) throws Exception

  {

    Directory directory =FSDirectory.getDirectory("testIndexPackage",false);

    SegmentReader segmentReader = new SegmentReader(new SegmentInfo(segment,1,directory));

    //display documents

    for(int i=0;i<segmentReader.numDocs();i++)

      System.out.println(segmentReader.document(i));

    TermEnum termEnum = segmentReader.terms();//actually a SegmentTermEnum

    //display term and term positions,termDocs

    while(termEnum.next())

    {

      System.out.print(termEnum.term().toString2());//toString2() is a helper the author added to Term; stock Lucene only has toString()

      System.out.println(" DocumentFrequency=" + termEnum.docFreq());

      TermPositions termPositions= segmentReader.termPositions(termEnum.term());

      int i=0;

      while(termPositions.next())

      {

        System.out.println((i++)+"->"+termPositions);

      }

      TermDocs termDocs=segmentReader.termDocs(termEnum.term());//actually a SegmentTermDocs

      while (termDocs.next())

      {

        System.out.println((i++)+"->"+termDocs);

      }









    }

    //display field info

    FieldInfos fieldInfos= segmentReader.fieldInfos;

    FieldInfo pathFieldInfo = fieldInfos.fieldInfo("path");

    FieldInfo modifiedFieldInfo = fieldInfos.fieldInfo("modified");

    FieldInfo contentsFieldInfo =fieldInfos.fieldInfo("contents");

    System.out.println(pathFieldInfo);

    System.out.println(modifiedFieldInfo);

    System.out.println(contentsFieldInfo);

   //display TermFreqVector

   for(int i=0;i<segmentReader.numDocs();i++)

   {

     //the tokenized terms of "contents" were stored in a TermFreqVector

     TermFreqVector termFreqVector=segmentReader.getTermFreqVector(i,"contents");

     System.out.println(termFreqVector);

   }

  }

  public static void main(String [] args)

  {

    try

    {

      Directory directory = FSDirectory.getDirectory("testIndexPackage",true);

      directory.close();

      indexDocument("segmentOne","e://lucene//test.txt");

      //printSegment("segmentOne");

      indexDocument("segmentTwo","e://lucene//test2.txt");

     // printSegment("segmentTwo");

      merge("segmentOne","segmentTwo","merge");

      printSegment("merge");

    }

    catch(Exception e)

    {

      System.out.println("caught a "+e.getCause()+"/n with message:"+e.getMessage());

      e.printStackTrace();

    }









  }

}

The output looks like this:

Document<Text<path:e:/lucene/test.txt> Keyword<modified:0eg4e221c>>

Document<Text<path:e:/lucene/test2.txt> Keyword<modified:0eg4ee8b4>>

<Term:FieldName,text>=<contents,china> DocumentFrequency=1

0-><doc,TermFrequency,Pos>:< doc=0, TermFrequency=1 Pos=2>

1-><docNumber,freq>=<0,1>

<Term:FieldName,text>=<contents,i> DocumentFrequency=2

0-><doc,TermFrequency,Pos>:< doc=0, TermFrequency=2 Pos=0,3>

1-><doc,TermFrequency,Pos>:< doc=1, TermFrequency=1 Pos=0>

2-><docNumber,freq>=<0,2>

3-><docNumber,freq>=<1,1>

<Term:FieldName,text>=<contents,love> DocumentFrequency=2

0-><doc,TermFrequency,Pos>:< doc=0, TermFrequency=2 Pos=1,4>

1-><doc,TermFrequency,Pos>:< doc=1, TermFrequency=1 Pos=1>

2-><docNumber,freq>=<0,2>

3-><docNumber,freq>=<1,1>

<Term:FieldName,text>=<contents,nankai> DocumentFrequency=1

0-><doc,TermFrequency,Pos>:< doc=1, TermFrequency=1 Pos=2>

1-><docNumber,freq>=<1,1>

<Term:FieldName,text>=<contents,tianjin> DocumentFrequency=1

0-><doc,TermFrequency,Pos>:< doc=0, TermFrequency=1 Pos=5>

1-><docNumber,freq>=<0,1>

<Term:FieldName,text>=<modified,0eg4e221c> DocumentFrequency=1

0-><doc,TermFrequency,Pos>:< doc=0, TermFrequency=1 Pos=0>

1-><docNumber,freq>=<0,1>

<Term:FieldName,text>=<modified,0eg4ee8b4> DocumentFrequency=1

0-><doc,TermFrequency,Pos>:< doc=1, TermFrequency=1 Pos=0>

1-><docNumber,freq>=<1,1>

<Term:FieldName,text>=<path,e> DocumentFrequency=2

0-><doc,TermFrequency,Pos>:< doc=0, TermFrequency=1 Pos=0>

1-><doc,TermFrequency,Pos>:< doc=1, TermFrequency=1 Pos=0>

2-><docNumber,freq>=<0,1>

3-><docNumber,freq>=<1,1>

<Term:FieldName,text>=<path,lucene> DocumentFrequency=2

0-><doc,TermFrequency,Pos>:< doc=0, TermFrequency=1 Pos=1>

1-><doc,TermFrequency,Pos>:< doc=1, TermFrequency=1 Pos=1>

2-><docNumber,freq>=<0,1>

3-><docNumber,freq>=<1,1>

<Term:FieldName,text>=<path,test> DocumentFrequency=2

0-><doc,TermFrequency,Pos>:< doc=0, TermFrequency=1 Pos=2>

1-><doc,TermFrequency,Pos>:< doc=1, TermFrequency=1 Pos=2>

2-><docNumber,freq>=<0,1>

3-><docNumber,freq>=<1,1>

<Term:FieldName,text>=<path,txt> DocumentFrequency=2

0-><doc,TermFrequency,Pos>:< doc=0, TermFrequency=1 Pos=3>

1-><doc,TermFrequency,Pos>:< doc=1, TermFrequency=1 Pos=3>

2-><docNumber,freq>=<0,1>

3-><docNumber,freq>=<1,1>

<fieldName,isIndexed,fieldNumber,storeTermVector>=<path,true,3,false>

<fieldName,isIndexed,fieldNumber,storeTermVector>=<modified,true,2,false>

<fieldName,isIndexed,fieldNumber,storeTermVector>=<contents,true,1,true>

{contents: china/1, i/2, love/2, tianjin/1}

{contents: i/1, love/1, nankai/1}

Study this output carefully and Lucene's underlying index structure will become much clearer.

Reference: Lucene File Formats