Lucene and Search Engine Technology (The Document Package in Detail)

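Analysis of the Document Package

Understanding Document

Lucene does not define a data source. Instead it defines a generic document structure: the Document class in Lucene's document package (org.apache.lucene.document).

A Document corresponds to one item fetched during a web crawl: one MS Word file, one PDF, one HTML page, one text file, and so on. This design allows very flexible applications: as long as a front-end converter turns the data source into the Document structure, Lucene can work with it.

Internally, a Document maintains a Vector of Field objects.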

Let's look at the core source of Document (declarations only, method bodies omitted):

public final class Document implements java.io.Serializable {
  List fields = new Vector(); // member variable holding the fields

  // boost expresses how important this document is; it defaults to 1.0
  // and applies to every field in the document
  private float boost = 1.0f;

  public Document() {}
  public void setBoost(float boost) { this.boost = boost; }
  public float getBoost() { return boost; }

  // Method bodies omitted
  public final void add(Field field);
  public final void removeField(String name);
  public final void removeFields(String name);
  public final Field getField(String name);
  public final String get(String name);
  public final Enumeration fields();
  public final Field[] getFields(String name);
  public final String[] getValues(String name);
  public final String toString();
}
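
The accessor methods in the Document listing are easier to follow with a small model. The following is a minimal sketch, not Lucene's implementation: MiniDocument and MiniField are hypothetical stand-ins that keep fields in insertion order. It assumes, as the Lucene Javadoc describes, that get(name) returns the first value added under a name while getValues(name) returns all of them. Returning an empty array when nothing matches is this sketch's own choice.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Illustrative model only: MiniDocument and MiniField are hypothetical
// stand-ins, not classes from Lucene.
class MiniDocumentDemo {

    static final class MiniField {
        final String name;
        final String value;

        MiniField(String name, String value) {
            this.name = name;
            this.value = value;
        }
    }

    static final class MiniDocument {
        // Fields are kept in the order they were added.
        private final List<MiniField> fields = new ArrayList<MiniField>();

        void add(MiniField field) {
            fields.add(field);
        }

        // Value of the first field added under this name, or null if none.
        String get(String name) {
            for (MiniField f : fields) {
                if (f.name.equals(name)) {
                    return f.value;
                }
            }
            return null;
        }

        // Values of every field added under this name, in insertion order.
        // Assumption of this sketch: an empty array when nothing matches.
        String[] getValues(String name) {
            List<String> values = new ArrayList<String>();
            for (MiniField f : fields) {
                if (f.name.equals(name)) {
                    values.add(f.value);
                }
            }
            return values.toArray(new String[0]);
        }

        // Removes all fields with this name.
        void removeFields(String name) {
            for (Iterator<MiniField> it = fields.iterator(); it.hasNext();) {
                if (it.next().name.equals(name)) {
                    it.remove();
                }
            }
        }
    }

    // Builds a document in which the "author" field occurs twice.
    static MiniDocument sample() {
        MiniDocument doc = new MiniDocument();
        doc.add(new MiniField("title", "Lucene in practice"));
        doc.add(new MiniField("author", "first author"));
        doc.add(new MiniField("author", "second author"));
        return doc;
    }

    public static void main(String[] args) {
        MiniDocument doc = sample();
        System.out.println(doc.get("author"));              // first value only
        System.out.println(doc.getValues("author").length); // count of all values
        doc.removeFields("author");
        System.out.println(doc.get("author"));              // null once removed
    }
}
```

The point of the model is that a name does not have to be unique inside one document: the same name can be added several times, and the single-value accessor simply picks the first match.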

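Understanding Field

As mentioned, a Document holds a Vector of Field objects. So what is a Field? To a first approximation, a Field is a <name,value> pair.

name is the name of the field, for example title, body, subject, data, and so on. value is the text. Reading the source definition makes this concrete.

(Field is a central concept in Lucene, so it is worth going through the source.)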

public final class Field implements java.io.Serializable {
  private String name = "body";
  private String stringValue = null;
  private boolean storeTermVector = false;
  private Reader readerValue = null;
  private boolean isStored = false;
  private boolean isIndexed = true;
  private boolean isTokenized = true;

  /* For a long time I did not understand what boost was for. It is used later,
   * during relevance ranking. At query time every term belongs to a field, and
   * the same term carries a different weight depending on which field it is in:
   * a term in field:<title> should count for more than the same term in
   * field:<content>. boost is how that weight is expressed, for example by
   * giving field:<title> a larger boost value.
   * Note that Document also has a boost that applies to the whole document. */
  private float boost = 1.0f;
  public void setBoost(float boost) { this.boost = boost; }
  public float getBoost() { return boost; }

  public static final Field Keyword(String name, String value) {
    return new Field(name, value, true, true, false); }
  public static final Field UnIndexed(String name, String value) {
    return new Field(name, value, true, false, false); }
  public static final Field Text(String name, String value) {
    return Text(name, value, false); }
  public static final Field Keyword(String name, Date value) {
    return new Field(name, DateField.dateToString(value), true, true, false); }
  public static final Field Text(String name, String value, boolean storeTermVector) {
    return new Field(name, value, true, true, true, storeTermVector); }
  public static final Field UnStored(String name, String value) {
    return UnStored(name, value, false); }
  public static final Field UnStored(String name, String value, boolean storeTermVector) {
    return new Field(name, value, false, true, true, storeTermVector); }
  public static final Field Text(String name, Reader value) {
    return Text(name, value, false); }
  public static final Field Text(String name, Reader value, boolean storeTermVector) {
    Field f = new Field(name, value);
    f.storeTermVector = storeTermVector;
    return f;
  }

  public String name() { return name; }
  public String stringValue() { return stringValue; }
  public Reader readerValue() { return readerValue; }

  public Field(String name, String string,
               boolean store, boolean index, boolean token) {
    this(name, string, store, index, token, false);
  }

  // The lowest-level constructor (body omitted)
  public Field(String name, String string,
               boolean store, boolean index, boolean token, boolean storeTermVector);

  // Reader-valued constructor (body omitted)
  Field(String name, Reader reader);

  public final boolean isStored() { return isStored; }
  public final boolean isIndexed() { return isIndexed; }
  public final boolean isTokenized() { return isTokenized; }
  public final boolean isTermVectorStored() { return storeTermVector; }

  public final String toString();   // body omitted
  public final String toString2();  // my own addition: returns the six-tuple
}
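
The remark about boost in the Field source can be made concrete with a little arithmetic. This is a minimal sketch, assuming (as Lucene's scoring documentation describes) that the index-time weight of a field multiplies the document boost by the field boost. Length normalization and the lossy encoding of the stored norm are left out, and BoostSketch and combinedBoost are hypothetical names, not Lucene API.

```java
// Illustrative arithmetic only; not Lucene code.
class BoostSketch {

    // Assumption: the index-time boost of one field is the document boost
    // multiplied by that field's own boost.
    static float combinedBoost(float documentBoost, float fieldBoost) {
        return documentBoost * fieldBoost;
    }

    public static void main(String[] args) {
        float documentBoost = 1.5f; // as set with Document.setBoost
        float titleBoost = 2.0f;    // as set with Field.setBoost on the title field
        float contentBoost = 1.0f;  // the default field boost

        System.out.println("title:   " + combinedBoost(documentBoost, titleBoost));
        System.out.println("content: " + combinedBoost(documentBoost, contentBoost));
    }
}
```

With these numbers a term in title ends up with twice the weight of the same term in content, while the document boost scales both fields together.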

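The code looks long, but a quick read shows that a Field is really a six-tuple; describing it earlier as a <name,value> pair was a simplification.

In six-tuple form a Field is <name,stringValue,isStored,isIndexed,isTokenized,isTermVectorStored>. Field offers several ways to construct one.

The main ones are the following: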

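Method	Tokenized	Indexed	Stored	Purpose
Field.Text(String name, String value)	Yes	Yes	Yes	Tokenized, indexed, and stored, e.g. title, subject
Field.Text(String name, Reader value)	Yes	Yes	No	Tokenized and indexed; the Reader's content is not stored (isStored keeps its default of false), and no term vector is stored for this field
Field.Text(String name, String value, boolean storeTermVector)	Yes	Yes	Yes	Tokenized, indexed, and stored, e.g. title, subject; adds a flag that controls term vector storage
Field.Text(String name, Reader value, boolean storeTermVector)	Yes	Yes	No	Same as the Reader form above; adds a flag that controls term vector storage
Field.Keyword(String name, String value)	No	Yes	Yes	Not tokenized, indexed, and stored, e.g. date, url
Field.Keyword(String name, Date value)	No	Yes	Yes	Not tokenized, indexed, and stored; the value can be returned with hits
Field.UnIndexed(String name, String value)	No	No	Yes	Not tokenized, not indexed, stored, e.g. a file path
Field.UnStored(String name, String value)	Yes	Yes	No	Full-text indexed only, not stored
Field.UnStored(String name, String value, boolean storeTermVector)	Yes	Yes	No	Same as above; adds a flag that controls term vector storage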

Overall, Field construction comes in just four flavors: Text, Keyword, UnIndexed, and UnStored. Each one simply has several overloads.
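
The four families differ only in the three boolean flags they pass to the constructor. The sketch below spells that out for the String-valued forms. FieldFlagsSketch and Flags are hypothetical names, not Lucene classes, and the flag order (store, index, token) is taken from the Field(name, value, store, index, token) constructor in the source listing.

```java
// Illustrative model of the flags behind the four factory families;
// not Lucene code.
class FieldFlagsSketch {

    // The three storage and indexing flags of the six-tuple.
    static final class Flags {
        final boolean stored;
        final boolean indexed;
        final boolean tokenized;

        Flags(boolean stored, boolean indexed, boolean tokenized) {
            this.stored = stored;
            this.indexed = indexed;
            this.tokenized = tokenized;
        }

        // A field can be matched by a query only if it is indexed.
        boolean searchable() {
            return indexed;
        }

        // A field's original text can be read back from a hit only if it is stored.
        boolean retrievable() {
            return stored;
        }

        @Override
        public String toString() {
            return "stored=" + stored + ", indexed=" + indexed + ", tokenized=" + tokenized;
        }
    }

    // Same argument values as the String-valued factories in the listing.
    static Flags keyword()   { return new Flags(true, true, false); }
    static Flags unIndexed() { return new Flags(true, false, false); }
    static Flags text()      { return new Flags(true, true, true); }
    static Flags unStored()  { return new Flags(false, true, true); }

    public static void main(String[] args) {
        System.out.println("Keyword:   " + keyword());
        System.out.println("UnIndexed: " + unIndexed());
        System.out.println("Text:      " + text());
        System.out.println("UnStored:  " + unStored());
    }
}
```

Read this way, choosing a factory is a choice about two questions: should the field be searchable, and should its text come back with the search results.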

Let's write some code to exercise the Document and Field classes.

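// Assumption: doc.fields is package-private, so this test is taken to live in
// the same package as Document.
package org.apache.lucene.document;

import java.io.IOException;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.Searcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.store.RAMDirectory;

public class TestDocument
{
  // Builds one Document that uses each of the four field flavors.
  private Document makeDocumentWithFields() throws IOException
  {
    Document doc = new Document();
    doc.add(Field.Text("title","title"));
    doc.add(Field.Text("subject","subject"));
    doc.add(Field.Keyword("date","2005.11.12"));
    doc.add(Field.Keyword("url","www.tju.edu.cn"));
    doc.add(Field.UnIndexed("filepath","D://Lucene"));
    doc.add(Field.UnStored("unstored","This field is unstored"));

    // Print every field, then its six-tuple form.
    Field field;
    for(int i=0;i<doc.fields.size();i++)
    {
      field =(Field)doc.fields.get(i);
      System.out.println(field.toString());
      System.out.println("Corresponding six-tuple form:");
      System.out.println(field.toString2());
    }
    return doc;
  }

  // Indexes the document in memory, searches for it, and prints the first hit.
  public void GetValuesForIndexedDocument() throws IOException
  {
    RAMDirectory dir = new RAMDirectory();
    IndexWriter writer = new IndexWriter(dir,new StandardAnalyzer(),true);
    writer.addDocument(makeDocumentWithFields());
    writer.close();

    Searcher searcher = new IndexSearcher(dir);
    Query query = new TermQuery(new Term("title","title"));
    // Hits is made up of the matching Documents.
    Hits hits = searcher.search(query);
    System.out.println("Structure of the Document:");
    System.out.println(hits.doc(0));
    // (the source listing ends here)
  }
}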
