Lucene: Related Concepts and Definitions

 

This post describes some of the concepts and definitions used in Lucene.

 

Definitions

 

Analyzer - a Lucene class used to analyze the text that is being indexed. Most applications that work with English or other Latin-script languages can use StandardAnalyzer.
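To make "analyzing text" a little more concrete, here is a minimal sketch of the kind of work an analyzer does: splitting text into tokens and normalizing them. The class ToyAnalyzer and its analyze method are hypothetical names invented for this illustration; this is not how StandardAnalyzer is implemented, which applies considerably more elaborate rules.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Toy illustration only: this is NOT Lucene's StandardAnalyzer.
final class ToyAnalyzer {
    // Splits text on runs of non-letter, non-digit characters and lowercases each token.
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String piece : text.split("[^\\p{L}\\p{N}]+")) {
            if (!piece.isEmpty()) { // split() can yield a leading empty string
                tokens.add(piece.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(analyze("The Quick, brown FOX!"));
    }
}
```

The tokens produced this way are what end up as terms in the index, which is why the same analysis is normally applied at indexing time and at query time.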

 

Payloads - a payload is an array of bytes stored in the index at a specific position of a term.

 

Snowball Stemmers - the Snowball stemmers are one of the stemmer families included with Lucene. For more information, see the Snowball website.

 

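Stemmer - the explanation below comes from Wikipedia. In short, a stemmer reduces inflected or derived words to a common stem, which is often used to shrink the search space and the index size. The original text reads: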

 

 

"A stemming algorithm, or stemmer, is a computer program or algorithm for reducing inflected (or sometimes derived) words to their stem, base or root form — generally a written word form." Stemmers are often used to reduce the search space and index size. Often times a user searching for "widgets" is interested in documents that contain the term "widget".
 

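As a rough sketch of the idea in the quote above, the following toy stemmer strips a couple of common English plural suffixes so that "widgets" and "widget" map to the same stem. ToyStemmer is a hypothetical class made up for this post; real stemmers such as the Snowball ones use far richer rule sets, and this is not their algorithm.

```java
import java.util.Locale;

// Toy suffix-stripping stemmer for illustration; NOT the Snowball algorithm.
final class ToyStemmer {
    static String stem(String word) {
        String w = word.toLowerCase(Locale.ROOT);
        if (w.length() > 4 && w.endsWith("ies")) {
            return w.substring(0, w.length() - 3) + "y"; // e.g. "queries" becomes "query"
        }
        if (w.length() > 3 && w.endsWith("s") && !w.endsWith("ss")) {
            return w.substring(0, w.length() - 1);       // e.g. "widgets" becomes "widget"
        }
        return w; // anything else is left unchanged
    }

    public static void main(String[] args) {
        System.out.println(stem("widgets") + " " + stem("queries") + " " + stem("class"));
    }
}
```

Because both the indexed text and the query are stemmed the same way, a search for "widgets" can match documents that only contain "widget".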
 

 

Core classes

 

Document

A Lucene Document is a record in the index. A Document has a list of fields; each field has a name and a textual value.
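The shape described above (a record made of named fields with textual values) can be sketched in a few lines. ToyDocument and its methods are hypothetical names used only for this illustration; they mirror the description, not the actual Lucene Document API.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of "a list of fields, each with a name and a textual value"; NOT Lucene's Document.
final class ToyDocument {
    static final class Field {
        final String name;
        final String value;

        Field(String name, String value) {
            this.name = name;
            this.value = value;
        }
    }

    private final List<Field> fields = new ArrayList<>();

    // Appends a field and returns this document so calls can be chained.
    ToyDocument add(String name, String value) {
        fields.add(new Field(name, value));
        return this;
    }

    // Returns the value of the first field with the given name, or null if there is none.
    String get(String name) {
        for (Field f : fields) {
            if (f.name.equals(name)) {
                return f.value;
            }
        }
        return null;
    }

    int fieldCount() {
        return fields.size();
    }

    public static void main(String[] args) {
        ToyDocument doc = new ToyDocument().add("title", "Lucene notes").add("body", "terms and fields");
        System.out.println(doc.get("title"));
    }
}
```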

 

Term

Term is Lucene's unit of indexing. In western languages, a Term is often a word.

 

TermEnum

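TermEnum is used to enumerate all the terms in the index for a given field, regardless of which documents those terms occur in.

Some query subclasses are implemented by matching against the enumerated terms, for example WildcardQuery, PrefixQuery, and RangeQuery.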

 

 

 

Original text:

TermEnum is used to enumerate all terms in the index for a given field, regardless of which documents the terms occur in (or where they occur).

Some query subclasses are implemented by enumerating terms that match a pattern, and building a large OR query from the enumeration. E.g. WildcardQuery, PrefixQuery, RangeQuery.

See the LuceneFAQ entry "How do I retrieve all the values of a particular field that exists within an index, across all documents?", which also includes sample code.
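The enumeration idea can be illustrated without Lucene at all. The sketch below keeps a sorted set of terms per field, lists them, and expands a prefix into the matching terms, which is the set a prefix-style query would OR together. ToyTermDictionary and its methods are hypothetical names for this post; the code does not use or reproduce the real TermEnum class.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

// Toy stand-in for a per-field term dictionary; NOT Lucene's TermEnum.
final class ToyTermDictionary {
    // field name -> sorted set of the distinct terms seen in that field
    private final Map<String, TreeSet<String>> termsByField = new TreeMap<>();

    void add(String field, String term) {
        termsByField.computeIfAbsent(field, f -> new TreeSet<>()).add(term);
    }

    // Enumerates every term of a field in sorted order, ignoring which documents contain it.
    List<String> terms(String field) {
        return new ArrayList<>(termsByField.getOrDefault(field, new TreeSet<>()));
    }

    // Expands a prefix into the matching terms; a prefix query would OR these together.
    List<String> expandPrefix(String field, String prefix) {
        List<String> matches = new ArrayList<>();
        for (String term : termsByField.getOrDefault(field, new TreeSet<>()).tailSet(prefix)) {
            if (!term.startsWith(prefix)) {
                break; // sorted order: once a term stops matching, no later term can match
            }
            matches.add(term);
        }
        return matches;
    }

    public static void main(String[] args) {
        ToyTermDictionary dict = new ToyTermDictionary();
        dict.add("title", "lucene");
        dict.add("title", "luke");
        dict.add("title", "solr");
        System.out.println(dict.expandPrefix("title", "lu"));
    }
}
```

Keeping the terms sorted is what makes the prefix expansion cheap: the matching terms form one contiguous run, so the loop can stop at the first non-match.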

 

 

 

TermDocs

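Unlike TermEnum (see above), TermDocs is used to determine which documents contain a given Term. TermDocs also provides the frequency with which the term occurs in each of those documents.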
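Conceptually this is a lookup from a term to the documents that contain it, together with a per-document count. Here is a minimal sketch of such a structure; ToyPostings, addDocument and docsFor are hypothetical names chosen for this example, and the code is not the TermDocs class itself.

```java
import java.util.Map;
import java.util.TreeMap;

// Toy postings structure for illustration; NOT Lucene's TermDocs.
final class ToyPostings {
    // term -> (document id -> number of occurrences of the term in that document)
    private final Map<String, TreeMap<Integer, Integer>> postings = new TreeMap<>();

    // Records every term occurrence of one document.
    void addDocument(int docId, String... terms) {
        for (String term : terms) {
            postings.computeIfAbsent(term, t -> new TreeMap<>()).merge(docId, 1, Integer::sum);
        }
    }

    // Returns the documents containing the term, each with its within-document frequency.
    Map<Integer, Integer> docsFor(String term) {
        return postings.getOrDefault(term, new TreeMap<>());
    }

    public static void main(String[] args) {
        ToyPostings index = new ToyPostings();
        index.addDocument(0, "lucene", "index", "lucene");
        index.addDocument(1, "index");
        System.out.println(index.docsFor("lucene"));
    }
}
```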

 

TermFreqVector

TermFreqVector (aka Term Frequency Vector or just Term Vector) is a data structure containing a given Document's term and frequency information and can be retrieved from the IndexReader only when Term Vectors are stored during indexing.

 

Put simply, a TermFreqVector is a data structure that holds a given Document's terms together with their frequencies.
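Where TermDocs goes from a term to documents, a term vector goes the other way: from one document to its terms and their counts. A minimal sketch, assuming the document has already been split into tokens; ToyTermVector is a hypothetical helper for this post and not the TermFreqVector interface.

```java
import java.util.Map;
import java.util.TreeMap;

// Toy per-document term frequency table; NOT Lucene's TermFreqVector.
final class ToyTermVector {
    // Counts how often each token occurs in a single document's token stream.
    static Map<String, Integer> of(String... tokens) {
        Map<String, Integer> freqs = new TreeMap<>(); // sorted by term for stable output
        for (String token : tokens) {
            freqs.merge(token, 1, Integer::sum);
        }
        return freqs;
    }

    public static void main(String[] args) {
        System.out.println(of("to", "be", "or", "not", "to", "be"));
    }
}
```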

 

 

Directory

 

IndexReader

 

IndexSearcher

 
