Full-Text Search Engine Solr Series: Solr Core Concepts and Configuration Files

Document

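A Document is the most basic unit that Solr indexes (the verb, indexing) and searches. It is similar to a row in a relational database table and can contain one or more fields (Field), each with a name and a text value. A field can be stored in the index at the same time it is indexed, so that its value can be returned at search time. A document should normally include an id field that uniquely identifies it. For example: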

<doc>
    <field name="id">company123</field>
    <field name="companycity">Atlanta</field>
    <field name="companystate">Georgia</field>
    <field name="companyname">Code Monkeys R Us, LLC</field>
    <field name="companydescription">we write lots of code</field>
    <field name="lastmodified">2013-06-01T15:26:37Z</field>
</doc>
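
As a rough sketch of how such a document reaches the index: Solr's XML update format wraps one or more doc elements in an add element, which is then posted to the update handler. The field values below are reused from the example above; how the message is sent (URL, core name, client tool) depends on your setup and is not shown here.

```xml
<!-- Sketch of an XML update message; field values reused from the example above -->
<add>
    <doc>
        <field name="id">company123</field>
        <field name="companycity">Atlanta</field>
        <field name="companyname">Code Monkeys R Us, LLC</field>
    </doc>
</add>
```

Posting another document with the same id would replace this one, assuming id is the schema's unique key.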

Schema

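A Schema in Solr is similar to a table structure in a relational database. It lives in the conf directory as a text file named schema.xml, and a Schema is required when adding documents to the index. A Schema file has three main parts: fields (Field), field types (FieldType), and the unique key (uniqueKey).

Field type (FieldType): defines the type of a field (Field) in the XML documents added to the index, such as int, string, or date.
Field (Field): the name of a field as it is added to the index.
Unique key (uniqueKey): the field (Field) that uniquely identifies a document; it is used when updating and deleting.

For example: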

<schema name="example" version="1.5">

    <field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false"/>
    <field name="title" type="text_general" indexed="true" stored="true" multiValued="true"/>

    <uniqueKey>id</uniqueKey>

    <fieldType name="string" class="solr.StrField" sortMissingLast="true"/>

    <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
        <analyzer type="index">
            <tokenizer class="solr.StandardTokenizerFactory"/>
            <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
            <!-- in this example, we will only use synonyms at query time
            <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
            -->
            <filter class="solr.LowerCaseFilterFactory"/>
        </analyzer>
        <analyzer type="query">
            <tokenizer class="solr.StandardTokenizerFactory"/>
            <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
            <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
            <filter class="solr.LowerCaseFilterFactory"/>
        </analyzer>
    </fieldType>

</schema>

Field

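In Solr, a field (Field) is the basic unit that makes up a Document, and it corresponds to a column in a database table. A field is metadata comprising a name, a type, and rules for how the field's value is processed. For example: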

<field name="name" type="text_general" indexed="true" stored="true"/>

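indexed: when indexed="true", Solr processes the field and adds it to the index; only indexed fields can be searched.
stored: when stored="true", a copy of the field's original content is kept in the index and can be returned by search components. For performance reasons, long text is usually not a good candidate for storing in the index.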
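
To make the two flags concrete, here is a small sketch of how they might be combined. The field names (title, body, thumbnail_url) are hypothetical and chosen only for illustration; the types are the ones defined in the Schema example earlier.

```xml
<!-- Hypothetical fields illustrating indexed/stored combinations -->

<!-- Searchable, and its value is returned in results -->
<field name="title" type="text_general" indexed="true" stored="true"/>

<!-- Searchable, but the long original text is not kept for retrieval -->
<field name="body" type="text_general" indexed="true" stored="false"/>

<!-- Returned in results, but cannot be searched on -->
<field name="thumbnail_url" type="string" indexed="false" stored="true"/>
```

The second combination is the one the performance note above points to: the text can still be matched by queries, while the stored copy is left out of the index.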

Field Type

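Every field in Solr has a corresponding field type, such as float, long, double, date, or text. Solr ships with a rich set of field types, and you can also define custom types that suit your own data. For example: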

<!-- IK tokenizer -->
<fieldType name="text_cn_stopword" class="solr.TextField">
    <analyzer type="index">
        <tokenizer class="org.wltea.analyzer.lucene.IKAnalyzerSolrFactory" useSmart="false"/>
    </analyzer>
    <analyzer type="query">
        <tokenizer class="org.wltea.analyzer.lucene.IKAnalyzerSolrFactory" useSmart="true"/>
    </analyzer>
</fieldType>
<!-- IK tokenizer -->
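
A custom field type only takes effect once a field refers to it by name. As a minimal sketch, a field using the type above might be declared like this; the field name content is an assumption made for illustration and does not come from the original configuration.

```xml
<!-- Hypothetical field that uses the custom text_cn_stopword type -->
<field name="content" type="text_cn_stopword" indexed="true" stored="true"/>
```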

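Solrconfig

If the Schema is Solr's Model, then Solrconfig is Solr's Configuration. It defines how Solr handles indexing, highlighting, searching, and many other requests, and it also specifies the caching strategy. Commonly used elements include:

Specifying the index data path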

<!--
    Used to specify an alternate directory to hold all index data
    other than the default ./data under the Solr home.
    If replication is in use, this should match the replication configuration.
-->
<dataDir>${solr.data.dir:./solr/data}</dataDir>

Cache parameters

<filterCache class="solr.FastLRUCache" size="512" initialSize="512" autowarmCount="0"/>

<!--
    queryResultCache caches results of searches - ordered lists of
    document ids (DocList) based on a query, a sort, and the range
    of documents requested.
-->
<queryResultCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="0"/>

<!--
    documentCache caches Lucene Document objects (the stored fields for
    each document). Since Lucene internal document ids are transient,
    this cache will not be autowarmed.
-->
<documentCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="0"/>
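
In the settings above, autowarmCount="0" means no entries are carried over when a new searcher is opened. As a sketch of the alternative, the fragment below asks Solr to pre-populate the new filter cache from the most recently used entries of the old one. The numbers are illustrative assumptions, not tuned recommendations.

```xml
<!-- Illustrative values only: a larger filter cache that autowarms 128 entries -->
<filterCache class="solr.FastLRUCache" size="1024" initialSize="512" autowarmCount="128"/>
```

Higher autowarm counts tend to make the first queries after a commit faster at the cost of a slower searcher startup, so suitable values depend on the workload.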

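Request handlers
A request handler receives an HTTP request, performs the search, and returns the response. For example, the handler for query requests: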

<!-- A request handler that returns indented JSON by default -->
<requestHandler name="/query" class="solr.SearchHandler">
    <lst name="defaults">
        <str name="echoParams">explicit</str>
        <str name="wt">json</str>
        <str name="indent">true</str>
        <str name="df">text</str>
    </lst>
</requestHandler>

Each request handler includes a set of configurable search parameters, such as wt, indent, and df.
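
As a sketch of how these parameters can be varied, the handler below is a hypothetical variant: the name /companysearch and the choice of companyname as the default field are assumptions for illustration. Here wt selects the response format, indent toggles pretty-printing, df names the default search field, and rows sets how many results are returned.

```xml
<!-- Hypothetical handler: the name and default field are illustrative assumptions -->
<requestHandler name="/companysearch" class="solr.SearchHandler">
    <lst name="defaults">
        <str name="wt">json</str>
        <str name="indent">true</str>
        <!-- Search this field when the query does not name one -->
        <str name="df">companyname</str>
        <!-- Return at most 20 results unless the request says otherwise -->
        <int name="rows">20</int>
    </lst>
</requestHandler>
```

Values in the defaults list apply only when a request does not supply the parameter itself, so a client can still pass its own wt or rows.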
