Usage with XML/HTTP Datasource

The DataImportHandler (DIH) can be used to index data from HTTP-based data sources. This includes indexing from REST/XML APIs as well as from RSS/Atom feeds.

Configuration of URLDataSource or HttpDataSource

In Solr 1.4, HttpDataSource is deprecated in favor of URLDataSource.

Both URLDataSource (Solr 1.4 and later) and HttpDataSource are declared with a dataSource element in the data config XML.

The extra attributes specific to this data source are:

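baseUrl (optional): Use it when the host/port changes between dev/QA/prod environments. Using this attribute isolates the changes that have to be made to solrconfig.xml.

encoding (optional): By default, the encoding in the response header is used. Use this attribute to override the default encoding.

connectionTimeout (optional): The default value is 5000 ms.

readTimeout (optional): The default value is 10000 ms.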
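As a rough illustration of how these attributes fit together, a dataSource declaration might look like the sketch below. The data source name and the host and port in baseUrl are placeholders, not values from a real deployment, and the two timeout values simply spell out the defaults listed above.

```xml
<!-- Hypothetical URLDataSource declaration; name and baseUrl are placeholders. -->
<dataSource name="a"
            type="URLDataSource"
            baseUrl="http://host:port/"
            encoding="UTF-8"
            connectionTimeout="5000"
            readTimeout="10000" />
```

With baseUrl set this way, an entity's url attribute should only need to carry the path part, so this one declaration is the only place that changes between environments.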

Configuration in data-config.xml

An entity for an XML/HTTP data source can have the following attributes on top of the default attributes:

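processor (required): The value must be XPathEntityProcessor.

url (required): The URL used to invoke the REST API (can be templatized). If the data source is a file, this must be the file location.

stream (optional): Set this to true if the XML is very large.

forEach (required): The xpath expression that demarcates a record. If there are multiple types of records, separate them with "|" (pipe). It can be omitted if useSolrAddSchema is set to "true".

xsl (optional): This is used as a preprocessor that applies an XSL transformation. Provide the full path in the file system, or a URL.

useSolrAddSchema (optional): Set this to "true" if the XML fed into this processor has the same schema as the Solr add XML. No fields need to be declared if it is set to true.

flatten (optional): If set to true, the text under all the tags is extracted into one field, irrespective of the tag name. (Solr 1.4)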

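The entity fields can have the following attributes (on top of the default attributes):

xpath (optional): The xpath expression of the field to be mapped as a column in the record. It can be omitted if the column does not come from the XML itself (that is, it is a synthetic field created by a transformer). If a field is marked as multivalued in the schema and multiple values are found for the given xpath in a row, XPathEntityProcessor handles this automatically. No extra configuration is required.

commonField: Can be (true|false). If true, this field, once encountered in a record, is copied to other records before the Solr document is created.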

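If an API supports chunking (for when the dataset is too large), multiple calls are needed to complete the process. XPathEntityProcessor supports this through a transformer. If the transformer returns a row that contains a field $hasMore with the value "true", the processor makes another request with the same URL template (the actual values are recomputed before the call). A transformer can also pass a totally new URL for the next call by returning a row that contains a field $nextUrl, whose value must be the complete URL for the next call.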
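To make the chunking contract more concrete, here is a minimal sketch that uses DIH's ScriptTransformer to set the special fields. The API itself is entirely assumed for illustration: the example.com URL, the /response/record layout, a nextPage element holding the full URL of the next chunk, and the function name addNextUrl are all hypothetical. The sketch sets $hasMore alongside $nextUrl, on the assumption that the processor only issues the follow-up request when $hasMore is "true".

```xml
<dataConfig>
  <!-- Hypothetical transformer function; it runs once per extracted row. -->
  <script><![CDATA[
    function addNextUrl(row) {
      // 'nextPage' is an assumed column holding the full URL of the next chunk.
      var next = row.get('nextPage');
      if (next != null) {
        row.put('$hasMore', 'true');  // ask the processor for another request
        row.put('$nextUrl', next);    // complete URL to use for that request
      }
      return row;
    }
  ]]></script>
  <dataSource type="URLDataSource" />
  <document>
    <entity name="records"
            processor="XPathEntityProcessor"
            url="http://example.com/api/records"
            forEach="/response/record"
            transformer="script:addNextUrl">
      <field column="id"       xpath="/response/record/id" />
      <field column="nextPage" xpath="/response/record/nextPage" />
    </entity>
  </document>
</dataConfig>
```

When the last chunk carries no nextPage value, neither special field is set and the import for this entity simply ends.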

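XPathEntityProcessor implements a streaming parser that supports a subset of xpath syntax. Complete xpath syntax is not supported, but most of the common use cases are covered, as follows: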

xpath="/a/b/subject[@qualifier='fullTitle']"

xpath="/a/b/subject/@qualifier"

xpath="/a/b/c"

xpath="//a/..."

xpath="/a//b..."
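For illustration, consider a made-up XML fragment with the same shape as the first three expressions above. The element names a, b, c and subject come from those expressions; the text values and column names are invented.

```xml
<!-- Hypothetical input document -->
<a>
  <b>
    <subject qualifier="fullTitle">Indexing with Solr</subject>
    <subject qualifier="shortTitle">Solr</subject>
    <c>some text</c>
  </b>
</a>

<!-- Hypothetical field declarations for an entity with forEach="/a/b" -->
<field column="fullTitle"  xpath="/a/b/subject[@qualifier='fullTitle']" />
<field column="qualifiers" xpath="/a/b/subject/@qualifier" />
<field column="c"          xpath="/a/b/c" />
```

Under these assumptions, fullTitle should receive only the text of the subject element whose qualifier attribute is "fullTitle", qualifiers should collect both attribute values, and c should receive the text of the c element.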

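HttpDataSource Example

In Solr 1.4, HttpDataSource is deprecated in favor of URLDataSource.

Download the full import example given in the DB section to try this out. We will index the Slashdot RSS feed for this example.

The data config for this example looks like this: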

<dataConfig>
        <dataSource type="HttpDataSource" />
        <document>
                <entity name="slashdot"
                        pk="link"
                        url="http://rss.slashdot.org/Slashdot/slashdot"
                        processor="XPathEntityProcessor"
                        forEach="/RDF/channel | /RDF/item"
                        transformer="DateFormatTransformer">

                        <!-- Channel-level fields, copied into every item row -->
                        <field column="source"       xpath="/RDF/channel/title"   commonField="true" />
                        <field column="source-link"  xpath="/RDF/channel/link"    commonField="true" />
                        <field column="subject"      xpath="/RDF/channel/subject" commonField="true" />

                        <!-- Item-level fields -->
                        <field column="title"        xpath="/RDF/item/title" />
                        <field column="link"         xpath="/RDF/item/link" />
                        <field column="description"  xpath="/RDF/item/description" />
                        <field column="creator"      xpath="/RDF/item/creator" />
                        <field column="item-subject" xpath="/RDF/item/subject" />

                        <field column="slash-department" xpath="/RDF/item/department" />
                        <field column="slash-section"    xpath="/RDF/item/section" />
                        <field column="slash-comments"   xpath="/RDF/item/comments" />
                        <!-- HH (24-hour clock) rather than hh (12-hour clock) -->
                        <field column="date" xpath="/RDF/item/date" dateTimeFormat="yyyy-MM-dd'T'HH:mm:ss" />
                </entity>
        </document>
</dataConfig>

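This data config is where the action is. If you look at the structure of the Slashdot RSS, it has a few header elements such as title, link and subject. These are mapped to the Solr fields source, source-link and subject respectively, using xpath syntax. The feed also contains multiple item elements that hold the actual news items. So what we want to do is create one document in Solr for each "item".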

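XPathEntityProcessor is designed to stream through the XML row by row (think of a row as the various fields in an XML element). It uses the forEach attribute to identify a row. In this example, forEach has the value '/RDF/channel | /RDF/item', which says that this XML has two types of rows. (This uses the xpath syntax for OR, and there can be more than one type of row.) After it encounters a row, it tries to read as many fields as there are in the field declarations. So in this case, when it reads the row '/RDF/channel', it may get 3 fields: 'source', 'source-link' and 'subject'. After it processes the row, it realizes that it has no value for the 'pk' field, so it does not try to create a Solr document for this row (even if it tried, it might fail in Solr). But all 3 fields are marked commonField="true", so it keeps their values for subsequent rows.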

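It then moves ahead, encounters /RDF/item and processes those rows one by one. It gets values for all the fields except the 3 fields from the channel row. But since those were marked as common fields, the processor puts them into the record just before creating the document.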
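To see how rows and common fields play out, here is a heavily trimmed, made-up feed in the shape the config expects. The titles, links and names are invented, and the real Slashdot feed carries many more elements than shown.

```xml
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns="http://purl.org/rss/1.0/"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
  <!-- Row 1 (/RDF/channel): no "link" column is extracted here, so no
       document is created; title, link and subject are remembered as the
       common fields source, source-link and subject. -->
  <channel>
    <title>Example Feed</title>
    <link>http://example.com/</link>
    <dc:subject>Technology</dc:subject>
  </channel>
  <!-- Row 2 (/RDF/item): becomes one Solr document, with the remembered
       common fields added to it. Namespace prefixes such as rdf: and dc:
       are left out of the xpaths in the config. -->
  <item>
    <title>First story</title>
    <link>http://example.com/story/1</link>
    <dc:creator>alice</dc:creator>
    <dc:date>2013-01-02T15:04:05+00:00</dc:date>
  </item>
</rdf:RDF>
```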

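What about the transformer=DateFormatTransformer attribute in the entity? It parses the date column into a date value using the dateTimeFormat declared on that field.

You can use this feature to index from REST APIs such as RSS/Atom feeds, XML data feeds, other Solr servers, or even well-formed XHTML documents. Our xpath support has its limitations (no wildcards, only full paths, and so on), but we have tried to make sure that the common use cases are covered. Because it is based on a streaming parser, it is extremely fast and consumes a constant amount of memory even for large XML. It does not support namespaces, but it can handle XML that uses namespaces: when you provide the xpath, just drop the namespace prefix and give the rest (for example, if the tag is "<dc:subject>", the mapping should contain just "subject"). Easy, isn't it? And you didn't need to write a single line of code. Enjoy!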

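Note: Unlike with a database, it is not possible to omit the field declarations when using XPathEntityProcessor. It relies on the xpaths declared in the fields to identify what to extract from the XML.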

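Example: Indexing Wikipedia

The following data-config.xml was used to index a full (en-articles, recent only) Wikipedia dump. The file downloaded from Wikipedia was pages-articles.xml.bz2, which takes up roughly 40GB on disk when uncompressed.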

<dataConfig>
        <dataSource type="FileDataSource" encoding="UTF-8" />
        <document>
        <entity name="page"
                processor="XPathEntityProcessor"
                stream="true"
                forEach="/mediawiki/page/"
                url="/data/enwiki-20130102-pages-articles.xml"
                transformer="RegexTransformer,DateFormatTransformer"
                >
            <field column="id"        xpath="/mediawiki/page/id" />
            <field column="title"     xpath="/mediawiki/page/title" />
            <field column="revision"  xpath="/mediawiki/page/revision/id" />
            <field column="user"      xpath="/mediawiki/page/revision/contributor/username" />
            <field column="userId"    xpath="/mediawiki/page/revision/contributor/id" />
            <field column="text"      xpath="/mediawiki/page/revision/text" />
            <!-- HH (24-hour clock) rather than hh (12-hour clock) -->
            <field column="timestamp" xpath="/mediawiki/page/revision/timestamp" dateTimeFormat="yyyy-MM-dd'T'HH:mm:ss'Z'" />
            <!-- Skip pages whose text is just a redirect -->
            <field column="$skipDoc"  regex="^#REDIRECT .*" replaceWith="true" sourceColName="text"/>
        </entity>
        </document>
</dataConfig>

The relevant portion of schema.xml is below:

<field name="id"        type="string"  indexed="true" stored="true" required="true"/>
<field name="title"     type="string"  indexed="true" stored="false"/>
<field name="revision"  type="int"     indexed="true" stored="true"/>
<field name="user"      type="string"  indexed="true" stored="true"/>
<field name="userId"    type="int"     indexed="true" stored="true"/>
<field name="text"      type="text_en" indexed="true" stored="false"/>
<field name="timestamp" type="date"    indexed="true" stored="true"/>
<field name="titleText" type="text_en" indexed="true" stored="true"/>
...
<uniqueKey>id</uniqueKey>
<copyField source="title" dest="titleText"/>

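Indexing 8,338,182 articles took about 50 minutes, with peak memory usage of about 4GB. This test was run on Solr 4.3.1 with ramBufferSizeMB set to 256MB. The Wikipedia dump file was on a Seagate 7200RPM hard disk, and the Solr index files were on a Corsair Force GT solid-state drive.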

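Note that many Wikipedia articles are merely redirects to other articles; using $skipDoc (Solr 1.4) allows those articles to be ignored. Also, the column $skipDoc is only defined when the regexp matches.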
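As an illustration, a redirect page in the dump has roughly the following shape. The ids and titles here are invented placeholders, and real pages carry more elements.

```xml
<!-- Hypothetical redirect page: its text starts with "#REDIRECT ", so the
     regex on the $skipDoc field matches, $skipDoc becomes "true", and the
     page should be left out of the index. -->
<page>
  <title>Some Old Title</title>
  <id>1</id>
  <revision>
    <id>2</id>
    <text>#REDIRECT [[Some New Title]]</text>
  </revision>
</page>
```

For an ordinary article the regex does not match, so $skipDoc is never set and the page is indexed as usual.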

Notes (a data config that combines a JDBC data source with a URLDataSource):

<dataConfig>
  <propertyWriter dateFormat="yyyy-MM-dd HH:mm:ss" type="SimplePropertiesWriter" filename="my_dih.properties" locale="zh-CN" />
  <dataSource name="jdbcDS" type="JdbcDataSource"
              driver="oracle.jdbc.driver.OracleDriver"
              url="jdbc:oracle:thin:xxxx"
              user="xx"
              password="U2FsdGVkX1/PqBuNUFBIcmLKTb+y41YB6J7b6tAm8Xw="
              encryptKeyFile="/var/solr/data/dih-encryptionkey"
              />
  <dataSource name="urlDS" type="URLDataSource"/>
  <document>
    <!--
    <entity name="id"
            query="select id,name,section,subject from CLASS_TYPE">
      <field column="ID" name="id"/>
      <field column="NAME" name="solr_name"/>
      <field column="SECTION" name="solr_section"/>
      <field column="SUBJECT" name="subject_s"/>
    </entity>
    -->
    <!--
    <entity name="info" transformer="DateFormatTransformer"
            query="select R_CODE,R_TITLE,R_KS_ID,R_DESC,R_TYPECODE,R_FORMAT,SOLR_LAST_DATE FROM RMS_RESOURCEINFO"
            deltaImportQuery="select R_CODE,R_TITLE,R_KS_ID,R_DESC,R_TYPECODE,R_FORMAT,SOLR_LAST_DATE FROM RMS_RESOURCEINFO where R_CODE='${dataimporter.delta.R_CODE}'"
            deltaQuery="SELECT R_CODE FROM RMS_RESOURCEINFO WHERE SOLR_LAST_DATE>TO_DATE('${dataimporter.last_index_time}','yyyy-mm-dd hh24:mi:ss')">
      <field column="R_CODE" name="id"/>
      <field column="R_TITLE" name="rtitle_txt_cjk"/>
      <field column="R_KS_ID" name="ksid_s"/>
      <field column="R_DESC" name="rdesc_txt_cjk"/>
      <field column="R_TYPECODE" name="rtypecode_s"/>
      <field column="R_FORMAT" name="rformat_s"/>
      <field column="SOLR_LAST_DATE" dateTimeFormat="yyyy-MM-dd HH:mm:ss" name="lastDate_dt"/>
    </entity>
    -->
    <!-- dataSource is named explicitly because both data sources above have names -->
    <entity name="slashdot"
            dataSource="urlDS"
            pk="link"
            url="http://192.168.119.10/interfaces/slashdot.xml"
            processor="XPathEntityProcessor"
            forEach="/RDF/channel | /RDF/item"
            transformer="DateFormatTransformer">

      <field column="source" name="source_txt_cjk" xpath="/RDF/channel/title"   commonField="true" />
      <field column="source-link_s"                xpath="/RDF/channel/link"    commonField="true" />
      <field column="subject_txt_cjk"              xpath="/RDF/channel/subject" commonField="true" />

      <field column="title_txt_cjk"         xpath="/RDF/item/title" />
      <field column="link" name="id"        xpath="/RDF/item/link" />
      <field column="description_txt_cjk"   xpath="/RDF/item/description" />
      <field column="creator_s"             xpath="/RDF/item/creator" />
      <field column="item-subject__txt_cjk" xpath="/RDF/item/subject" />

      <field column="slash-department_s" xpath="/RDF/item/department" />
      <field column="slash-section_s"    xpath="/RDF/item/section" />
      <field column="slash-comments_s"   xpath="/RDF/item/comments" />
      <!-- HH (24-hour clock) rather than hh (12-hour clock) -->
      <field column="date_dt" xpath="/RDF/item/date" dateTimeFormat="yyyy-MM-dd'T'HH:mm:ss" />
    </entity>
  </document>
</dataConfig>
