Search Engine: Nutch 1.4 + Solr 1.4 Integration (Working)

Original

j2mvc

2020-06-11 18:10

Nutch 1.4 + Solr integration, using these packages:

apache-nutch-1.4-bin.tar.gz

apache-solr-4.0-2012-05-08_08-10-03.tgz (note: despite "solr1.4" in the title, this archive name is a Solr 4.0 nightly build)

Test environment: CentOS 6.2

1. Extract the Nutch 1.4 and Solr archives (optional: set NUTCH_HOME and SOLR_HOME to the extracted directories).
2. Change the last line of $NUTCH_HOME/runtime/local/conf/regex-urlfilter.txt to
+^http://www.163.com/
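The edit in step 2 can also be scripted. The following is a minimal sketch, not part of the original write-up: it assumes NUTCH_HOME points at the extracted Nutch directory, and when that variable is unset it falls back to a scratch directory (./nutch-demo, a made-up name) with a stand-in filter file, so it can be tried without an install.

```shell
# Assumption: NUTCH_HOME points at the extracted Nutch directory.
# If it is unset, the sketch works on a scratch copy under ./nutch-demo instead.
CONF_DIR="${NUTCH_HOME:-./nutch-demo}/runtime/local/conf"
FILTER_FILE="$CONF_DIR/regex-urlfilter.txt"

if [ -z "${NUTCH_HOME:-}" ]; then
  # Scratch mode: create a small stand-in file that ends in a catch-all rule.
  mkdir -p "$CONF_DIR"
  printf '%s\n' '# skip file: and mailto: urls' '-^(file|mailto):' '+.' > "$FILTER_FILE"
fi

# Keep a backup, then rewrite only the last line ('$' is sed's last-line address).
cp "$FILTER_FILE" "$FILTER_FILE.bak"
sed '$ s|.*|+^http://www.163.com/|' "$FILTER_FILE.bak" > "$FILTER_FILE"

tail -n 1 "$FILTER_FILE"   # show the rule that is now in effect
```

The backup copy (regex-urlfilter.txt.bak) makes it easy to restore the original rule if the crawl should later be widened again.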
3. Add the following inside the <configuration> element of $NUTCH_HOME/runtime/local/conf/nutch-site.xml
<property>
        <name>http.agent.name</name>
        <value>My Nutch Agent</value>
        <description>HTTP 'User-Agent' request header. MUST NOT be empty -
          please set this to a single word uniquely related to your organization.
          NOTE: You should also check other related properties:
                http.robots.agents
                http.agent.description
                http.agent.url
                http.agent.email
                http.agent.version
                and set their values appropriately.
        </description>
</property>
<property>
        <name>http.agent.description</name>
        <value></value>
        <description>Further description of our bot- this text is used in
        the User-Agent header. It appears in parenthesis after the agent name.
        </description>
</property>
<property>
        <name>http.agent.url</name>
        <value></value>
        <description>A URL to advertise in the User-Agent header. This will
        appear in parenthesis after the agent name. Custom dictates that this
        should be a URL of a page explaining the purpose and behavior of this crawler.
        </description>
</property>
<property>
        <name>http.agent.email</name>
        <value></value>
        <description>An email address to advertise in the HTTP 'From' request
        header and User-Agent header. A good practice is to mangle this
        address (e.g. 'info at example dot com') to avoid spamming.
        </description>
</property>
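Since the description above says http.agent.name must not be empty, a quick check before crawling can save a failed run. This is a rough sketch only: it relies on the `<value>` line directly following the `<name>` line, as in the snippet above, and it is not a real XML parser. The variable names and the ./nutch-demo fallback directory are assumptions made for illustration.

```shell
# Assumption: NUTCH_HOME points at the extracted Nutch directory.
# If it is unset, a stand-in nutch-site.xml is written under ./nutch-demo.
NUTCH_CONF="${NUTCH_HOME:-./nutch-demo}/runtime/local/conf"
SITE_XML="$NUTCH_CONF/nutch-site.xml"

if [ -z "${NUTCH_HOME:-}" ]; then
  mkdir -p "$NUTCH_CONF"
  cat > "$SITE_XML" <<'EOF'
<?xml version="1.0"?>
<configuration>
<property>
        <name>http.agent.name</name>
        <value>My Nutch Agent</value>
</property>
</configuration>
EOF
fi

# Take the <name> line plus the line after it, and require text inside <value>.
if grep -A1 '<name>http.agent.name</name>' "$SITE_XML" | grep -q '<value>[^<]'; then
  AGENT_STATUS="set"
else
  AGENT_STATUS="missing"
fi
echo "http.agent.name is $AGENT_STATUS"
```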
4. Under $NUTCH_HOME/runtime/local, create
urls/seed.txt
with the content
http://www.163.com
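Step 4 as commands, in a minimal sketch. It assumes NUTCH_HOME is set as in step 1; if it is not, the file is written under a scratch directory (./nutch-demo, a made-up name) instead.

```shell
# Assumption: NUTCH_HOME points at the extracted Nutch directory;
# otherwise fall back to a scratch directory so the sketch still runs.
LOCAL_DIR="${NUTCH_HOME:-./nutch-demo}/runtime/local"

mkdir -p "$LOCAL_DIR/urls"

# One seed URL per line; '>' overwrites any earlier seed list.
printf '%s\n' 'http://www.163.com' > "$LOCAL_DIR/urls/seed.txt"

cat "$LOCAL_DIR/urls/seed.txt"
```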
5. Take the <schema></schema> element from $NUTCH_HOME/runtime/local/conf/schema-solr4.xml
and use it to replace the <schema></schema> element in
$SOLR_HOME/example/solr/conf/schema.xml
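One way to carry out step 5 is to copy the whole Nutch schema file over Solr's schema.xml after backing up the original. Note that this swaps in a whole-file copy for the element-level replacement described above; the two are only equivalent if the files contain nothing of interest outside the schema element. The sketch assumes NUTCH_HOME and SOLR_HOME are both set; if either is missing it runs against placeholder files in scratch directories (./nutch-demo and ./solr-demo, made-up names).

```shell
# Assumption: NUTCH_HOME and SOLR_HOME point at the extracted directories.
if [ -z "${NUTCH_HOME:-}" ] || [ -z "${SOLR_HOME:-}" ]; then
  # Scratch mode: placeholder schema files, only so the sketch can run.
  NUTCH_HOME=./nutch-demo
  SOLR_HOME=./solr-demo
  mkdir -p "$NUTCH_HOME/runtime/local/conf" "$SOLR_HOME/example/solr/conf"
  printf '%s\n' '<schema name="nutch"></schema>' \
    > "$NUTCH_HOME/runtime/local/conf/schema-solr4.xml"
  printf '%s\n' '<schema name="example"></schema>' \
    > "$SOLR_HOME/example/solr/conf/schema.xml"
fi

NUTCH_SCHEMA="$NUTCH_HOME/runtime/local/conf/schema-solr4.xml"
SOLR_SCHEMA="$SOLR_HOME/example/solr/conf/schema.xml"

# Back up Solr's own schema, then put the Nutch schema in its place.
cp "$SOLR_SCHEMA" "$SOLR_SCHEMA.orig"
cp "$NUTCH_SCHEMA" "$SOLR_SCHEMA"

# cmp -s is silent and succeeds only if the two files are identical.
cmp -s "$NUTCH_SCHEMA" "$SOLR_SCHEMA" && echo "schema.xml replaced"
```

Solr reads schema.xml at startup, so the copy should be done before Solr is started in step 7.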

6. Add stopwords_en.txt to the conf directory (presumably $SOLR_HOME/example/solr/conf from step 5).
7. Change to the $SOLR_HOME/example/ directory and start Solr
java -jar start.jar
Then open http://localhost:8983/solr in a browser.
8. Under $NUTCH_HOME/runtime/local, run
   bin/nutch crawl urls -solr http://localhost:8983/solr/ -dir crawl -depth 10 -threads 5 -topN 100
   If you have run the crawl before, delete the crawl directory first: rm -rf crawl
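The clean-up and the crawl can be combined into one re-runnable script. The sketch below uses the same crawl arguments as step 8. It assumes NUTCH_HOME is set; when bin/nutch cannot be found (for example because NUTCH_HOME is unset), it only prints the command it would have run. CRAWL_CMD and NUTCH_LOCAL are names introduced here for illustration.

```shell
# Assumption: NUTCH_HOME points at the extracted Nutch directory.
NUTCH_LOCAL="${NUTCH_HOME:-./nutch-demo}/runtime/local"
SOLR_URL="http://localhost:8983/solr/"

# A previous run leaves a crawl directory behind; remove it first.
rm -rf "$NUTCH_LOCAL/crawl"

# Same arguments as in step 8 (this overwrites the script's positional parameters).
set -- crawl urls -solr "$SOLR_URL" -dir crawl -depth 10 -threads 5 -topN 100
CRAWL_CMD="bin/nutch $*"

if [ -x "$NUTCH_LOCAL/bin/nutch" ]; then
  # Run from runtime/local so the relative paths urls and crawl resolve there.
  (cd "$NUTCH_LOCAL" && bin/nutch "$@")
else
  echo "dry run, bin/nutch not found: $CRAWL_CMD"
fi
```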
9. Wait for the crawl to finish, then click query on the http://localhost:8983/solr page.
Enter content:新聞 in the q field to run the search (新聞 means "news").
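The same query can be sent from the command line instead of the admin page. This is a sketch under assumptions: it supposes the search handler is reachable at /solr/select, which depends on how the Solr example is configured, and that curl is installed. The rows=5 limit and the JSON output format are illustrative choices, not part of the original steps.

```shell
SOLR_URL="http://localhost:8983/solr"
SELECT_URL="$SOLR_URL/select"    # assumed handler path, adjust if your setup differs
QUERY='content:新聞'

# -G sends the parameters as a GET query string;
# --data-urlencode percent-encodes the non-ASCII query term.
if ! curl -sS -G --max-time 10 "$SELECT_URL" \
      --data-urlencode "q=$QUERY" \
      --data-urlencode "rows=5" \
      --data-urlencode "wt=json"; then
  echo "No answer from $SELECT_URL (is start.jar still running?)" >&2
fi
```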
