Search engine: integrating Nutch 1.4 with Solr 4.0 (working)

Nutch 1.4 + Solr 4.0 integration, using these packages:

apache-nutch-1.4-bin.tar.gz

apache-solr-4.0-2012-05-08_08-10-03.tgz

Test environment: CentOS 6.2


1. Unpack Nutch 1.4 and Solr 4.0 (optional: set NUTCH_HOME and SOLR_HOME).
2. Change the last line of $NUTCH_HOME/runtime/local/conf/regex-urlfilter.txt to
  +^http://www.163.com/
3. Add the following inside <configuration> in $NUTCH_HOME/runtime/local/conf/nutch-site.xml
  <property>
        <name>http.agent.name</name>
        <value>My Nutch Agent</value>
        <description>HTTP 'User-Agent' request header. MUST NOT be empty -
          please set this to a single word uniquely related to your organization.
          NOTE: You should also check other related properties:
                http.robots.agents
                http.agent.description
                http.agent.url
                http.agent.email
                http.agent.version
                and set their values appropriately.
        </description>
  </property>
  <property>
        <name>http.agent.description</name>
        <value></value>
        <description>Further description of our bot- this text is used in
        the User-Agent header. It appears in parenthesis after the agent name.
        </description>
  </property>
  <property>
        <name>http.agent.url</name>
        <value></value>
        <description>A URL to advertise in the User-Agent header. This will
        appear in parenthesis after the agent name. Custom dictates that this
        should be a URL of a page explaining the purpose and behavior of this crawler.
        </description>
  </property>
  <property>
        <name>http.agent.email</name>
        <value></value>
        <description>An email address to advertise in the HTTP 'From' request
        header and User-Agent header. A good practice is to mangle this
        address (e.g. 'info at example dot com') to avoid spamming.
        </description>
  </property>
4. Under $NUTCH_HOME/runtime/local, create
  urls/seed.txt
  with the content
  http://www.163.com
5. Replace the <schema></schema> element in
  $SOLR_HOME/example/solr/conf/schema.xml
  with the <schema></schema> element from
  $NUTCH_HOME/runtime/local/conf/schema-solr4.xml


6. Add stopwords_en.txt to the Solr conf directory from step 5 ($SOLR_HOME/example/solr/conf).
7. Change to the $SOLR_HOME/example/ directory and start Solr:
  java -jar start.jar
  Then open http://localhost:8983/solr in a browser.
8. From $NUTCH_HOME/runtime/local, run
   bin/nutch crawl urls -solr http://localhost:8983/solr/ -dir crawl -depth 10 -threads 5 -topN 100
   If a crawl has already been run, delete the crawl directory first: rm -rf crawl
9. Wait for the crawl to finish. On the http://localhost:8983/solr page, click Query
  and enter content:新聞 (新聞 means "news") in the q field to search.
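
The file edits in steps 2 and 4 can also be scripted. The following is only a rough sketch, not part of the original walkthrough: SITE, NUTCH_LOCAL and FILTER are illustrative variable names, and the script assumes the last line of the stock regex-urlfilter.txt is the catch-all rule "+.". When NUTCH_HOME is unset it works in a scratch directory, so it can be tried without touching a real installation.

```shell
#!/bin/sh
# Sketch of steps 2 and 4: restrict the crawl to one site and create the seed list.
# SITE, NUTCH_LOCAL and FILTER are hypothetical helper names.
SITE="http://www.163.com"

if [ -n "${NUTCH_HOME:-}" ]; then
  NUTCH_LOCAL="$NUTCH_HOME/runtime/local"
else
  NUTCH_LOCAL="$(mktemp -d)"   # dry run in a scratch directory
fi

mkdir -p "$NUTCH_LOCAL/conf" "$NUTCH_LOCAL/urls"
FILTER="$NUTCH_LOCAL/conf/regex-urlfilter.txt"

# In a dry run there is no stock file, so create a stand-in that holds
# only the assumed catch-all rule.
[ -f "$FILTER" ] || printf '%s\n' '+.' > "$FILTER"

# Step 2: keep a backup, drop the last line, then append the site rule.
cp "$FILTER" "$FILTER.bak"
sed '$d' "$FILTER.bak" > "$FILTER"
printf '+^%s/\n' "$SITE" >> "$FILTER"

# Step 4: write the seed URL list.
printf '%s\n' "$SITE" > "$NUTCH_LOCAL/urls/seed.txt"

echo "Prepared $NUTCH_LOCAL"
```

If the real regex-urlfilter.txt ends with blank lines, the script would remove a blank line rather than the "+." rule, so check the backup file before starting the crawl in step 8.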

 
 
    

