apache-nutch-1.4-bin.tar.gz
apache-solr-4.0-2012-05-08_08-10-03.tgz
Test environment: CentOS 6.2
1. Unpack Nutch 1.4 and Solr 4.0 (optional: set NUTCH_HOME and SOLR_HOME).
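Step 1 can be sketched in shell as follows. This is a minimal sketch, assuming both archives sit in the current directory. `INSTALL_DIR` and the `unpack` helper are hypothetical names, and the top-level directory of each archive is read from the archive listing rather than guessed.

```shell
# Hypothetical install location; defaults to the current directory.
INSTALL_DIR="${INSTALL_DIR:-$PWD}"

# unpack ARCHIVE: extract a gzipped tarball into INSTALL_DIR and print
# the path of its top-level directory (taken from the archive listing).
unpack() {
    mkdir -p "$INSTALL_DIR" || return 1
    tar -xzf "$1" -C "$INSTALL_DIR" || return 1
    top=$(tar -tzf "$1" | head -n 1 | cut -d/ -f1)
    printf '%s\n' "$INSTALL_DIR/$top"
}

# Only unpack when the archives are actually present.
if [ -f apache-nutch-1.4-bin.tar.gz ]; then
    NUTCH_HOME=$(unpack apache-nutch-1.4-bin.tar.gz) && export NUTCH_HOME
fi
if [ -f apache-solr-4.0-2012-05-08_08-10-03.tgz ]; then
    SOLR_HOME=$(unpack apache-solr-4.0-2012-05-08_08-10-03.tgz) && export SOLR_HOME
fi
```

The exported variables only last for the current shell session; adding the two `export` lines to a shell profile would make them permanent.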
2. Change the last line of $NUTCH_HOME/runtime/local/conf/regex-urlfilter.txt to
+^http://www.163.com/
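The edit in step 2 could also be scripted. The sketch below assumes NUTCH_HOME is set and that the line to replace really is the last line of the file (in the stock file this is normally the catch-all rule `+.`). `set_last_rule` is a hypothetical helper name.

```shell
# set_last_rule FILE RULE: replace the last line of FILE with RULE,
# keeping a backup copy in FILE.bak.
set_last_rule() {
    cp "$1" "$1.bak" || return 1
    sed '$d' "$1.bak" > "$1"      # everything except the last line
    printf '%s\n' "$2" >> "$1"    # append the new rule
}

# Assumes NUTCH_HOME was set in step 1; skipped if the file is not found.
filter="${NUTCH_HOME:-}/runtime/local/conf/regex-urlfilter.txt"
if [ -f "$filter" ]; then
    set_last_rule "$filter" '+^http://www.163.com/'
    tail -n 1 "$filter"
fi
```

Note that this rule only accepts URLs on the www.163.com host; pages on other subdomains would be filtered out.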
3. Add the following properties to $NUTCH_HOME/runtime/local/conf/nutch-site.xml, inside the <configuration> element:
<property>
<name>http.agent.name</name>
<value>My Nutch Agent</value>
<description>HTTP 'User-Agent' request header. MUST NOT be empty -
please set this to a single word uniquely related to your organization.
NOTE: You should also check other related properties:
http.robots.agents
http.agent.description
http.agent.url
http.agent.email
http.agent.version
and set their values appropriately.
</description>
</property>
<property>
<name>http.agent.description</name>
<value></value>
<description>Further description of our bot- this text is used in
the User-Agent header. It appears in parenthesis after the agent name.
</description>
</property>
<property>
<name>http.agent.url</name>
<value></value>
<description>A URL to advertise in the User-Agent header. This will
appear in parenthesis after the agent name. Custom dictates that this
should be a URL of a page explaining the purpose and behavior of this crawler.
</description>
</property>
<property>
<name>http.agent.email</name>
<value></value>
<description>An email address to advertise in the HTTP 'From' request
header and User-Agent header. A good practice is to mangle this
address (e.g. 'info at example dot com') to avoid spamming.
</description>
</property>
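Since the crawl depends on http.agent.name being non-empty, a quick check after editing can save a failed run. This is a rough sketch, assuming the name and value elements sit on consecutive lines as in the properties above. `agent_name_of` is a hypothetical helper and does not parse XML properly.

```shell
# agent_name_of FILE: print the <value> on the line that follows
# <name>http.agent.name</name> in a Hadoop-style configuration file.
agent_name_of() {
    grep -A 1 '<name>http.agent.name</name>' "$1" |
        sed -n 's|.*<value>\(.*\)</value>.*|\1|p'
}

# Assumes NUTCH_HOME was set in step 1; skipped if the file is not found.
site="${NUTCH_HOME:-}/runtime/local/conf/nutch-site.xml"
if [ -f "$site" ]; then
    name=$(agent_name_of "$site")
    if [ -n "$name" ]; then
        echo "http.agent.name = $name"
    else
        echo "http.agent.name is empty; set it before crawling" >&2
    fi
fi
```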
4. Under $NUTCH_HOME/runtime/local, create the file
urls/seed.txt
with the content
http://www.163.com
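Creating the seed file from the command line might look like this. `make_seed` is a hypothetical helper that writes one URL per line; it assumes NUTCH_HOME is set as in step 1.

```shell
# make_seed DIR URL...: create DIR/urls/seed.txt with one URL per line.
make_seed() {
    dir="$1"; shift
    mkdir -p "$dir/urls" || return 1
    printf '%s\n' "$@" > "$dir/urls/seed.txt"
}

# Skipped when the Nutch runtime directory cannot be found.
if [ -d "${NUTCH_HOME:-}/runtime/local" ]; then
    make_seed "$NUTCH_HOME/runtime/local" 'http://www.163.com'
fi
```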
5. Replace the <schema></schema> element in
$SOLR_HOME/example/solr/conf/schema.xml
with the <schema></schema> element from
$NUTCH_HOME/runtime/local/conf/schema-solr4.xml
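If schema-solr4.xml is a complete schema document whose root element is <schema> (an assumption worth checking in your copy), replacing the element amounts to copying the whole file over Solr's schema.xml. The sketch below does that and keeps a backup; `install_schema` is a hypothetical helper name.

```shell
# install_schema SRC DST: back up DST to DST.orig (first time only),
# then copy SRC over DST.
install_schema() {
    [ -f "$1" ] || { echo "missing $1" >&2; return 1; }
    if [ -f "$2" ] && [ ! -f "$2.orig" ]; then
        cp "$2" "$2.orig" || return 1
    fi
    cp "$1" "$2"
}

# Assumes NUTCH_HOME and SOLR_HOME were set in step 1.
if [ -n "${NUTCH_HOME:-}" ] && [ -n "${SOLR_HOME:-}" ]; then
    install_schema "$NUTCH_HOME/runtime/local/conf/schema-solr4.xml" \
                   "$SOLR_HOME/example/solr/conf/schema.xml"
fi
```

Solr reads the schema at startup, so a running instance needs a restart before the change takes effect.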
7. Change to the $SOLR_HOME/example/ directory and start Solr:
java -jar start.jar
Then open http://localhost:8983/solr in a browser to confirm that Solr is running.
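On a machine without a browser, the same check can be done from the shell. The following is a sketch only: `wait_for_url` is a hypothetical helper that polls with curl until the URL stops returning an error, which is convenient when Solr was started in the background.

```shell
# wait_for_url URL [ATTEMPTS]: poll URL once per second until it answers
# without an HTTP error, giving up after ATTEMPTS tries (default 30).
wait_for_url() {
    tries="${2:-30}"
    i=1
    while [ "$i" -le "$tries" ]; do
        if curl -sf -o /dev/null "$1"; then
            return 0
        fi
        [ "$i" -lt "$tries" ] && sleep 1
        i=$((i + 1))
    done
    return 1
}

# Typical use from $SOLR_HOME/example (not executed here):
#   java -jar start.jar > solr.log 2>&1 &
#   wait_for_url http://localhost:8983/solr/ && echo "Solr is up"
```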
8. Under $NUTCH_HOME/runtime/local, run the command
bin/nutch crawl urls -solr http://localhost:8983/solr/ -dir crawl -depth 10 -threads 5 -topN 100
If the crawl has been run before, delete the crawl directory first: rm -rf crawl
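The cleanup and the crawl can be combined into one small wrapper so a repeated run always starts clean. This is a sketch meant to be run from $NUTCH_HOME/runtime/local; `run_crawl`, `NUTCH_BIN`, and `SOLR_URL` are hypothetical names, and the comments summarize what each option of the crawl command does.

```shell
SOLR_URL="${SOLR_URL:-http://localhost:8983/solr/}"
NUTCH_BIN="${NUTCH_BIN:-bin/nutch}"   # hypothetical override, e.g. echo for a dry run

# run_crawl: remove any previous crawl directory, then crawl and index.
run_crawl() {
    rm -rf crawl
    # urls      directory that holds seed.txt
    # -solr     Solr instance that receives the indexed pages
    # -dir      output directory for the crawl data
    # -depth    number of generate/fetch/update rounds
    # -threads  number of fetcher threads
    # -topN     maximum number of URLs fetched per round
    "$NUTCH_BIN" crawl urls -solr "$SOLR_URL" -dir crawl \
        -depth 10 -threads 5 -topN 100
}
```

Calling `NUTCH_BIN=echo run_crawl` prints the arguments instead of starting a crawl, which is a cheap way to preview the command; note that the old crawl directory is still removed.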
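9. Wait for the crawl to finish, then open the http://localhost:8983/solr page and click Query.
Enter content:新聞 in the q field to search the indexed pages (新聞 means "news").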
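The same query can be sent without the admin page. The sketch below assumes Solr's standard /select request handler is reachable under the URL used above; `solr_query`, `SOLR_URL`, and `CURL` are hypothetical names. curl's `--data-urlencode` takes care of encoding the non-ASCII query term.

```shell
SOLR_URL="${SOLR_URL:-http://localhost:8983/solr}"
CURL="${CURL:-curl}"   # hypothetical override, e.g. CURL=echo to preview

# solr_query QUERY: send QUERY to the select handler and ask for JSON.
# -G turns the --data-urlencode pairs into a URL-encoded query string.
solr_query() {
    "$CURL" -sG "${SOLR_URL%/}/select" \
        --data-urlencode "q=$1" \
        --data-urlencode 'wt=json'
}

# Example (requires a running Solr, so not executed here):
#   solr_query 'content:新聞'
```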