I. Environment:
apache-nutch-1.8
solr-4.7.0
II. Nutch configuration notes:
1. Configure nutch-site.xml
<property>
  <name>http.agent.name</name>
  <value>MySpider</value>
</property>
<property>
  <name>http.robots.agents</name>
  <value>MySpider,*</value>
</property>
http.agent.name is mandatory.
http.robots.agents is optional; if it is left unset, a warning is printed when the fetch starts, but the run is not affected.
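A complete minimal nutch-site.xml with the two properties above might look like the sample below (MySpider is just a placeholder agent name). The snippet writes the sample to /tmp only so the sanity check is self-contained; the real file lives in conf/nutch-site.xml:

```shell
# Sketch: minimal nutch-site.xml (sample copy in /tmp for the demo)
cat > /tmp/nutch-site.xml <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>MySpider</value>
  </property>
  <property>
    <name>http.robots.agents</name>
    <value>MySpider,*</value>
  </property>
</configuration>
EOF
# Crude check that the mandatory agent name has a non-empty value:
# print the <value> on the line following http.agent.name
grep -A1 'http.agent.name' /tmp/nutch-site.xml | grep -o '<value>[^<]*</value>'
```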
2. Make bin/nutch executable
chmod +x bin/nutch
3. The deploy directory appears only when Nutch is installed as a deployable binary package.
4. Reading regex-urlfilter.txt
# skip file: ftp: and mailto: urls
-^(file|ftp|mailto):
"-" filters out URLs that match the regular expression that follows it.
"^" anchors the match to the start of the URL.
"file|ftp|mailto" matches URLs beginning with file:, ftp:, or mailto:.
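The effect of that rule can be mimicked with grep; the seed list below is made up for the demo:

```shell
# Hypothetical seed list written to /tmp for the demo
cat > /tmp/seeds.txt <<'EOF'
http://www.163.com/
ftp://mirror.example.org/pub/
mailto:admin@example.org
file:///etc/passwd
EOF
# grep -vE drops lines matching the pattern, just as a "-" rule does
grep -vE '^(file|ftp|mailto):' /tmp/seeds.txt
```

Only the http:// seed survives the filter.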
III. Solr configuration notes:
- cp ${NUTCH_RUNTIME_HOME}/conf/schema.xml ${APACHE_SOLR_HOME}/example/solr/conf/
- vim ${APACHE_SOLR_HOME}/example/solr/conf/schema.xml
- Copy exactly at line 351: <field name="_version_" type="long" indexed="true" stored="true"/> (collection1/conf already contains this entry)
- Restart Solr with the command "java -jar start.jar" under ${APACHE_SOLR_HOME}/example
After Nutch and Solr are configured on the master, migrate them to the other slaves.
1. scp the two directories over
2. scp /etc/profile
3. source /etc/profile
4. Note: if Hadoop has already been migrated, remember to sync any later configuration changes on the master to the other slaves as well.
5. Watch out for permissions. I don't remember exactly which files need execute permission... safest to just grant it to everything that gets used.
IV. Troubleshooting:
1. Nutch exits after printing: Generator: 0 records selected for fetching, exiting ...
bin/nutch readdb data/crawldb -stats
shows the CrawlDb statistics:
hadoop@master:~/apache-nutch-1.8$ bin/nutch readdb data/crawldb -stats
CrawlDb statistics start: data/crawldb
Statistics for CrawlDb: data/crawldb
TOTAL urls: 1
retry 1: 1
min score: 1.0
avg score: 1.0
max score: 1.0
status 1 (db_unfetched): 1
CrawlDb statistics: done
db_unfetched is 1, but retry is also 1, so the fetch failed.
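That diagnosis can also be scripted. The snippet below parses a sample of the stats output shown above (the awk field handling assumes the "label: count" format that readdb prints):

```shell
# Sample lines from "bin/nutch readdb data/crawldb -stats" output
cat > /tmp/crawldb-stats.txt <<'EOF'
TOTAL urls:	1
retry 1:	1
status 1 (db_unfetched):	1
EOF
# Pull the counts after the colon
unfetched=$(awk -F: '/db_unfetched/ {gsub(/[ \t]/, "", $2); print $2}' /tmp/crawldb-stats.txt)
retried=$(awk -F: '/retry 1/ {gsub(/[ \t]/, "", $2); print $2}' /tmp/crawldb-stats.txt)
# Every url unfetched and already on retry 1 => the first fetch failed
if [ "$unfetched" = "$retried" ] && [ "$unfetched" != "0" ]; then
  echo "fetch failed"
fi
```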
2. Job failed
Command:
bin/crawl urls/seed.txt data http://localhost:8983/solr/ 2
Error:
Indexer: java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1357)
at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:114)
at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:176)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:186)
Solr log:
org.apache.solr.common.SolrException: ERROR: [doc=http://www.163.com/] unknown field 'host'
Nutch log:
org.apache.solr.common.SolrException: Bad Request
Bad Request
request: http://localhost:8080/solr/update?wt=javabin&version=2
Solution
Edit ${APACHE_SOLR_HOME}/example/solr/collection1/conf/schema.xml
and add the following between <fields> and </fields>:
<field name="host" type="string" stored="false" indexed="true"/>
<field name="digest" type="string" stored="true" indexed="false"/>
<field name="segment" type="string" stored="true" indexed="false"/>
<field name="boost" type="float" stored="true" indexed="false"/>
<field name="tstamp" type="date" stored="true" indexed="false"/>
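After editing, a quick grep can confirm all five fields made it into the schema. The fragment below is a stand-in written to /tmp so the check is self-contained; in practice you would grep the real collection1/conf/schema.xml:

```shell
# Stand-in schema fragment (the real file is
# ${APACHE_SOLR_HOME}/example/solr/collection1/conf/schema.xml)
cat > /tmp/schema-fragment.xml <<'EOF'
<fields>
  <field name="host" type="string" stored="false" indexed="true"/>
  <field name="digest" type="string" stored="true" indexed="false"/>
  <field name="segment" type="string" stored="true" indexed="false"/>
  <field name="boost" type="float" stored="true" indexed="false"/>
  <field name="tstamp" type="date" stored="true" indexed="false"/>
</fields>
EOF
# Report any of the five Nutch fields that is missing
for f in host digest segment boost tstamp; do
  grep -q "name=\"$f\"" /tmp/schema-fragment.xml || echo "missing: $f"
done
echo "check done"
```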
Result
Search the crawled content by opening http://localhost:8983/solr/#/collection1/query in a browser.
http://stackoverflow.com/questions/13429481/error-while-indexing-in-solr-data-crawled-by-nutch
http://blog.csdn.net/panjunbiao/article/details/12171147
3. Warning: $HADOOP_HOME is deprecated.
Add the following to /etc/profile:
export HADOOP_HOME_WARN_SUPPRESS=1
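The warning comes from Hadoop's bin scripts, which skip it when this variable is set. A quick way to confirm the variable is actually exported to child shells (after sourcing /etc/profile):

```shell
# Export the suppress flag, then confirm a child process sees it
export HADOOP_HOME_WARN_SUPPRESS=1
sh -c 'echo "HADOOP_HOME_WARN_SUPPRESS=$HADOOP_HOME_WARN_SUPPRESS"'
```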
4. Error when running Nutch on Hadoop:
14/04/08 12:28:07 INFO mapred.JobClient: Map output records=1
14/04/08 12:28:07 INFO crawl.Generator: Generator: finished at 2014-04-08 12:28:07, elapsed: 00:00:51
ls: cannot access data/segments/: No such file or directory
Operating on segment :
Fetching :
Solution: see my other post:
nutch on hadoop: ls: cannot access data/segments: No such file or directory
V. Open issues:
1. Stress testing
http://blog.csdn.net/shirdrn/article/details/7055355
2. anchor
org.apache.solr.common.SolrException: ERROR: [doc=http://baike.baidu.com/] unknown field 'anchor'
Suspected to be a Chinese word-segmentation issue or similar.
http://www.163.com/ works
http://www.baidu.com and http://www.renren.com fail
http://nutch.apache.org/ http://hadoop.apache.org/ and http://lucene.apache.org/ work
3. Generator: 0 records selected for fetching,
Suspected there were too few URLs; after adding more, the problem went away.
VI. Appendix: