I. Environment:
apache-nutch-1.8
solr-4.7.0
II. Nutch configuration notes:
1. Configure nutch-site.xml
<property>
  <name>http.agent.name</name>
  <value>MySpider</value>
</property>
<property>
  <name>http.robots.agents</name>
  <value>MySpider,*</value>
</property>
http.agent.name is mandatory.
http.robots.agents is optional; if it is left unset, a warning is printed when the fetch starts, but the run is not affected.
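A complete minimal nutch-site.xml with the two properties above might look like the sample below (MySpider is just a placeholder agent name). The snippet writes the sample to /tmp only so the sanity check is self-contained; the real file lives in conf/nutch-site.xml:

```shell
# Sketch: minimal nutch-site.xml (sample copy in /tmp for the demo)
cat > /tmp/nutch-site.xml <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>MySpider</value>
  </property>
  <property>
    <name>http.robots.agents</name>
    <value>MySpider,*</value>
  </property>
</configuration>
EOF
# Crude check that the mandatory agent name has a non-empty value:
# print the <value> on the line following http.agent.name
grep -A1 'http.agent.name' /tmp/nutch-site.xml | grep -o '<value>[^<]*</value>'
```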
2. Make bin/nutch executable
chmod +x bin/nutch
3. The deploy directory appears only when Nutch is installed as a deployable binary package.
4. Reading regex-urlfilter.txt
# skip file: ftp: and mailto: urls
-^(file|ftp|mailto):
"-" filters out URLs that match the regular expression that follows it.
"^" anchors the match to the start of the URL.
"file|ftp|mailto" matches URLs beginning with file:, ftp:, or mailto:.
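The effect of that rule can be mimicked with grep; the seed list below is made up for the demo:

```shell
# Hypothetical seed list written to /tmp for the demo
cat > /tmp/seeds.txt <<'EOF'
http://www.163.com/
ftp://mirror.example.org/pub/
mailto:admin@example.org
file:///etc/passwd
EOF
# grep -vE drops lines matching the pattern, just as a "-" rule does
grep -vE '^(file|ftp|mailto):' /tmp/seeds.txt
```

Only the http:// seed survives the filter.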
III. Solr configuration notes:
- cp ${NUTCH_RUNTIME_HOME}/conf/schema.xml ${APACHE_SOLR_HOME}/example/solr/conf/
- vim ${APACHE_SOLR_HOME}/example/solr/conf/schema.xml
- Copy exactly at line 351: <field name="_version_" type="long" indexed="true" stored="true"/> (collection1/conf already contains this entry)
- Restart Solr with the command "java -jar start.jar" under ${APACHE_SOLR_HOME}/example
After Nutch and Solr are configured on the master, migrate them to the other slaves.
1. scp the two directories over
2. scp /etc/profile
3. source /etc/profile
4. Note: if Hadoop has already been migrated, remember to sync any later configuration changes on the master to the other slaves as well.
5. Watch out for permissions. I don't remember exactly which files need execute permission... safest to just grant it to everything that gets used.
IV. Troubleshooting:
1. Nutch exits after printing: Generator: 0 records selected for fetching, exiting ...
bin/nutch readdb data/crawldb -stats
shows the CrawlDb statistics:
hadoop@master:~/apache-nutch-1.8$ bin/nutch readdb data/crawldb -stats
CrawlDb statistics start: data/crawldb
Statistics for CrawlDb: data/crawldb
TOTAL urls: 1
retry 1: 1
min score: 1.0
avg score: 1.0
max score: 1.0
status 1 (db_unfetched): 1
CrawlDb statistics: done
db_unfetched is 1, but retry is also 1, so the fetch failed.
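That diagnosis can also be scripted. The snippet below parses a sample of the stats output shown above (the awk field handling assumes the "label: count" format that readdb prints):

```shell
# Sample lines from "bin/nutch readdb data/crawldb -stats" output
cat > /tmp/crawldb-stats.txt <<'EOF'
TOTAL urls:	1
retry 1:	1
status 1 (db_unfetched):	1
EOF
# Pull the counts after the colon
unfetched=$(awk -F: '/db_unfetched/ {gsub(/[ \t]/, "", $2); print $2}' /tmp/crawldb-stats.txt)
retried=$(awk -F: '/retry 1/ {gsub(/[ \t]/, "", $2); print $2}' /tmp/crawldb-stats.txt)
# Every url unfetched and already on retry 1 => the first fetch failed
if [ "$unfetched" = "$retried" ] && [ "$unfetched" != "0" ]; then
  echo "fetch failed"
fi
```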
2. Job failed
Command:
bin/crawl urls/seed.txt data http://localhost:8983/solr/ 2
Error:
Indexer: java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1357)
at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:114)
at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:176)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:186)
Solr log:
org.apache.solr.common.SolrException: ERROR: [doc=http://www.163.com/] unknown field 'host'
Nutch log:
org.apache.solr.common.SolrException: Bad Request
Bad Request
request: http://localhost:8080/solr/update?wt=javabin&version=2
Solution
Edit ${APACHE_SOLR_HOME}/example/solr/collection1/conf/schema.xml
and add the following between <fields> and </fields>:
<field name="host" type="string" stored="false" indexed="true"/>
<field name="digest" type="string" stored="true" indexed="false"/>
<field name="segment" type="string" stored="true" indexed="false"/>
<field name="boost" type="float" stored="true" indexed="false"/>
<field name="tstamp" type="date" stored="true" indexed="false"/>
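After editing, a quick grep can confirm all five fields made it into the schema. The fragment below is a stand-in written to /tmp so the check is self-contained; in practice you would grep the real collection1/conf/schema.xml:

```shell
# Stand-in schema fragment (the real file is
# ${APACHE_SOLR_HOME}/example/solr/collection1/conf/schema.xml)
cat > /tmp/schema-fragment.xml <<'EOF'
<fields>
  <field name="host" type="string" stored="false" indexed="true"/>
  <field name="digest" type="string" stored="true" indexed="false"/>
  <field name="segment" type="string" stored="true" indexed="false"/>
  <field name="boost" type="float" stored="true" indexed="false"/>
  <field name="tstamp" type="date" stored="true" indexed="false"/>
</fields>
EOF
# Report any of the five Nutch fields that is missing
for f in host digest segment boost tstamp; do
  grep -q "name=\"$f\"" /tmp/schema-fragment.xml || echo "missing: $f"
done
echo "check done"
```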
Result
Search the crawled content by opening http://localhost:8983/solr/#/collection1/query in a browser.
http://stackoverflow.com/questions/13429481/error-while-indexing-in-solr-data-crawled-by-nutch
http://blog.csdn.net/panjunbiao/article/details/12171147
3. Warning: $HADOOP_HOME is deprecated.
Add the following to /etc/profile:
export HADOOP_HOME_WARN_SUPPRESS=1
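The warning comes from Hadoop's bin scripts, which skip it when this variable is set. A quick way to confirm the variable is actually exported to child shells (after sourcing /etc/profile):

```shell
# Export the suppress flag, then confirm a child process sees it
export HADOOP_HOME_WARN_SUPPRESS=1
sh -c 'echo "HADOOP_HOME_WARN_SUPPRESS=$HADOOP_HOME_WARN_SUPPRESS"'
```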
4. Error when running Nutch on Hadoop:
14/04/08 12:28:07 INFO mapred.JobClient: Map output records=1
14/04/08 12:28:07 INFO crawl.Generator: Generator: finished at 2014-04-08 12:28:07, elapsed: 00:00:51
ls: cannot access data/segments/: No such file or directory
Operating on segment :
Fetching :
Solution: see my other post:
nutch on hadoop: ls: cannot access data/segments: No such file or directory
V. Open issues:
1. Stress testing
http://blog.csdn.net/shirdrn/article/details/7055355
2. anchor
org.apache.solr.common.SolrException: ERROR: [doc=http://baike.baidu.com/] unknown field 'anchor'
Suspected to be a Chinese word-segmentation issue or similar.
http://www.163.com/ works
http://www.baidu.com and http://www.renren.com fail
http://nutch.apache.org/ http://hadoop.apache.org/ and http://lucene.apache.org/ work
3. Generator: 0 records selected for fetching,
Suspected there were too few URLs; after adding more, the problem went away.
VI. Appendix: