[Crawler] Nutch 1.15 & Solr 8.2.0 Configuration

Nutch

1.x vs 2.x

Nutch development has focused mainly on 1.x for the last few years. 2.x was designed around Apache Gora, so it supports a wide variety of storage backends, which is very useful. However, it has not been stable for several years, although it has improved considerably. Many new crawling features have been added to 1.x but not to 2.x, because maintaining two branches takes too much time. Nutch 1.x has received many bug fixes, new features, and plugins. Nutch 2.x has flexible storage but lacks many features that were added to 1.x. On the bright side, a good number of the newer 1.x features are plugins, which are easier to port.

Environment

CentOS 7.6
JDK 1.8
Nutch 1.15
Solr 8.2.0

Installation

JAVA_HOME

export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk
export CLASSPATH=.:$JAVA_HOME/jre/lib/rt.jar:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar
export PATH=$PATH:$JAVA_HOME/bin
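
A quick sanity check can catch a wrong JAVA_HOME before Nutch or Solr fails with a less obvious error. The sketch below assumes a JDK layout with bin/java under the given directory; check_java_home is a hypothetical helper name, not part of the JDK or Nutch.

```shell
# Sketch: verify that a directory looks like a usable JAVA_HOME.
# check_java_home is an illustrative helper, not a JDK or Nutch command.
check_java_home() {
  local home="${1:-$JAVA_HOME}"
  if [ -z "$home" ]; then
    echo "JAVA_HOME is not set" >&2
    return 1
  fi
  if [ ! -x "$home/bin/java" ]; then
    echo "no executable found at $home/bin/java" >&2
    return 1
  fi
  echo "JAVA_HOME looks usable: $home"
}

# Example (run after the exports above):
#   check_java_home && "$JAVA_HOME/bin/java" -version
```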

Download Nutch & Solr

# Both downloads are binary releases
wget https://mirrors.tuna.tsinghua.edu.cn/apache/nutch/1.15/apache-nutch-1.15-bin.tar.gz
wget https://mirrors.tuna.tsinghua.edu.cn/apache/lucene/solr/8.2.0/solr-8.2.0.tgz
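
The Solr commands later in this post refer to ${NUTCH_RUNTIME_HOME} and ${APACHE_SOLR_HOME} without defining them. A minimal sketch of unpacking both archives and setting those variables follows. The ~/apps target directory and the unpack_release helper are assumptions for illustration, and the top-level directory names apache-nutch-1.15 and solr-8.2.0 are assumed from the archive names.

```shell
# Sketch: unpack a release archive into a target directory.
# unpack_release and the ~/apps location are illustrative choices.
unpack_release() {
  local archive="$1" dest="$2"
  mkdir -p "$dest" && tar -xzf "$archive" -C "$dest"
}

# Example, assuming the archives unpack to apache-nutch-1.15/ and solr-8.2.0/:
#   unpack_release apache-nutch-1.15-bin.tar.gz "$HOME/apps"
#   unpack_release solr-8.2.0.tgz "$HOME/apps"
#   export NUTCH_RUNTIME_HOME="$HOME/apps/apache-nutch-1.15"
#   export APACHE_SOLR_HOME="$HOME/apps/solr-8.2.0"
```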

Nutch configuration

The crawler's settings are documented in the comments of conf/nutch-default.xml; user-defined overrides go in conf/nutch-site.xml.

<configuration>
	<property>
		<name>http.agent.name</name>
		<value>My Nutch Spider</value>
	</property>

	<property>
		<name>plugin.includes</name>
		<value>protocol-http|urlfilter-(regex|validator)|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
		<description>Regular expression naming plugin directory names to
		include.  Any plugin not matching this expression is excluded.
		In any case you need at least include the nutch-extensionpoints plugin. By
		default Nutch includes crawling just HTML and plain text via HTTP,
		and basic indexing and search plugins. In order to use HTTPS please enable
		protocol-httpclient, but be aware of possible intermittent problems with the
		underlying commons-httpclient library. Set parsefilter-naivebayes for classification based focused crawler.
		</description>
	</property>
</configuration>

Seed URLs to crawl go in urls/seeds.txt
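
As a small illustration, the seed list is a plain text file with one URL per line, placed in the directory that is later passed to bin/nutch inject. write_seeds is a hypothetical helper; the file name seeds.txt follows this post.

```shell
# Sketch: write one seed URL per line into <dir>/seeds.txt.
# write_seeds is an illustrative helper, not a Nutch command.
write_seeds() {
  local dir="$1"
  shift
  mkdir -p "$dir" && printf '%s\n' "$@" > "$dir/seeds.txt"
}

# Example:
#   write_seeds urls https://nutch.apache.org/
```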

The regex URL filter is configured in conf/regex-urlfilter.txt
For example:
+^https?://([a-z0-9-]+\.)*nutch\.apache\.org/
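
To get a feel for what such a rule accepts, the pattern can be tried against sample URLs with grep -E. This is only an approximation: Nutch evaluates these rules as Java regular expressions, while grep -E uses POSIX extended syntax, and the leading + (accept) marker is not part of the pattern itself. url_allowed is a hypothetical helper.

```shell
# Sketch: test a URL against the pattern part of a "+pattern" rule.
# Approximation only: grep -E (POSIX ERE) stands in for Java regex here.
url_allowed() {
  local url="$1"
  local pattern='^https?://([a-z0-9-]+\.)*nutch\.apache\.org/'
  printf '%s\n' "$url" | grep -Eq "$pattern"
}

# Example:
#   url_allowed https://nutch.apache.org/downloads.html && echo accepted
#   url_allowed https://www.163.com/ || echo rejected
```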

Workflow:

inject -> generate -> fetch -> parse -> updatedb -> invertlinks -> indexing

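  • inject: add the seed URLs to the crawldb
  • generate: select URLs from the crawldb and create a new segment containing the fetch list
  • fetch: download the pages listed in the segment
  • parse: parse the fetched content
  • updatedb: update the crawldb with the fetch and parse results
  • invertlinks: build the linkdb, which maps each URL to its inbound links and anchor text
  • indexing: send the parsed documents to Solr
  • dedup: mark duplicate documents so they can be removed
  • clean: maintain the index by removing pages that are gone (such as 404s)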

Example:

bin/nutch inject crawl/crawldb urls
bin/nutch generate crawl/crawldb crawl/segments -topN 1000

s1=`ls -d crawl/segments/2* | tail -1`
echo $s1

bin/nutch fetch $s1
bin/nutch parse $s1
bin/nutch updatedb crawl/crawldb $s1

bin/nutch invertlinks crawl/linkdb -dir crawl/segments
bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb -linkdb crawl/linkdb crawl/segments/*

# Usage: bin/nutch dedup <crawldb> [-group <none|host|domain>] [-compareOrder <score>,<fetchTime>,<urlLength>]
bin/nutch dedup crawl/crawldb
bin/nutch clean crawl/crawldb/ http://localhost:8983/solr
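
The commands above cover a single round. A crawl usually repeats generate, fetch, parse and updatedb several times so that newly discovered links get fetched too. Nutch ships a bin/crawl script for this; the sketch below only illustrates the loop. crawl_rounds and the NUTCH variable are hypothetical names, indexing is left out, and the early exit assumes that generate returns a non-zero status when it selects nothing to fetch.

```shell
# Sketch: one inject, then N generate/fetch/parse/updatedb rounds, then invertlinks.
# NUTCH and crawl_rounds are illustrative names; indexing is left out on purpose.
NUTCH="${NUTCH:-bin/nutch}"

crawl_rounds() {
  local crawl_dir="$1" seed_dir="$2" rounds="$3"
  local i=1 segment
  "$NUTCH" inject "$crawl_dir/crawldb" "$seed_dir" || return 1
  while [ "$i" -le "$rounds" ]; do
    # Assumption: generate exits non-zero when it selects nothing to fetch.
    "$NUTCH" generate "$crawl_dir/crawldb" "$crawl_dir/segments" -topN 1000 || break
    # Segment directories are named by timestamp, so the last one is the newest.
    segment=$(ls -d "$crawl_dir"/segments/2* | tail -1)
    "$NUTCH" fetch "$segment" || return 1
    "$NUTCH" parse "$segment" || return 1
    "$NUTCH" updatedb "$crawl_dir/crawldb" "$segment" || return 1
    i=$((i + 1))
  done
  "$NUTCH" invertlinks "$crawl_dir/linkdb" -dir "$crawl_dir/segments"
}

# Example:
#   crawl_rounds crawl urls 3
```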

Solr configuration

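Note: Solr was originally installed under /usr/local. Running the Solr service as root is officially discouraged, and a regular user generally lacks sufficient permissions in that directory, so starting Solr as a regular user led to odd failures (creating a new Solr core failed, and chmod did not help). Moving the installation to the home directory resolved the problem.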
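
A pre-flight check along these lines might catch the problem earlier. The sketch covers only the two conditions described in the note (not running as root, installation directory writable by the current user). solr_dir_ok is a hypothetical helper, and its optional second argument exists only so the check can be exercised with a given uid.

```shell
# Sketch: refuse to proceed as root or when the Solr directory is not writable.
# solr_dir_ok is an illustrative helper, not part of Solr.
solr_dir_ok() {
  local dir="$1" uid="${2:-$(id -u)}"
  if [ "$uid" -eq 0 ]; then
    echo "running as root; use a regular user for Solr" >&2
    return 1
  fi
  if [ ! -w "$dir" ]; then
    echo "$dir is not writable by the current user" >&2
    return 1
  fi
}

# Example:
#   solr_dir_ok "$APACHE_SOLR_HOME" && "$APACHE_SOLR_HOME/bin/solr" start
```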
There was one more pitfall:

2018-10-15 12:55:24.484 ERROR (qtp102617125-17) [ x:nutch] o.a.s.h.RequestHandlerBase org.apache.solr.common.SolrException: Error CREATEing SolrCore 'nutch': Unable to create core [nutch] Caused by: fieldType 'pdates' not found in the schema

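This is caused by a version incompatibility between Solr and Nutch (the newer version no longer ships the pdates family of field type definitions). The following needs to be added alongside the other fieldType elements in the Nutch schema.xml under Solr: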

<fieldType name="pdate" class="solr.DatePointField" docValues="true"/>
<fieldType name="pdates" class="solr.DatePointField" docValues="true" multiValued="true"/>
<fieldType name="pint" class="solr.IntPointField" docValues="true"/>
<fieldType name="pfloat" class="solr.FloatPointField" docValues="true"/>
<fieldType name="plong" class="solr.LongPointField" docValues="true"/>
<fieldType name="pdouble" class="solr.DoublePointField" docValues="true"/>
<fieldType name="pints" class="solr.IntPointField" docValues="true" multiValued="true"/>
<fieldType name="pfloats" class="solr.FloatPointField" docValues="true" multiValued="true"/>
<fieldType name="plongs" class="solr.LongPointField" docValues="true" multiValued="true"/>
<fieldType name="pdoubles" class="solr.DoublePointField" docValues="true" multiValued="true"/>
<fieldType name="random" class="solr.RandomSortField" indexed="true"/>

Commands used to set up the nutch configset and core:

mkdir -p ${APACHE_SOLR_HOME}/server/solr/configsets/nutch/
cp -r ${APACHE_SOLR_HOME}/server/solr/configsets/_default/* ${APACHE_SOLR_HOME}/server/solr/configsets/nutch/
cp ${NUTCH_RUNTIME_HOME}/conf/schema.xml ${APACHE_SOLR_HOME}/server/solr/configsets/nutch/conf/
rm ${APACHE_SOLR_HOME}/server/solr/configsets/nutch/conf/managed-schema
${APACHE_SOLR_HOME}/bin/solr start
# The error occurs at the next step
${APACHE_SOLR_HOME}/bin/solr create -c nutch -d ${APACHE_SOLR_HOME}/server/solr/configsets/nutch/conf/
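
Before running solr create, it can help to confirm that the copied schema defines every field type named in the error. The sketch below greps schema.xml for fieldType definitions. missing_field_types is a hypothetical helper, and it assumes each fieldType tag keeps its name attribute on the same line as the tag opening.

```shell
# Sketch: print each given type name that has no <fieldType name="..."> in the schema.
# Assumes the name attribute sits on the same line as the opening <fieldType.
missing_field_types() {
  local schema="$1" name
  shift
  for name in "$@"; do
    grep -Eq "<fieldType[[:space:]][^>]*name=\"$name\"" "$schema" || echo "$name"
  done
}

# Example:
#   missing_field_types "${APACHE_SOLR_HOME}/server/solr/configsets/nutch/conf/schema.xml" \
#     pdate pdates pint pints plong plongs pfloat pfloats pdouble pdoubles
```

An empty result means all listed types are defined; any name printed still needs a fieldType entry.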

Creating a core for the NetEase 163 portal

${APACHE_SOLR_HOME}/bin/solr create -c netease -d ${APACHE_SOLR_HOME}/server/solr/configsets/netease/conf/


bin/nutch inject data/netease/crawldb urls
bin/nutch generate data/netease/crawldb data/netease/segments -topN 1000
s2=`ls -d data/netease/segments/2* | tail -1`
echo $s2

bin/nutch fetch $s2
bin/nutch parse $s2
bin/nutch updatedb data/netease/crawldb $s2

bin/nutch invertlinks data/netease/linkdb -dir data/netease/segments
bin/nutch solrindex http://127.0.0.1:8983/solr/ data/netease/crawldb -linkdb data/netease/linkdb data/netease/segments/*

This crawl has not fetched any data yet.
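
One way to narrow down where such a crawl stops is to look at which parts of the newest segment exist. In Nutch 1.x a segment normally gains the subdirectories crawl_generate, crawl_fetch, content, crawl_parse, parse_data and parse_text as the steps complete. segment_status is a hypothetical helper that reports which of them are present.

```shell
# Sketch: report which step outputs exist inside a segment directory.
# segment_status is an illustrative helper, not a Nutch command.
segment_status() {
  local segment="$1" part
  for part in crawl_generate crawl_fetch content crawl_parse parse_data parse_text; do
    if [ -d "$segment/$part" ]; then
      echo "$part: present"
    else
      echo "$part: missing"
    fi
  done
}

# Example:
#   segment_status "$(ls -d data/netease/segments/2* | tail -1)"
#   bin/nutch readdb data/netease/crawldb -stats
```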

Questions

2019-09-18 19:08:11,181 ERROR fetcher.Fetcher - Fetcher: No agents listed in 'http.agent.name' property.
2019-09-18 19:08:11,194 ERROR fetcher.Fetcher - Fetcher: java.lang.IllegalArgumentException: Fetcher: No agents listed in 'http.agent.name' property.
	at org.apache.nutch.fetcher.Fetcher.checkConfiguration(Fetcher.java:561)
	at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:429)
	at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:543)
	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
	at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:517)

I have two XML files, nutch-default.xml and nutch-site.xml. Why?
nutch-default.xml is the out-of-the-box configuration for Nutch, and most settings can (and, unless you know what you're doing, should) stay as they are. nutch-site.xml is where you make changes that override the defaults.

Fix: add a name and value for the agent in conf/nutch-site.xml:

<configuration>
	<property>
		<name>http.agent.name</name>
		<value>my nutch spider</value>
	</property>
</configuration>
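
To confirm the property is actually set before fetching again, the value can be pulled out of the file with standard tools. agent_name is a hypothetical helper, and the sketch relies on the value element directly following the name element, as in the snippet above; it is not a general XML parser.

```shell
# Sketch: print the value configured for http.agent.name in a Nutch config file.
# Relies on <value> directly following <name>, as in the snippet above.
agent_name() {
  tr -d '\n\t' < "$1" |
    sed -nE 's#.*<name>http\.agent\.name</name> *<value>([^<]*)</value>.*#\1#p'
}

# Example:
#   [ -n "$(agent_name conf/nutch-site.xml)" ] || echo "http.agent.name is empty"
```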