The official site describes it as follows:
Apache Nutch is a highly extensible and scalable open source web crawler software project.
That is, Nutch is an open-source web crawler project that is both highly extensible and scalable.
Differences between nutch1.x and nutch2.x:
Storage is abstracted away from any specific underlying data store by using Apache Gora for handling object to persistent mappings. This means we can implement an extremely flexible model/stack for storing everything (fetch time, status, content, parsed text, outlinks, inlinks, etc.) into a number of NoSQL storage solutions.
In other words, nutch2.x stores data through Apache Gora's object-to-persistence mapping, which decouples Nutch's storage from any particular underlying data store.
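First, check the Linux environment, as follows:
The deployment of nutch2.2.1 + mysql5.6 proceeds as follows:
1 Install mysql5.6 (omitted)
After installing mysql, modify my.cnf as follows:
2 Log in to mysql and create the table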
CREATE TABLE `webpage` (
`id` varchar(767) NOT NULL,
`headers` blob,
`text` longtext DEFAULT NULL,
`status` int(11) DEFAULT NULL,
`markers` blob,
`parseStatus` blob,
`modifiedTime` bigint(20) DEFAULT NULL,
`prevModifiedTime` bigint(20) DEFAULT NULL,
`score` float DEFAULT NULL,
`typ` varchar(32) CHARACTER SET latin1 DEFAULT NULL,
`batchId` varchar(32) CHARACTER SET latin1 DEFAULT NULL,
`baseUrl` varchar(767) DEFAULT NULL,
`content` longblob,
`title` varchar(2048) DEFAULT NULL,
`reprUrl` varchar(767) DEFAULT NULL,
`fetchInterval` int(11) DEFAULT NULL,
`prevFetchTime` bigint(20) DEFAULT NULL,
`inlinks` mediumblob,
`prevSignature` blob,
`outlinks` mediumblob,
`fetchTime` bigint(20) DEFAULT NULL,
`retriesSinceFetch` int(11) DEFAULT NULL,
`protocolStatus` blob,
`signature` blob,
`metadata` blob,
PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
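3 Install nutch2.2
3.1 Download the nutch2.2 source code
3.2 Download ant
3.3 Enter the nutch2.2 directory, as shown in the figure (note: the /runtime directory does not exist before compiling):
3.4 Run the ant command to compile. After compilation a runtime directory appears, containing two subdirectories, deploy and local, for deploy mode and local mode respectively
The directory structure of deploy and local is as follows:
3.5 Enter the local directory for a local deployment
Enter the conf directory:
In gora.properties, set the mysql database connection as follows: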
gora.sqlstore.jdbc.driver=com.mysql.jdbc.Driver
gora.sqlstore.jdbc.url=jdbc:mysql://127.0.0.1:3306/nutch
gora.sqlstore.jdbc.user=root
gora.sqlstore.jdbc.password=root
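As a quick sanity check before rebuilding, the four settings above can be loaded with java.util.Properties to confirm that none is missing or blank. This is only a minimal sketch, not part of Nutch or Gora: the class name GoraPropsCheck and the method missingKeys are hypothetical names chosen for illustration, and the code only inspects the text; it does not open a database connection.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

// Hypothetical helper (not part of Nutch or Gora): reports which of the
// JDBC settings shown above are absent or blank in gora.properties text.
public class GoraPropsCheck {
    static final String[] REQUIRED = {
        "gora.sqlstore.jdbc.driver",
        "gora.sqlstore.jdbc.url",
        "gora.sqlstore.jdbc.user",
        "gora.sqlstore.jdbc.password"
    };

    static List<String> missingKeys(String propertiesText) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(propertiesText));
        } catch (IOException e) {
            // Reading from an in-memory string should not fail; rethrow unchecked.
            throw new UncheckedIOException(e);
        }
        List<String> missing = new ArrayList<>();
        for (String key : REQUIRED) {
            String value = props.getProperty(key);
            if (value == null || value.trim().isEmpty()) {
                missing.add(key);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        String text = String.join("\n",
            "gora.sqlstore.jdbc.driver=com.mysql.jdbc.Driver",
            "gora.sqlstore.jdbc.url=jdbc:mysql://127.0.0.1:3306/nutch",
            "gora.sqlstore.jdbc.user=root",
            "gora.sqlstore.jdbc.password=root");
        // Prints the list of required keys that were not found.
        System.out.println("missing keys: " + missingKeys(text));
    }
}
```

To check a real file, the contents of conf/gora.properties could be read into a string and passed to missingKeys in the same way.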
In nutch-site.xml, add the following properties inside <configuration></configuration>:
<property>
<name>http.agent.name</name>
<value>nutch01</value>
<description>HTTP 'User-Agent' request header. MUST NOT be empty -
please set this to a single word uniquely related to your organization.
NOTE: You should also check other related properties:
http.robots.agents
http.agent.description
http.agent.url
http.agent.email
http.agent.version
and set their values appropriately.
</description>
</property>
<property>
<name>http.accept.language</name>
<value>en-us,en-gb,en,zh-cn,zh-tw;q=0.7,*;q=0.3</value>
<description>Value of the "Accept-Language" request header field.
This allows selecting non-English language as default one to retrieve.
It is a useful setting for search engines built for certain national groups.
</description>
</property>
<property>
<name>parser.character.encoding.default</name>
<value>utf-8</value>
<description>The character encoding to fall back to when no other information
is available</description>
</property>
<property>
<name>storage.data.store.class</name><!-- Note: this property selects which data store to use -->
<value>org.apache.gora.sql.store.SqlStore</value>
<description>The Gora DataStore class for storing and retrieving data.
Currently the following stores are available:
org.apache.gora.sql.store.SqlStore <!-- use a relational database as the data store -->
Default store. A DataStore implementation for RDBMS with a SQL interface.
SqlStore uses JDBC drivers to communicate with the DB. As explained in
ivy.xml, currently >= gora-core 0.3 is not backwards compatible with
SqlStore.
org.apache.gora.cassandra.store.CassandraStore
Gora class for storing data in Apache Cassandra.
org.apache.gora.hbase.store.HBaseStore <!-- use hbase as the data store -->
Gora class for storing data in Apache HBase.
org.apache.gora.accumulo.store.AccumuloStore
Gora class for storing data in Apache Accumulo.
org.apache.gora.avro.store.AvroStore
Gora class for storing data in Apache Avro.
org.apache.gora.avro.store.DataFileAvroStore
Gora class for storing data in Apache Avro. DataFileAvroStore is
a file based store which uses Avro's DataFile{Writer,Reader}'s as a backend.
This datastore supports mapreduce.
org.apache.gora.memory.store.MemStore
Gora class for storing data in a Memory based implementation for tests.
</description>
</property>
<property>
<name>generate.batch.id</name>
<value>cdv</value>
</property>
3.6 Return to the nutch2.2 root directory and enter the ivy directory
Edit ivy.xml:
1) Comment out: <dependency org="org.apache.gora" name="gora-hbase" rev="0.3" conf="*->default" />
2) Uncomment: <dependency org="org.apache.gora" name="gora-sql" rev="0.1.1-incubating" conf="*->default" />
<dependency org="mysql" name="mysql-connector-java" rev="5.1.18" conf="*->default"/>
3.7 Return to the root directory and run ant to recompile nutch2.2
3.8 Enter the runtime/local directory and run: 1) mkdir urls
2) Create a first.txt file in the urls directory containing http://www.tianya.cn
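3.9 Return to the local directory and run: bin/nutch
This lists all available commands, including:
inject load the urls to be crawled into the database
generate select a batch of urls due for crawling from the database and put them in the fetch queue
fetch fetch the urls in the queue and mark them as fetched
parse parse the fetched pages
updatedb update the database with the newly discovered links
solrindex index the parsed content into the solr full-text search index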
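3.10 Look at the inject command in detail:
Run the following command: bin/nutch inject urls
It reports the following error:
The cause of this problem:
The 'text' column was declared as longtext when the webpage table was created. Fix it as follows:
1) Change the column type in the database to BLOB
2) In conf/gora-sql-mapping.xml, change:
<field name="text" column="text" length="32000"/> to: <field name="text" column="text"/>
3) Return to the nutch2.2 root directory and recompile
Run bin/nutch inject urls again; a new record now appears in the mysql database
So what does -crawlId do? Running bin/nutch inject urls -crawlId nutch01 fails with the following error:
Cause of the error:
Searching online turned up no specific explanation, so in the end I had to read the source code:
1) In InjectJob.java I found: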
if ("-crawlId".equals(args[i])) {
getConf().set(Nutch.CRAWL_ID_KEY, args[i+1]);
i++;
}
In other words, if the -crawlId argument is given, its value is stored in the job configuration, where it acts as a global variable
2) In StorageUtils.java:
String crawlId = conf.get(Nutch.CRAWL_ID_KEY, "");
if (!crawlId.isEmpty()) {
conf.set("schema.prefix", crawlId + "_");
} else {
conf.set("schema.prefix", "");
}
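Now it finally makes sense: crawlId is used as a table-name prefix. So when a command is run with -crawlId, the webpage table in the database must be renamed to match. For example, bin/nutch inject urls -crawlId nutch01 requires renaming the webpage table to nutch01_webpage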
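The prefix behaviour read out of the two excerpts can be tried in isolation. The following is a self-contained imitation under stated assumptions, not the Nutch source: a plain Map stands in for the Hadoop configuration object, the key string is a placeholder because the real value of Nutch.CRAWL_ID_KEY is not shown in the excerpts, and the class and method names (CrawlIdPrefixDemo, parseArgs, schemaPrefix, tableName) are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Imitation of the InjectJob / StorageUtils excerpts for illustration only.
public class CrawlIdPrefixDemo {
    // Placeholder for Nutch.CRAWL_ID_KEY; the real constant's value is an unknown here.
    static final String CRAWL_ID_KEY = "crawl.id.placeholder";

    // Mirrors the InjectJob loop: "-crawlId <id>" stores the id in the config.
    // The bounds check on i + 1 is an addition that the excerpt does not have.
    static Map<String, String> parseArgs(String[] args) {
        Map<String, String> conf = new HashMap<>();
        for (int i = 0; i < args.length; i++) {
            if ("-crawlId".equals(args[i]) && i + 1 < args.length) {
                conf.put(CRAWL_ID_KEY, args[i + 1]);
                i++;
            }
        }
        return conf;
    }

    // Mirrors the StorageUtils logic: a non-empty crawl id becomes "<id>_".
    static String schemaPrefix(Map<String, String> conf) {
        String crawlId = conf.getOrDefault(CRAWL_ID_KEY, "");
        return crawlId.isEmpty() ? "" : crawlId + "_";
    }

    // The prefix is prepended to the base table name, as observed with nutch01_webpage.
    static String tableName(String[] args, String baseTable) {
        return schemaPrefix(parseArgs(args)) + baseTable;
    }

    public static void main(String[] args) {
        // Without -crawlId the table name is unchanged.
        System.out.println(tableName(new String[] {"urls"}, "webpage"));
        // With -crawlId nutch01 the table name gains the "nutch01_" prefix.
        System.out.println(tableName(new String[] {"urls", "-crawlId", "nutch01"}, "webpage"));
    }
}
```

Run as written, the first line printed is webpage and the second is nutch01_webpage, which matches the table name the database needs when -crawlId nutch01 is used.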
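At this point I expected bin/nutch inject urls -crawlId nutch01 to run cleanly, but it failed again with the following error:
I could not work out what it meant, so I checked logs/hadoop.log for the details, which were as follows:
It turned out that a method was missing in gora: the gora-0.3.jar used by nutch lacks it. Looking at Persistent.java in gora-0.3 confirmed that getSchema() is indeed absent. I then downloaded the gora-0.2.1 source and found that the class does have the method in that version, so I went back to the nutch2.2 root directory
Open ivy/ivy.xml:
Change <dependency org="org.apache.gora" name="gora-core" rev="0.3" conf="*->default"/> to:
<dependency org="org.apache.gora" name="gora-core" rev="0.2.1" conf="*->default"/>
After recompiling nutch2.2, running bin/nutch inject urls -crawlId nutch01 finally completed without errors
After that, run generate, fetch, parse and updatedb step by step, following the parameter descriptions printed by bin/nutch. To be completed....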