Installing Nutch 2.2 with MySQL

The official site describes it as follows:

Apache Nutch is a highly extensible and scalable open source web crawler software project.

In short, Nutch is an open source web crawler project designed to be both highly extensible and scalable.

The difference between Nutch 1.x and Nutch 2.x:

Storage is abstracted away from any specific underlying data store by using Apache Gora for handling object to persistent mappings. This means we can implement an extremely flexible model/stack for storing everything (fetch time, status, content, parsed text, outlinks, inlinks, etc.) into a number of NoSQL storage solutions.

Storage uses Apache Gora to handle the object-to-persistence mapping, which decouples Nutch's storage from any particular underlying data store.

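First check the Linux environment, as follows:

The deployment of Nutch 2.2.1 + MySQL 5.6 then proceeds as follows:

1 Install MySQL 5.6 (omitted)

After installing MySQL, modify my.cnf as follows

2 Log in to MySQL and create the table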

CREATE TABLE `webpage` (
  `id` varchar(767) NOT NULL,
  `headers` blob,
  `text` longtext DEFAULT NULL,
  `status` int(11) DEFAULT NULL,
  `markers` blob,
  `parseStatus` blob,
  `modifiedTime` bigint(20) DEFAULT NULL,
  `prevModifiedTime` bigint(20) DEFAULT NULL,
  `score` float DEFAULT NULL,
  `typ` varchar(32) CHARACTER SET latin1 DEFAULT NULL,
  `batchId` varchar(32) CHARACTER SET latin1 DEFAULT NULL,
  `baseUrl` varchar(767) DEFAULT NULL,
  `content` longblob,
  `title` varchar(2048) DEFAULT NULL,
  `reprUrl` varchar(767) DEFAULT NULL,
  `fetchInterval` int(11) DEFAULT NULL,
  `prevFetchTime` bigint(20) DEFAULT NULL,
  `inlinks` mediumblob,
  `prevSignature` blob,
  `outlinks` mediumblob,
  `fetchTime` bigint(20) DEFAULT NULL,
  `retriesSinceFetch` int(11) DEFAULT NULL,
  `protocolStatus` blob,
  `signature` blob,
  `metadata` blob,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

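3 Install Nutch 2.2

3.1 Download the Nutch 2.2 source

3.2 Download Ant

3.3 Enter the Nutch 2.2 directory, as shown (note: there is no /runtime directory before compiling):

3.4 Run the ant command to compile. After the build, a new runtime directory appears in the source tree. It contains two subdirectories, deploy and local, for deployed mode and local mode respectively.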
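
As a rough sketch of this step, the snippet below runs ant only when Ant is installed and a build.xml is present in the current directory, then checks for the two runtime subdirectories. The helper name has_runtime is made up for this illustration and is not part of Nutch.

```shell
# Assumption: this is run from the top of the unpacked Nutch 2.2 source tree.
# Build only when Ant is on the PATH and a build.xml is present here.
if command -v ant >/dev/null 2>&1 && [ -f build.xml ]; then
  ant
fi

# Hypothetical helper: succeeds when both runtime modes exist under the tree "$1".
has_runtime() {
  [ -d "$1/runtime/local" ] && [ -d "$1/runtime/deploy" ]
}

if has_runtime .; then
  echo "runtime/local and runtime/deploy are present"
else
  echo "runtime directory not found; has ant been run here?"
fi
```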

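The directory structures of deploy and local are as follows:

3.5 We enter the local directory for a local deployment

Enter the conf directory:

In gora.properties, set the MySQL database connection as follows: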

gora.sqlstore.jdbc.driver=com.mysql.jdbc.Driver
gora.sqlstore.jdbc.url=jdbc:mysql://127.0.0.1:3306/nutch
gora.sqlstore.jdbc.user=root
gora.sqlstore.jdbc.password=root
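
To double-check which database the JDBC URL points at, something like the following could be used. It is only a sketch: jdbc_db_name is a hypothetical helper, and it assumes the URL has the jdbc:mysql://host:port/database form shown here. The database it prints (nutch in this case) should be the one that holds the webpage table.

```shell
# Hypothetical helper: print the database name from the
# gora.sqlstore.jdbc.url line of the properties file given as "$1".
jdbc_db_name() {
  sed -n 's#^gora\.sqlstore\.jdbc\.url=jdbc:mysql://[^/]*/\([^?]*\).*#\1#p' "$1"
}

# Assumption: run from runtime/local, where conf/gora.properties lives.
props="conf/gora.properties"
if [ -f "$props" ]; then
  echo "Nutch will use database: $(jdbc_db_name "$props")"
fi
```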

In nutch-site.xml, add the following properties inside <configuration></configuration>:

<property>
<name>http.agent.name</name>
<value>nutch01</value>
<description>HTTP 'User-Agent' request header. MUST NOT be empty -
please set this to a single word uniquely related to your organization.
NOTE: You should also check other related properties:
http.robots.agents
http.agent.description
http.agent.url
http.agent.email
http.agent.version
and set their values appropriately.
</description>
</property>
<property>
<name>http.accept.language</name>
<value>en-us,en-gb,en,zh-cn,zh-tw;q=0.7,*;q=0.3</value>
<description>Value of the "Accept-Language" request header field.
This allows selecting non-English language as default one to retrieve.
It is a useful setting for search engines build for certain national group.
</description>
</property>
<property>
<name>parser.character.encoding.default</name>
<value>utf-8</value>
<description>The character encoding to fall back to when no other information
is available</description>
</property>
<property>
<name>storage.data.store.class</name>
<value>org.apache.gora.sql.store.SqlStore</value>
<description>The Gora DataStore class for storing and retrieving data.
Currently the following stores are available:
org.apache.gora.sql.store.SqlStore 
Default store. A DataStore implementation for RDBMS with a SQL interface.
SqlStore uses JDBC drivers to communicate with the DB. As explained in
ivy.xml, currently >= gora-core 0.3 is not backwards compatable with
SqlStore.
org.apache.gora.cassandra.store.CassandraStore
Gora class for storing data in Apache Cassandra.
org.apache.gora.hbase.store.HBaseStore 
Gora class for storing data in Apache HBase.
org.apache.gora.accumulo.store.AccumuloStore
Gora class for storing data in Apache Accumulo.
org.apache.gora.avro.store.AvroStore
Gora class for storing data in Apache Avro.
org.apache.gora.avro.store.DataFileAvroStore
Gora class for storing data in Apache Avro. DataFileAvroStore is
a file based store which uses Avro's DataFile{Writer,Reader}'s as a backend.
This datastore supports mapreduce.
org.apache.gora.memory.store.MemStore
Gora class for storing data in a Memory based implementation for tests.
</description>
</property>

<property>
<name>generate.batch.id</name>
<value>cdv</value>
</property>

3.6 Go back to the Nutch 2.2 home directory and enter the ivy directory

Edit ivy.xml:

1) Comment out: <dependency org="org.apache.gora" name="gora-hbase" rev="0.3" conf="*->default" />

2) Uncomment: <dependency org="org.apache.gora" name="gora-sql" rev="0.1.1-incubating" conf="*->default" />
<dependency org="mysql" name="mysql-connector-java" rev="5.1.18" conf="*->default"/>

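3.7 Go back to the home directory and run ant to recompile Nutch 2.2

3.8 Enter the runtime/local directory and run: 1) mkdir urls

2) Create a first.txt file in the urls directory containing http://www.tianya.cn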
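
The two sub-steps above can be sketched as follows, assuming the current directory is runtime/local and that first.txt belongs inside the new urls directory:

```shell
# Create the seed directory and write the single seed URL into it.
mkdir -p urls
printf '%s\n' 'http://www.tianya.cn' > urls/first.txt

# Show the seed list to confirm what inject will read.
cat urls/first.txt
```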

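3.9 Go back to the local directory and run: bin/nutch

This lists all available commands, including:

inject: loads the URLs to be crawled into the database

generate: selects a batch of URLs from the database and puts them in the fetch queue

fetch: fetches the URLs in the fetch queue and marks them as fetched

parse: parses the fetched pages

updatedb: writes newly discovered links back to the database

solrindex: pushes the parsed content into the Solr full-text index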

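3.10 A closer look at the inject command:

Run the following command: bin/nutch inject urls

It reports the following error:

The cause of this problem:

When the webpage table was created, the 'text' column was defined as longtext. Fix it as follows:

1) Change the column type in the database to BLOB

2) Update conf/gora-sql-mapping.xml accordingly:

3) Go back to the Nutch 2.2 home directory and recompile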
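
For step 1), a statement along these lines should do the job. This is a sketch that assumes the database is named nutch and the user is root, as in the gora.properties settings earlier. The mysql invocation is left commented out because it prompts for a password.

```shell
# Statement that changes the `text` column from longtext to BLOB.
sql='ALTER TABLE `webpage` MODIFY `text` BLOB;'
echo "$sql"

# To apply it (assumed database "nutch", user "root"):
#   mysql -u root -p nutch -e "$sql"
```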

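Run bin/nutch inject urls again, and a new record shows up in the MySQL database

So what does -crawlId do? Running bin/nutch inject urls -crawlId nutch01 fails with the following error:

Cause of the error:

Searching online did not turn up a specific cause, so in the end I had to read the source code:

1) In InjectorJob.java we find: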

if ("-crawlId".equals(args[i])) {
getConf().set(Nutch.CRAWL_ID_KEY, args[i+1]);
i++;
}

In other words, when -crawlId is given, its value is stored in the job configuration, where it acts as a global setting

2) In StorageUtils.java

String crawlId = conf.get(Nutch.CRAWL_ID_KEY, "");

if (!crawlId.isEmpty()) {
conf.set("schema.prefix", crawlId + "_");
} else {
conf.set("schema.prefix", "");
}

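Now it finally makes sense: crawlId is used as a prefix for the table name. So whenever -crawlId is passed on the command line, the webpage table in the database has to be renamed to match. For example, for bin/nutch inject urls -crawlId nutch01 the webpage table must be renamed to nutch01_webpage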
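
The naming rule can be mirrored in a few lines of shell. This is only an illustration of the prefix logic from StorageUtils.java: table_for_crawl_id is a hypothetical helper, and the RENAME TABLE statement is printed rather than executed.

```shell
# Hypothetical helper: table name Nutch looks for, given a crawl id ("$1").
# An empty crawl id means no prefix, matching the else branch in StorageUtils.
table_for_crawl_id() {
  if [ -n "$1" ]; then
    printf '%s_webpage\n' "$1"
  else
    printf 'webpage\n'
  fi
}

crawl_id="nutch01"
table=$(table_for_crawl_id "$crawl_id")

# Print the MySQL statement that renames the existing table accordingly.
echo "RENAME TABLE \`webpage\` TO \`$table\`;"
```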

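At this point I expected bin/nutch inject urls -crawlId nutch01 to finally run cleanly, but it failed again with the following error:

I could not make sense of the message, so I looked in logs/hadoop.log for the details, shown below:

It turns out that a method is missing in Gora: Nutch calls a method that gora-0.3.jar does not provide. Looking at Persistent.java in gora-0.3 confirmed that getSchema() is indeed missing. I then downloaded the gora-0.2.1 source and found that this version of the class does have the method. So, back to the Nutch 2.2 root directory

Open ivy/ivy.xml:

Change <dependency org="org.apache.gora" name="gora-core" rev="0.3" conf="*->default"/> to: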

<dependency org="org.apache.gora" name="gora-core" rev="0.2.1" conf="*->default"/>

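After recompiling Nutch 2.2 and running bin/nutch inject urls -crawlId nutch01 again, the errors are finally gone

From here, run generate, fetch, parse and updatedb step by step, following the parameter descriptions that bin/nutch prints for each. To be continued....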
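
A possible shape for those remaining steps is sketched below. The flags (-topN, -all, -crawlId) are what I recall from the Nutch 2.x usage messages and should be treated as assumptions; running each command without arguments prints the authoritative list. run_cycle is a made-up helper name, and the batch size of 10 is arbitrary.

```shell
# Hypothetical helper: run one generate/fetch/parse/updatedb round.
#   $1 = path to the nutch script, $2 = crawl id (table prefix)
# Each step runs only if the previous one succeeded.
run_cycle() {
  nutch="$1"
  crawl_id="$2"
  "$nutch" generate -topN 10 -crawlId "$crawl_id" &&
  "$nutch" fetch -all -crawlId "$crawl_id" &&
  "$nutch" parse -all -crawlId "$crawl_id" &&
  "$nutch" updatedb -crawlId "$crawl_id"
}

# Usage, from runtime/local:
#   run_cycle bin/nutch nutch01
```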
