1. Goal
-
Nutch 2.2.1 + Elasticsearch 0.90.5 + MySQL 5.5.12
2. Environment preparation
-
Download Nutch 2.2.1
-
Download Elasticsearch 0.90.5
-
Download MySQL 5.5.12
-
Download Java 6
-
Download Ant (with Ivy)
3. Database initialization
-
Create the database:
CREATE DATABASE nutch DEFAULT CHARACTER SET utf8 DEFAULT COLLATE utf8_general_ci;
-
Create the table:
CREATE TABLE `webpage` (
  `id` varchar(255) NOT NULL,
  `headers` blob,
  `text` mediumtext,
  `status` int(11) DEFAULT NULL,
  `markers` blob,
  `parseStatus` blob,
  `modifiedTime` bigint(20) DEFAULT NULL,
  `score` float DEFAULT NULL,
  `typ` varchar(32) CHARACTER SET latin1 DEFAULT NULL,
  `baseUrl` varchar(767) DEFAULT NULL,
  `content` longblob,
  `title` varchar(2048) DEFAULT NULL,
  `reprUrl` varchar(767) DEFAULT NULL,
  `fetchInterval` int(11) DEFAULT NULL,
  `prevFetchTime` bigint(20) DEFAULT NULL,
  `inlinks` mediumblob,
  `prevSignature` blob,
  `outlinks` mediumblob,
  `fetchTime` bigint(20) DEFAULT NULL,
  `retriesSinceFetch` int(11) DEFAULT NULL,
  `protocolStatus` blob,
  `signature` blob,
  `metadata` blob,
  `batchId` varchar(255) DEFAULT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
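A quick arithmetic check on why `id` can be declared as varchar(255) and still serve as the primary key. This sketch assumes two things the post does not state: that InnoDB in MySQL 5.5 (default row format) limits a single-column index key to 767 bytes, and that the utf8 charset reserves up to 3 bytes per character. The constant and function names are illustrative only.

```python
# Assumptions (not from the original post): a 767-byte InnoDB index key limit
# for the default row format, and at most 3 bytes per character for utf8.
INNODB_MAX_KEY_BYTES = 767
UTF8_MAX_BYTES_PER_CHAR = 3


def max_indexable_chars(max_key_bytes: int, bytes_per_char: int) -> int:
    """Largest VARCHAR length whose worst-case byte size fits in one index key."""
    return max_key_bytes // bytes_per_char


def fits_in_index(varchar_len: int) -> bool:
    """True if a utf8 VARCHAR(varchar_len) column can be indexed in full."""
    return varchar_len * UTF8_MAX_BYTES_PER_CHAR <= INNODB_MAX_KEY_BYTES


if __name__ == "__main__":
    print(max_indexable_chars(INNODB_MAX_KEY_BYTES, UTF8_MAX_BYTES_PER_CHAR))
    print(fits_in_index(255))  # the `id` column
    print(fits_in_index(767))  # a utf8 varchar(767) would be too wide to index
```

Under these assumptions, `baseUrl` and `reprUrl` can stay at varchar(767) only because the table does not index them.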
4. Importing the Nutch source into Eclipse
-
tar xvfz apache-nutch-2.2.1-src.tar.gz
-
cd apache-nutch-2.2.1
-
vi ivy/ivy.xml (make sure the following dependencies are set):
<dependency org="org.apache.gora" name="gora-core" rev="0.2.1" conf="*->default"/>
<dependency org="org.apache.gora" name="gora-sql" rev="0.1.1-incubating" conf="*->default" />
<dependency org="mysql" name="mysql-connector-java" rev="5.1.18" conf="*->default"/>
-
ant eclipse
-
New Project->Java Project from Existing Ant Buildfile->select “apache-nutch-2.2.1”
-
Select “apache-nutch-2.2.1”, open Properties -> Java Build Path -> Source, and add “apache-nutch-2.2.1/conf” as a source folder
-
Edit apache-nutch-2.2.1/conf/gora.properties:
###############################
# Default MySQL properties #
###############################
gora.sqlstore.jdbc.driver=com.mysql.jdbc.Driver
gora.sqlstore.jdbc.url=jdbc:mysql://10.11.156.226:3306/nutch?createDatabaseIfNotExist=true
gora.sqlstore.jdbc.user=nutch
gora.sqlstore.jdbc.password=nutch
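Before launching Nutch it can help to confirm that the JDBC URL points at the host, port, and database you expect. The sketch below is a standalone check, not part of Nutch or Gora: it parses the properties shown above from an inline string (the helper names `parse_properties` and `split_jdbc_url` are made up for this example) and breaks the URL into its parts.

```python
from urllib.parse import parse_qs, urlsplit

# Inline copy of the settings above; in practice you would read conf/gora.properties.
GORA_PROPERTIES = """\
# Default MySQL properties
gora.sqlstore.jdbc.driver=com.mysql.jdbc.Driver
gora.sqlstore.jdbc.url=jdbc:mysql://10.11.156.226:3306/nutch?createDatabaseIfNotExist=true
gora.sqlstore.jdbc.user=nutch
gora.sqlstore.jdbc.password=nutch
"""


def parse_properties(text: str) -> dict:
    """Parse simple key=value lines, skipping blank lines and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Split on the first "=" only, because the JDBC URL contains one too.
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def split_jdbc_url(jdbc_url: str) -> dict:
    """Break a jdbc:mysql:// URL into host, port, database, and options."""
    if not jdbc_url.startswith("jdbc:"):
        raise ValueError("not a JDBC URL: " + jdbc_url)
    parts = urlsplit(jdbc_url[len("jdbc:"):])
    return {
        "host": parts.hostname,
        "port": parts.port,
        "database": parts.path.lstrip("/"),
        "options": {k: v[0] for k, v in parse_qs(parts.query).items()},
    }


if __name__ == "__main__":
    props = parse_properties(GORA_PROPERTIES)
    print(split_jdbc_url(props["gora.sqlstore.jdbc.url"]))
```

The database name in the URL should match the one created in section 3 (nutch).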
-
Edit apache-nutch-2.2.1/conf/nutch-default.xml:
<property>
  <name>plugin.folders</name>
  <value>build/plugins</value>
  <description>Directories where nutch plugins are located. Each
  element may be a relative or absolute path. If absolute, it is used
  as is. If relative, it is searched for on the classpath.</description>
</property>
-
Edit apache-nutch-2.2.1/conf/nutch-site.xml:
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>CsNutchSpider</value>
  </property>
  <property>
    <name>http.accept.language</name>
    <value>ja-jp, en-us,en-gb,en;q=0.7,*;q=0.3</value>
  </property>
  <property>
    <name>storage.data.store.class</name>
    <value>org.apache.gora.sql.store.SqlStore</value>
  </property>
  <property>
    <name>parser.character.encoding.default</name>
    <value>utf-8</value>
  </property>
  <property>
    <name>generate.batch.id</name>
    <value>*</value>
  </property>
</configuration>
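A malformed or incomplete nutch-site.xml is an easy mistake to make when editing by hand. The following sketch reads a Hadoop-style configuration file into a dictionary and reports which settings are missing or empty. It is only a sanity check under my own assumptions: the `REQUIRED` tuple and the helper names are hypothetical, and the inline XML is a shortened copy of the file above.

```python
import xml.etree.ElementTree as ET

# Shortened inline copy of the configuration above; normally read from conf/nutch-site.xml.
NUTCH_SITE_XML = """\
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>CsNutchSpider</value>
  </property>
  <property>
    <name>storage.data.store.class</name>
    <value>org.apache.gora.sql.store.SqlStore</value>
  </property>
  <property>
    <name>generate.batch.id</name>
    <value>*</value>
  </property>
</configuration>
"""

# Hypothetical list of settings this walkthrough relies on.
REQUIRED = ("http.agent.name", "storage.data.store.class")


def load_properties(xml_text: str) -> dict:
    """Return {name: value} for every <property> element in the document."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.findall("property"):
        name = (prop.findtext("name") or "").strip()
        value = (prop.findtext("value") or "").strip()
        if name:
            props[name] = value
    return props


def missing_properties(props: dict, required=REQUIRED) -> list:
    """List required names that are absent or have an empty value."""
    return [name for name in required if not props.get(name)]


if __name__ == "__main__":
    props = load_properties(NUTCH_SITE_XML)
    print(props)
    print("missing:", missing_properties(props))
```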
5. Debugging the Nutch source
-
Select “src/java” and create urls/seed.txt (one URL per line):
http://cloudscape.sohu.com
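The seed file is plain text, so it can also be generated and checked by a script. This is a minimal sketch, assuming that each seed must be an absolute http or https URL; the functions `write_seed_file` and `invalid_seed_lines` are invented for this example and are not Nutch tools.

```python
import os
import tempfile
from urllib.parse import urlsplit


def write_seed_file(base_dir: str, urls: list) -> str:
    """Create <base_dir>/urls/seed.txt with one URL per line and return its path."""
    seed_dir = os.path.join(base_dir, "urls")
    os.makedirs(seed_dir, exist_ok=True)
    seed_path = os.path.join(seed_dir, "seed.txt")
    with open(seed_path, "w", encoding="utf-8") as fh:
        for url in urls:
            fh.write(url.strip() + "\n")
    return seed_path


def invalid_seed_lines(seed_path: str) -> list:
    """Return (line_number, text) pairs for lines that are not absolute http(s) URLs."""
    bad = []
    with open(seed_path, encoding="utf-8") as fh:
        for number, line in enumerate(fh, start=1):
            url = line.strip()
            if not url:
                continue  # ignore blank lines
            parts = urlsplit(url)
            if parts.scheme not in ("http", "https") or not parts.netloc:
                bad.append((number, url))
    return bad


if __name__ == "__main__":
    # A temporary directory stands in for the project folder here.
    base = tempfile.mkdtemp()
    path = write_seed_file(base, ["http://cloudscape.sohu.com"])
    print(path, invalid_seed_lines(path))
```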
-
Select Crawler.java under “src/java” and launch it with Debug or Run
-
Launch configuration:
Program arguments: urls -depth 3 -topN 5
VM arguments: -Xms512m -Xmx1024m -XX:MaxPermSize=256m
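To make the program arguments concrete: as I understand the crawl command, `urls` is the seed directory, `-depth` is the number of generate/fetch/update rounds, and `-topN` caps how many URLs are selected in each round. The sketch below parses the same argument string with Python's argparse and derives an upper bound on fetched pages. It is an illustration only and not how Crawler.java reads its options; the default values are placeholders.

```python
import argparse


def parse_crawl_args(argv: list) -> argparse.Namespace:
    """Parse arguments shaped like: urls -depth 3 -topN 5."""
    parser = argparse.ArgumentParser(prog="Crawler")
    parser.add_argument("seed_dir", help="directory containing seed.txt")
    # Placeholder defaults, chosen for this sketch only.
    parser.add_argument("-depth", type=int, default=1, help="number of crawl rounds")
    parser.add_argument("-topN", type=int, default=10, help="max URLs per round")
    return parser.parse_args(argv)


def max_fetched_pages(depth: int, top_n: int) -> int:
    """Upper bound on pages fetched: at most top_n URLs in each of depth rounds."""
    return depth * top_n


if __name__ == "__main__":
    args = parse_crawl_args("urls -depth 3 -topN 5".split())
    print(args.seed_dir, args.depth, args.topN)
    print(max_fetched_pages(args.depth, args.topN))
```

With a single seed URL the first round can fetch only one page, so the real count will be lower than this bound.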
-
Check the database to confirm that records for the crawled URLs are there
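When looking through the `webpage` table, note that the `id` values do not look like ordinary URLs. To my knowledge Nutch 2.x stores each row under a reversed-host key (for example `com.sohu.cloudscape:http` rather than `http://cloudscape.sohu.com`). The sketch below reproduces that transformation as I understand it and builds a parameterized lookup; treat `reverse_url` as an approximation to be verified against `TableUtil.reverseUrl` in your checkout. The `%s` placeholder style is an assumption about whichever MySQL client library you use.

```python
from urllib.parse import urlsplit


def reverse_url(url: str) -> str:
    """Approximate the Nutch 2.x row key: reversed host, then protocol, port, path."""
    parts = urlsplit(url)
    host = ".".join(reversed(parts.hostname.split(".")))
    key = host + ":" + parts.scheme
    if parts.port is not None:
        key += ":" + str(parts.port)
    file_part = parts.path
    if parts.query:
        file_part += "?" + parts.query
    if file_part and not file_part.startswith("/"):
        key += "/"
    return key + file_part


def lookup_query(url: str) -> tuple:
    """Build a prefix query on webpage.id; columns come from the schema in section 3."""
    sql = "SELECT id, status, fetchTime, title FROM webpage WHERE id LIKE %s"
    # Prefix match, since Nutch may normalize the URL (e.g. add a trailing slash).
    return sql, (reverse_url(url) + "%",)


if __name__ == "__main__":
    print(reverse_url("http://cloudscape.sohu.com"))
    print(lookup_query("http://cloudscape.sohu.com"))
```

The returned SQL string and parameter tuple can be passed to a cursor's `execute` call in a MySQL client of your choice.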
6. Some Nutch issues:
-
To be added later.
Everything is in place; you can now start custom development on the source!