Nutch2.X In Eclipse

1. Goal

  • Nutch 2.2.1 + Elasticsearch 0.90.5 + MySQL 5.5.12


2. Environment Setup

  • Download Nutch 2.2.1

  • Download Elasticsearch 0.90.5

  • Download MySQL 5.5.12

  • Download Java 6

  • Download Ant (with Ivy)

3. Database Initialization

  • Create the database

    CREATE DATABASE nutch DEFAULT CHARACTER SET utf8 DEFAULT COLLATE utf8_general_ci;

  • Create the table

    CREATE TABLE `webpage` (
      `id` varchar(255) NOT NULL,
      `headers` blob,
      `text` mediumtext,
      `status` int(11) DEFAULT NULL,
      `markers` blob,
      `parseStatus` blob,
      `modifiedTime` bigint(20) DEFAULT NULL,
      `score` float DEFAULT NULL,
      `typ` varchar(32) CHARACTER SET latin1 DEFAULT NULL,
      `baseUrl` varchar(767) DEFAULT NULL,
      `content` longblob,
      `title` varchar(2048) DEFAULT NULL,
      `reprUrl` varchar(767) DEFAULT NULL,
      `fetchInterval` int(11) DEFAULT NULL,
      `prevFetchTime` bigint(20) DEFAULT NULL,
      `inlinks` mediumblob,
      `prevSignature` blob,
      `outlinks` mediumblob,
      `fetchTime` bigint(20) DEFAULT NULL,
      `retriesSinceFetch` int(11) DEFAULT NULL,
      `protocolStatus` blob,
      `signature` blob,
      `metadata` blob,
      `batchId` varchar(255) DEFAULT NULL,
      PRIMARY KEY (`id`)
    ) ENGINE=InnoDB DEFAULT CHARSET=utf8;


4. Importing the Nutch Source into Eclipse

  • tar xvfz apache-nutch-2.2.1-src.tar.gz

  • cd apache-nutch-2.2.1

  • vi ivy/ivy.xml, and make sure the following dependencies are present and not commented out:

    <dependency org="org.apache.gora" name="gora-core" rev="0.2.1" conf="*->default"/>
    <dependency org="org.apache.gora" name="gora-sql" rev="0.1.1-incubating" conf="*->default"/>
    <dependency org="mysql" name="mysql-connector-java" rev="5.1.18" conf="*->default"/>

  • ant eclipse

  • In Eclipse: New Project -> Java Project from Existing Ant Buildfile -> select "apache-nutch-2.2.1"

  • Select the "apache-nutch-2.2.1" project, choose "Properties" -> Java Build Path -> Source, and add "apache-nutch-2.2.1/conf" as a source folder

  • Edit apache-nutch-2.2.1/conf/gora.properties

    ###############################
    # Default MySQL properties    #
    ###############################
    gora.sqlstore.jdbc.driver=com.mysql.jdbc.Driver
    gora.sqlstore.jdbc.url=jdbc:mysql://10.11.156.226:3306/nutch?createDatabaseIfNotExist=true
    gora.sqlstore.jdbc.user=nutch
    gora.sqlstore.jdbc.password=nutch


  • Edit apache-nutch-2.2.1/conf/nutch-default.xml

    <property>
      <name>plugin.folders</name>
      <value>build/plugins</value>
      <description>Directories where nutch plugins are located.  Each
      element may be a relative or absolute path.  If absolute, it is used
      as is.  If relative, it is searched for on the classpath.</description>
    </property>

  • Edit apache-nutch-2.2.1/conf/nutch-site.xml

    <configuration>
      <property>
        <name>http.agent.name</name>
        <value>CsNutchSpider</value>
      </property>
      <property>
        <name>http.accept.language</name>
        <value>ja-jp, en-us,en-gb,en;q=0.7,*;q=0.3</value>
      </property>
      <property>
        <name>storage.data.store.class</name>
        <value>org.apache.gora.sql.store.SqlStore</value>
      </property>
      <property>
        <name>parser.character.encoding.default</name>
        <value>utf-8</value>
      </property>
      <property>
        <name>generate.batch.id</name>
        <value>*</value>
      </property>
    </configuration>
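
A typo in conf/gora.properties tends to show up only when the crawl starts and the store cannot connect. As a quick sanity check before launching, here is a minimal sketch (not part of Nutch) that loads the file with java.util.Properties and reports which of the four gora.sqlstore.jdbc.* keys shown above are missing or blank. The class name GoraPropertiesCheck and its methods are hypothetical names chosen for illustration, and the default path assumes the program is run from the project root. The code avoids Java 7 syntax so that it compiles on Java 6.

```java
import java.io.FileReader;
import java.io.IOException;
import java.io.Reader;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

// Hypothetical helper, not part of Nutch or Gora.
public class GoraPropertiesCheck {

    // The four JDBC settings configured in conf/gora.properties above.
    static final String[] REQUIRED_KEYS = {
        "gora.sqlstore.jdbc.driver",
        "gora.sqlstore.jdbc.url",
        "gora.sqlstore.jdbc.user",
        "gora.sqlstore.jdbc.password"
    };

    // Parses properties text from any reader (a file or an in-memory string).
    static Properties load(Reader reader) throws IOException {
        Properties props = new Properties();
        props.load(reader);
        return props;
    }

    // Returns the required keys that are absent or have a blank value.
    static List<String> missingKeys(Properties props) {
        List<String> missing = new ArrayList<String>();
        for (String key : REQUIRED_KEYS) {
            String value = props.getProperty(key);
            if (value == null || value.trim().isEmpty()) {
                missing.add(key);
            }
        }
        return missing;
    }

    public static void main(String[] args) throws IOException {
        // Assumed default location, relative to the project root.
        String path = args.length > 0 ? args[0] : "conf/gora.properties";
        Reader reader = new FileReader(path);
        try {
            List<String> missing = missingKeys(load(reader));
            if (missing.isEmpty()) {
                System.out.println("All JDBC settings present in " + path);
            } else {
                System.out.println("Missing or blank in " + path + ": " + missing);
            }
        } finally {
            reader.close();
        }
    }
}
```

This only checks that the settings exist; it does not open a database connection, so a wrong host or password would still have to be caught by running the crawl.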

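5. Debugging the Nutch Source

  • Select "src/java" and create urls/seed.txt under it (one URL per line)

      http://cloudscape.sohu.com

  • Select Crawler.java under "src/java"; you can then Debug or Run it

  • Configure the launch arguments:

      Program arguments: urls -depth 3 -topN 5

      VM arguments: -Xms512m -Xmx1024m -XX:MaxPermSize=256m


  • Check the database to see whether the crawled URL records are present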
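
If the database stays empty after a run, a malformed line in urls/seed.txt is one possible cause worth ruling out. The following is a minimal sketch (not part of Nutch) that checks each seed line with java.net.URI and lists the ones that are not absolute http or https URLs. The class name SeedListCheck and its methods are hypothetical, the default path is an assumption, and skipping lines that start with "#" as comments is also an assumption made for this example.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.net.URI;
import java.net.URISyntaxException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Nutch.
public class SeedListCheck {

    // True if the line parses as an absolute http(s) URL with a host.
    static boolean isValidSeed(String line) {
        try {
            URI uri = new URI(line.trim());
            String scheme = uri.getScheme();
            boolean httpLike = "http".equalsIgnoreCase(scheme)
                    || "https".equalsIgnoreCase(scheme);
            return httpLike && uri.getHost() != null;
        } catch (URISyntaxException e) {
            return false;
        }
    }

    // Returns the lines that fail the check. Blank lines and lines starting
    // with '#' are skipped (treated as comments: an assumption of this sketch).
    static List<String> invalidSeeds(List<String> lines) {
        List<String> invalid = new ArrayList<String>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty() || trimmed.startsWith("#")) {
                continue;
            }
            if (!isValidSeed(trimmed)) {
                invalid.add(trimmed);
            }
        }
        return invalid;
    }

    public static void main(String[] args) throws IOException {
        // Assumed default location; pass another path as the first argument.
        String path = args.length > 0 ? args[0] : "urls/seed.txt";
        List<String> lines = new ArrayList<String>();
        BufferedReader reader = new BufferedReader(new FileReader(path));
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        } finally {
            reader.close();
        }
        System.out.println("Invalid seed lines: " + invalidSeeds(lines));
    }
}
```

The seed used in this post, http://cloudscape.sohu.com, passes the check; a line such as cloudscape.sohu.com without a scheme would be reported.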

6. Some Nutch Issues

  • To be added later


Everything is OK; you can now start custom development on the source!

