A First Look at the crawl and nutch Commands in Nutch 2.3

1. Environment Information
Hardware: virtual machine
OS: CentOS 6.4, 64-bit
IP: 10.51.121.10
Hostname: datanode-4
Install user: root
Nutch: Nutch 2.3, installed under /root/nutch/apache-nutch-2.3
HBase: HBase 0.94.14, installed under /root/hadoop/hbase-0.94.14
Solr: solr-4.10.3, installed under /root/nutch/solr-4.10.3
For the single-machine integration of Nutch 2.3 + HBase 0.94 + Solr 4.10.3, see: http://blog.csdn.net/freedomboy319/article/details/44172277

2. Using the crawl Command
Go to the /root/nutch/apache-nutch-2.3/runtime/local directory and look at the bin/crawl script. Running it without arguments prints its usage:

# ./bin/crawl
Usage: crawl <seedDir> <crawlID> [<solrUrl>] <numberOfRounds>
<seedDir>: directory containing the seed file(s) of start URLs
<crawlID>: ID of the crawl job
<solrUrl>: URL of the Solr instance used for indexing and search
<numberOfRounds>: number of iterations, i.e. the crawl depth
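For example, a minimal seed setup might look like this (the directory name and the URL are only placeholders; any list of URLs, one per line, works):

```shell
# Create a seed directory with one URL per line.
# "myUrls" and the URL below are examples, not fixed names.
mkdir -p myUrls
printf 'http://nutch.apache.org/\n' > myUrls/seed.txt

# A two-round crawl indexed into a local Solr core would then be:
# ./bin/crawl ./myUrls/ mycrawl1 http://localhost:8983/solr/ 2
```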

The core part of the crawl script is as follows:

commonOptions="-D mapred.reduce.tasks=$numTasks -D mapred.child.java.opts=-Xmx1000m -D mapred.reduce.tasks.speculative.execution=false -D mapred.map.tasks.speculative.execution=false -D mapred.compress.map.output=true"
# 1. Initial injection
$bin/nutch inject "$SEEDDIR" -crawlId "$CRAWL_ID"
# 2. Loop over generate - fetch - parse - updatedb - index; the number of iterations is given by the numberOfRounds parameter.
1)$bin/nutch generate $commonOptions -topN $sizeFetchlist -noNorm -noFilter -adddays $addDays -crawlId "$CRAWL_ID" -batchId $batchId  
2)$bin/nutch fetch $commonOptions -D fetcher.timelimit.mins=$timeLimitFetch $batchId -crawlId "$CRAWL_ID" -threads 50
3)$bin/nutch parse $commonOptions $skipRecordsOptions $batchId -crawlId "$CRAWL_ID" 
4)$bin/nutch updatedb $commonOptions $batchId -crawlId "$CRAWL_ID" 
5) If the solrUrl parameter is non-empty, also run:
$bin/nutch index $commonOptions -D solr.server.url=$SOLRURL -all -crawlId "$CRAWL_ID"  
$bin/nutch solrdedup $commonOptions $SOLRURL
If the solrUrl parameter is empty, the round ends after updatedb.
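Taken together, the script's control flow can be sketched as below. This is a simplified sketch, not the real crawl script: the option lists are abbreviated, values such as LIMIT and the seed path are examples, and the nutch invocations are echoed rather than executed so the structure can be run standalone.

```shell
# Simplified sketch of the crawl script's control flow (not the real script).
# Variable names follow the fragments quoted above; the values are examples.
bin=./bin
SEEDDIR=./myUrls
CRAWL_ID=mycrawl1
SOLRURL=http://localhost:8983/solr/
LIMIT=2   # corresponds to <numberOfRounds>

run() { echo "$@"; }   # stand-in: print each step instead of executing it

# 1. Initial injection
run "$bin/nutch" inject "$SEEDDIR" -crawlId "$CRAWL_ID"

# 2. generate - fetch - parse - updatedb (- index), repeated LIMIT times
for ((a = 1; a <= LIMIT; a++)); do
  batchId="$(date +%s)-$RANDOM"
  run "$bin/nutch" generate -topN 50000 -crawlId "$CRAWL_ID" -batchId "$batchId"
  run "$bin/nutch" fetch "$batchId" -crawlId "$CRAWL_ID" -threads 50
  run "$bin/nutch" parse "$batchId" -crawlId "$CRAWL_ID"
  run "$bin/nutch" updatedb "$batchId" -crawlId "$CRAWL_ID"
  if [ -n "$SOLRURL" ]; then
    run "$bin/nutch" index -D solr.server.url="$SOLRURL" -all -crawlId "$CRAWL_ID"
    run "$bin/nutch" solrdedup "$SOLRURL"
  fi
done
```

Replacing run with direct execution and restoring the full option lists (the $commonOptions shown above) recovers the behaviour described here.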

The classes involved in the crawl command are:
(1)org.apache.nutch.crawl.InjectorJob
(2)org.apache.nutch.crawl.GeneratorJob
(3)org.apache.nutch.fetcher.FetcherJob
(4)org.apache.nutch.parse.ParserJob
(5)org.apache.nutch.crawl.DbUpdaterJob
(6)org.apache.nutch.indexer.IndexingJob
(7)org.apache.nutch.indexer.solr.SolrDeleteDuplicates

3. The nutch Command
Go to the /root/nutch/apache-nutch-2.3/runtime/local directory and look at the bin/nutch script. Running it without arguments prints the command list:

# ./bin/nutch 
Usage: nutch COMMAND
where COMMAND is one of:
 inject         inject new urls into the database
 hostinject     creates or updates an existing host table from a text file
 generate       generate new batches to fetch from crawl db
 fetch          fetch URLs marked during generate
 parse          parse URLs marked during fetch
 updatedb       update web table after parsing
 updatehostdb   update host table after parsing
 readdb         read/dump records from page database
 readhostdb     display entries from the hostDB
 index          run the plugin-based indexer on parsed batches
 elasticindex   run the elasticsearch indexer - DEPRECATED use the index command instead
 solrindex      run the solr indexer on parsed batches - DEPRECATED use the index command instead
 solrdedup      remove duplicates from solr
 solrclean      remove HTTP 301 and 404 documents from solr - DEPRECATED use the clean command instead
 clean          remove HTTP 301 and 404 documents and duplicates from indexing backends configured via plugins
 parsechecker   check the parser for a given url
 indexchecker   check the indexing filters for a given url
 plugin         load a plugin and run one of its classes main()
 nutchserver    run a (local) Nutch server on a user defined port
 webapp         run a local Nutch web application
 junit          runs the given JUnit test
 or
 CLASSNAME      run the class named CLASSNAME
Most commands print help when invoked w/o parameters.

Invoking any command without parameters prints its detailed usage, for example:

# ./bin/nutch inject
Usage: InjectorJob <url_dir> [-crawlId <id>]

The class behind each command is:

inject=org.apache.nutch.crawl.InjectorJob
hostinject=org.apache.nutch.host.HostInjectorJob
generate=org.apache.nutch.crawl.GeneratorJob
fetch=org.apache.nutch.fetcher.FetcherJob
parse=org.apache.nutch.parse.ParserJob
updatedb=org.apache.nutch.crawl.DbUpdaterJob
updatehostdb=org.apache.nutch.host.HostDbUpdateJob
readdb=org.apache.nutch.crawl.WebTableReader
readhostdb=org.apache.nutch.host.HostDbReader
elasticindex=org.apache.nutch.indexer.elastic.ElasticIndexerJob
solrindex="org.apache.nutch.indexer.IndexingJob -D solr.server.url=$1" 
index=org.apache.nutch.indexer.IndexingJob
solrdedup=org.apache.nutch.indexer.solr.SolrDeleteDuplicates
solrclean="org.apache.nutch.indexer.CleaningJob -D solr.server.url=$2 $1"
clean=org.apache.nutch.indexer.CleaningJob
parsechecker=org.apache.nutch.parse.ParserChecker
indexchecker=org.apache.nutch.indexer.IndexingFiltersChecker
plugin=org.apache.nutch.plugin.PluginRepository
webapp=org.apache.nutch.webui.NutchUiServer
nutchserver=org.apache.nutch.api.NutchServer
junit=org.junit.runner.JUnitCore
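This mapping is implemented in the nutch script as a plain case statement that picks a main class before handing off to java. A minimal sketch, with only a few of the commands shown and the java hand-off reduced to a comment:

```shell
# Sketch of the COMMAND -> class dispatch inside bin/nutch (abbreviated).
COMMAND=inject   # example input; in the real script this is $1
case "$COMMAND" in
  inject)   CLASS=org.apache.nutch.crawl.InjectorJob ;;
  generate) CLASS=org.apache.nutch.crawl.GeneratorJob ;;
  fetch)    CLASS=org.apache.nutch.fetcher.FetcherJob ;;
  updatedb) CLASS=org.apache.nutch.crawl.DbUpdaterJob ;;
  *)        CLASS=$COMMAND ;;   # CLASSNAME fallback: run the named class
esac
echo "$CLASS"
# The real script then launches the JVM with this class and the remaining args.
```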

4. Running the crawl Command
Go to the /root/nutch/apache-nutch-2.3/runtime/local directory and run:

# ./bin/crawl ./myUrls/ mycrawl1 http://localhost:8983/solr/ 2

This command is roughly equivalent to the following sequence:

# ./bin/nutch inject ./myUrls/ -crawlId mycrawl1
# ./bin/nutch generate -topN 5 -crawlId mycrawl1
# ./bin/nutch fetch -all -crawlId mycrawl1 -threads 5
# ./bin/nutch parse -all -crawlId mycrawl1
# ./bin/nutch updatedb -all -crawlId mycrawl1
# ./bin/nutch solrindex http://localhost:8983/solr/ -all -crawlId mycrawl1
# ./bin/nutch solrdedup http://localhost:8983/solr

You can run the steps above one at a time and inspect the mycrawl1_webpage table in HBase after each step.
Go to the HBase installation directory and start the shell with # ./bin/hbase shell:

# ./bin/hbase shell
HBase Shell; enter 'help<RETURN>' for list of supported commands.
Type "exit<RETURN>" to leave the HBase Shell
Version 0.94.14, r1543222, Mon Nov 18 23:23:33 UTC 2013

hbase(main):001:0> list
TABLE
mycrawl1_webpage
1 row(s) in 0.5500 seconds
hbase(main):002:0> describe 'mycrawl1_webpage'
DESCRIPTION                                                                                                                     ENABLED
 'mycrawl1_webpage', {NAME => 'f', DATA_BLOCK_ENCODING => 'NONE', BLOOMFILTER => 'NONE', REPLICATION_SCOPE => '0', VERSIONS =>  true
 '1', COMPRESSION => 'NONE', MIN_VERSIONS => '0', TTL => '2147483647', KEEP_DELETED_CELLS => 'false', BLOCKSIZE => '65536', IN_
 MEMORY => 'false', ENCODE_ON_DISK => 'true', BLOCKCACHE => 'true'}, {NAME => 'h', DATA_BLOCK_ENCODING => 'NONE', BLOOMFILTER =
 > 'NONE', REPLICATION_SCOPE => '0', VERSIONS => '1', COMPRESSION => 'NONE', MIN_VERSIONS => '0', TTL => '2147483647', KEEP_DEL
 ETED_CELLS => 'false', BLOCKSIZE => '65536', IN_MEMORY => 'false', ENCODE_ON_DISK => 'true', BLOCKCACHE => 'true'}, {NAME => '
 il', DATA_BLOCK_ENCODING => 'NONE', BLOOMFILTER => 'NONE', REPLICATION_SCOPE => '0', VERSIONS => '1', COMPRESSION => 'NONE', M
 IN_VERSIONS => '0', TTL => '2147483647', KEEP_DELETED_CELLS => 'false', BLOCKSIZE => '65536', IN_MEMORY => 'false', ENCODE_ON_
 DISK => 'true', BLOCKCACHE => 'true'}, {NAME => 'mk', DATA_BLOCK_ENCODING => 'NONE', BLOOMFILTER => 'NONE', REPLICATION_SCOPE
 => '0', VERSIONS => '1', COMPRESSION => 'NONE', MIN_VERSIONS => '0', TTL => '2147483647', KEEP_DELETED_CELLS => 'false', BLOCK
 SIZE => '65536', IN_MEMORY => 'false', ENCODE_ON_DISK => 'true', BLOCKCACHE => 'true'}, {NAME => 'mtdt', DATA_BLOCK_ENCODING =
 > 'NONE', BLOOMFILTER => 'NONE', REPLICATION_SCOPE => '0', VERSIONS => '1', COMPRESSION => 'NONE', MIN_VERSIONS => '0', TTL =>
  '2147483647', KEEP_DELETED_CELLS => 'false', BLOCKSIZE => '65536', IN_MEMORY => 'false', ENCODE_ON_DISK => 'true', BLOCKCACHE
  => 'true'}, {NAME => 'ol', DATA_BLOCK_ENCODING => 'NONE', BLOOMFILTER => 'NONE', REPLICATION_SCOPE => '0', VERSIONS => '1', C
 OMPRESSION => 'NONE', MIN_VERSIONS => '0', TTL => '2147483647', KEEP_DELETED_CELLS => 'false', BLOCKSIZE => '65536', IN_MEMORY
  => 'false', ENCODE_ON_DISK => 'true', BLOCKCACHE => 'true'}, {NAME => 'p', DATA_BLOCK_ENCODING => 'NONE', BLOOMFILTER => 'NON
 E', REPLICATION_SCOPE => '0', VERSIONS => '1', COMPRESSION => 'NONE', MIN_VERSIONS => '0', TTL => '2147483647', KEEP_DELETED_C
 ELLS => 'false', BLOCKSIZE => '65536', IN_MEMORY => 'false', ENCODE_ON_DISK => 'true', BLOCKCACHE => 'true'}, {NAME => 's', DA
 TA_BLOCK_ENCODING => 'NONE', BLOOMFILTER => 'NONE', REPLICATION_SCOPE => '0', VERSIONS => '1', COMPRESSION => 'NONE', MIN_VERS
 IONS => '0', TTL => '2147483647', KEEP_DELETED_CELLS => 'false', BLOCKSIZE => '65536', IN_MEMORY => 'false', ENCODE_ON_DISK =>
  'true', BLOCKCACHE => 'true'}
1 row(s) in 0.1820 seconds

hbase(main):003:0> scan 'mycrawl1_webpage'
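The full scan of mycrawl1_webpage can be very long; limiting the row count with the HBase shell's LIMIT option keeps the output manageable, and get retrieves a single row. Note that Nutch 2.x stores row keys as URLs with the host portion reversed, so the key below is only an illustration for http://nutch.apache.org/:

```
hbase(main):004:0> scan 'mycrawl1_webpage', {LIMIT => 2}
hbase(main):005:0> get 'mycrawl1_webpage', 'org.apache.nutch:http/'
```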