Analysis of the crawl log and working directories of the Nutch open-source search engine

After reading through the Nutch source code related to crawling, I analyzed a crawl log to get familiar with the whole fetch, parse, and index process. Nutch implements every stage of this process with Hadoop MapReduce.
Nutch is therefore also a good way to study Hadoop programming in depth, since the code is quite well written. I will blog about that part once I have finished studying it.

The crawl run is controlled by the parameters in nutch-default.xml; in addition, crawl-urlfilter.txt and nutch-site.xml need to be modified for it to work. I covered this in earlier posts, so I will not repeat it here.
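
For reference, a minimal sketch of those two extra files (the domain and agent name below are placeholders, not the values from my actual setup):

crawl-urlfilter.txt -- accept only URLs under the target site, reject everything else:
+^http://([a-z0-9]*\.)*example.com/
-.

nutch-site.xml -- Nutch requires at least an agent name that identifies the crawler to web servers:
<property>
  <name>http.agent.name</name>
  <value>my-test-spider</value>
</property>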

Below is the command I used to run the crawl:
> bin/nutch crawl urls -dir crawl10 -depth 10 -threads 10 >& nohup.out 

crawl started in: crawl10                               //the directory this crawl writes to
rootUrlDir = urls //the file or directory containing the seed URLs to fetch
threads = 10 //10 fetcher threads
depth = 10 //crawl depth of 10 levels
Injector: starting //inject the seed URL list
Injector: crawlDb: crawl10/crawldb 
Injector: urlDir: urls 
Injector: Converting injected urls to crawl db entries. //build the database of URLs to fetch from the injected list
Injector: Merging injected urls into crawl db. //merge them into the crawl db
Injector: done 
Generator: Selecting best-scoring urls due for fetch. //rank pages by score to decide the fetch order
Generator: starting 
Generator: segment: crawl10/segments/20080904102201 //create the segment that will store the fetch results
Generator: filtering: false 
Generator: topN: 2147483647 //topN was not specified, so Nutch falls back to the default value
Generator: Partitioning selected urls by host, for politeness.  //the fetch list is split up and assigned to the datanodes defined in Hadoop's slaves configuration file
Generator: done. 
Fetcher: starting 
Fetcher: segment: crawl10/segments/20080904102201 //fetch the page contents into this segment
Fetcher: done 
CrawlDb update: starting //after fetching, update the crawl db with the newly discovered URLs
CrawlDb update: db: crawl10/crawldb 
CrawlDb update: segments: [crawl10/segments/20080904102201] 
CrawlDb update: additions allowed: true 
CrawlDb update: URL normalizing: true 
CrawlDb update: URL filtering: true 
CrawlDb update: Merging segment data into db. 
CrawlDb update: done 

//the fetch cycle repeats
Generator: Selecting best-scoring urls due for fetch. 
Generator: starting 
Generator: segment: crawl10/segments/20080904102453 
Generator: filtering: false 
Generator: topN: 2147483647 
Generator: Partitioning selected urls by host, for politeness. 
Generator: done. 
Fetcher: starting 
Fetcher: segment: crawl10/segments/20080904102453 
Fetcher: done 
CrawlDb update: starting 
CrawlDb update: db: crawl10/crawldb 
CrawlDb update: segments: [crawl10/segments/20080904102453] 
CrawlDb update: additions allowed: true 
CrawlDb update: URL normalizing: true 
CrawlDb update: URL filtering: true 
CrawlDb update: Merging segment data into db. 
CrawlDb update: done 

...... //the loop runs 10 times in total; in intranet mode Nutch crawls breadth-first, finishing all second-level pages before moving on to the third level
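
Incidentally, the crawl command simply bundles this loop; it can also be run round by round with the individual Nutch tools. A rough sketch of one round, using the same crawl10 layout (the topN value is only an example, exact options may differ between Nutch versions, and on HDFS you would pick the segment path by hand instead of with ls):

> bin/nutch generate crawl10/crawldb crawl10/segments -topN 1000
> s=`ls -d crawl10/segments/2* | tail -1`
> bin/nutch fetch $s
> bin/nutch updatedb crawl10/crawldb $s

After the last round, the log continues with link analysis: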

LinkDb: starting //analyze the link relationships between pages
LinkDb: linkdb: crawl10/linkdb 
LinkDb: URL normalize: true //normalization
LinkDb: URL filter: true //filtered according to crawl-urlfilter.txt
LinkDb: adding segment: /user/nutch/crawl10/segments/20080904102201
LinkDb: adding segment: /user/nutch/crawl10/segments/20080904102453 
LinkDb: adding segment: /user/nutch/crawl10/segments/20080904102841 
LinkDb: adding segment: /user/nutch/crawl10/segments/20080904104322 
LinkDb: adding segment: /user/nutch/crawl10/segments/20080904113511 
LinkDb: adding segment: /user/nutch/crawl10/segments/20080904132510 
LinkDb: adding segment: /user/nutch/crawl10/segments/20080904153615 
LinkDb: adding segment: /user/nutch/crawl10/segments/20080904175052 
LinkDb: adding segment: /user/nutch/crawl10/segments/20080904194724 
LinkDb: adding segment: /user/nutch/crawl10/segments/20080904211956 
LinkDb: done //link analysis finished
Indexer: starting //start building the index
Indexer: linkdb: crawl10/linkdb 
Indexer: adding segment: /user/nutch/crawl10/segments/20080904102201 
Indexer: adding segment: /user/nutch/crawl10/segments/20080904102453 
Indexer: adding segment: /user/nutch/crawl10/segments/20080904102841 
Indexer: adding segment: /user/nutch/crawl10/segments/20080904104322 
Indexer: adding segment: /user/nutch/crawl10/segments/20080904113511 
Indexer: adding segment: /user/nutch/crawl10/segments/20080904132510 
Indexer: adding segment: /user/nutch/crawl10/segments/20080904153615 
Indexer: adding segment: /user/nutch/crawl10/segments/20080904175052 
Indexer: adding segment: /user/nutch/crawl10/segments/20080904194724 
Indexer: adding segment: /user/nutch/crawl10/segments/20080904211956 
Indexer: done //index built
Dedup: starting //remove duplicate pages
Dedup: adding indexes in: crawl10/indexes
Dedup: done 
merging indexes to: crawl10/index //merge the indexes
Adding /user/nutch/crawl10/indexes/part-00000 
Adding /user/nutch/crawl10/indexes/part-00001 
Adding /user/nutch/crawl10/indexes/part-00002 
Adding /user/nutch/crawl10/indexes/part-00003 
done merging //merge finished
crawl finished: crawl10 //seed injection, fetch loop, link analysis, indexing, dedup, merge -- all done
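
The stages after the fetch loop can be run individually as well; roughly, this is what crawl does once fetching is finished (same crawl10 layout; on a Hadoop/HDFS run you would spell out the segment paths instead of using the shell glob, and options may vary by Nutch version):

> bin/nutch invertlinks crawl10/linkdb -dir crawl10/segments
> bin/nutch index crawl10/indexes crawl10/crawldb crawl10/linkdb crawl10/segments/*
> bin/nutch dedup crawl10/indexes
> bin/nutch merge crawl10/index crawl10/indexes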


The crawl automatically created the crawl10 directory under the user's home directory (/user/nutch); inside it are the crawldb, segments, index, indexes, and linkdb directories (a few commands for inspecting them follow the list below):
1) crawldb stores the fetched URLs along with their fetch dates, which are used to decide when a page should be re-checked for updates.
2) linkdb stores the link relationships between URLs; it is built during the analysis step after fetching, and these relationships make a Google-style PageRank feature possible.
3) segments stores the fetched pages; the number of subdirectories matches the number of crawl levels. I specified -depth 10, so there are 10 of them here.
  Each segment contains 6 subdirectories:
  content - the raw content of the fetched pages
  crawl_fetch - the fetch status of each URL
  crawl_generate - the set of URLs to be fetched, produced when the generate job runs and extended as links are analyzed during fetching
  crawl_parse - the outlink data used to update the crawldb
  parse_data - the outlinks and metadata parsed from each URL
  parse_text - the extracted text of each parsed URL
4) index is a Lucene-format index directory holding the complete, merged content of everything in indexes. I noticed the index file names here differ from those produced with the Lucene demo; this needs further study.
5) indexes holds the index parts produced by the indexing job, stored as part-00000 through part-00003.
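
A handy way to poke around in these directories is the read tools that ship with Nutch; for example (linkdump and segdump are just example output locations, and the segment name is taken from the log above):

> bin/nutch readdb crawl10/crawldb -stats
> bin/nutch readlinkdb crawl10/linkdb -dump linkdump
> bin/nutch readseg -dump crawl10/segments/20080904102201 segdump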

Now that the crawl workflow and working directories are clear, the next step is to dig further into the Nutch source code and work out the purpose and format of the files under each of these directories. To be continued!
