Analysis of the Nutch open-source search engine's crawl log, with notes on its working directories

After reading the Nutch crawl source code, I walked through a crawl log to get familiar with the whole fetch, parse, and index process. Nutch implements every phase of this process as Hadoop MapReduce jobs.
Nutch is also a good vehicle for studying Hadoop programming in depth, since the code is fairly representative; I'll blog about that once I've finished digging into it.
The crawl run is controlled by the parameters in nutch-default.xml; in addition, crawl-urlfilter.txt and nutch-site.xml have to be edited for the crawl to work. I covered this in earlier posts, so I won't repeat it here.
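For reference, a minimal version of those two files might look like the sketch below; MY.DOMAIN.NAME and the MySpider agent name are placeholders to replace with your own values.

# crawl-urlfilter.txt: accept only urls under your own domain, drop everything else
+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
-.

<!-- nutch-site.xml: http.agent.name must be set so the fetcher can identify itself -->
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>MySpider</value>
  </property>
</configuration>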
Here is the command I used to crawl:
> bin/nutch crawl urls -dir crawl10 -depth 10 -threads 10 >& nohup.out
crawl started in: crawl10 // the name of the crawler's working directory
rootUrlDir = urls // the file or directory listing the seed URLs to fetch
threads = 10 // 10 fetcher threads
depth = 10 // crawl depth of 10 levels
Injector: starting // inject the seed URL list
Injector: crawlDb: crawl10/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries. // build the database of URLs to fetch from the injected list
Injector: Merging injected urls into crawl db. // merge them into the existing crawl db
Injector: done
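The crawl command is just a wrapper around Nutch's individual tools, so each phase can also be run by hand. A sketch of the standalone equivalent of this inject phase, assuming the same urls list and crawl10 directory:

> bin/nutch inject crawl10/crawldb urls
> bin/nutch readdb crawl10/crawldb -stats

The readdb -stats call is a quick way to confirm how many entries actually landed in the crawl db.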
Generator: Selecting best-scoring urls due for fetch. // rank pages by importance to decide fetch order
Generator: starting
Generator: segment: crawl10/segments/20080904102201 // create the segment that will store this round's fetch output
Generator: filtering: false
Generator: topN: 2147483647 // no topN was specified, so Nutch uses the default value
Generator: Partitioning selected urls by host, for politeness. // split the fetch list by host, distributing it across the nodes defined in Hadoop's slaves file
Generator: done.
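2147483647 is Integer.MAX_VALUE, i.e. effectively unlimited. To cap how many pages are fetched per level, pass -topN to the crawl command, for example:

> bin/nutch crawl urls -dir crawl10 -depth 10 -topN 1000 -threads 10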
Fetcher: starting
Fetcher: segment: crawl10/segments/20080904102201 // fetch the page content into this segment
Fetcher: done
CrawlDb update: starting // after fetching, update the crawl db with the newly discovered URLs
CrawlDb update: db: crawl10/crawldb
CrawlDb update: segments: [crawl10/segments/20080904102201]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: true
CrawlDb update: URL filtering: true
CrawlDb update: Merging segment data into db.
CrawlDb update: done
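This generate/fetch/updatedb cycle can also be driven manually, which is handy when debugging a single level. A sketch assuming local mode (on HDFS you would list the newest segment with bin/hadoop dfs -ls instead):

> bin/nutch generate crawl10/crawldb crawl10/segments
> s=`ls -d crawl10/segments/2* | tail -1`
> bin/nutch fetch $s -threads 10
> bin/nutch updatedb crawl10/crawldb $s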
// the fetch cycle repeats
Generator: Selecting best-scoring urls due for fetch.
Generator: starting
Generator: segment: crawl10/segments/20080904102453
Generator: filtering: false
Generator: topN: 2147483647
Generator: Partitioning selected urls by host, for politeness.
Generator: done.
Fetcher: starting
Fetcher: segment: crawl10/segments/20080904102453
Fetcher: done
CrawlDb update: starting
CrawlDb update: db: crawl10/crawldb
CrawlDb update: segments: [crawl10/segments/20080904102453]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: true
CrawlDb update: URL filtering: true
CrawlDb update: Merging segment data into db.
CrawlDb update: done
...... // the cycle runs 10 times in all; Nutch's intranet mode crawls breadth-first, fetching all second-level pages before moving on to third-level pages.
LinkDb: starting // analyze the link relations between pages
LinkDb: linkdb: crawl10/linkdb
LinkDb: URL normalize: true // URL normalization
LinkDb: URL filter: true // filtering according to crawl-urlfilter.txt
LinkDb: adding segment: /user/nutch/crawl10/segments/20080904102201
LinkDb: adding segment: /user/nutch/crawl10/segments/20080904102453
LinkDb: adding segment: /user/nutch/crawl10/segments/20080904102841
LinkDb: adding segment: /user/nutch/crawl10/segments/20080904104322
LinkDb: adding segment: /user/nutch/crawl10/segments/20080904113511
LinkDb: adding segment: /user/nutch/crawl10/segments/20080904132510
LinkDb: adding segment: /user/nutch/crawl10/segments/20080904153615
LinkDb: adding segment: /user/nutch/crawl10/segments/20080904175052
LinkDb: adding segment: /user/nutch/crawl10/segments/20080904194724
LinkDb: adding segment: /user/nutch/crawl10/segments/20080904211956
LinkDb: done // link analysis finished
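The standalone equivalent is the invertlinks tool; afterwards readlinkdb can dump the inverted link graph so you can see the incoming links of each URL (linkdump is an arbitrary output directory name):

> bin/nutch invertlinks crawl10/linkdb -dir crawl10/segments
> bin/nutch readlinkdb crawl10/linkdb -dump linkdump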
Indexer: starting // start building the index
Indexer: linkdb: crawl10/linkdb
Indexer: adding segment: /user/nutch/crawl10/segments/20080904102201
Indexer: adding segment: /user/nutch/crawl10/segments/20080904102453
Indexer: adding segment: /user/nutch/crawl10/segments/20080904102841
Indexer: adding segment: /user/nutch/crawl10/segments/20080904104322
Indexer: adding segment: /user/nutch/crawl10/segments/20080904113511
Indexer: adding segment: /user/nutch/crawl10/segments/20080904132510
Indexer: adding segment: /user/nutch/crawl10/segments/20080904153615
Indexer: adding segment: /user/nutch/crawl10/segments/20080904175052
Indexer: adding segment: /user/nutch/crawl10/segments/20080904194724
Indexer: adding segment: /user/nutch/crawl10/segments/20080904211956
Indexer: done // indexing finished
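Run standalone, the indexer takes the crawldb, the linkdb, and all segments as input; the linkdb contributes the anchor texts that end up in the index:

> bin/nutch index crawl10/indexes crawl10/crawldb crawl10/linkdb crawl10/segments/*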
Dedup: starting // deduplicate pages
Dedup: adding indexes in: crawl10/indexes
Dedup: done
merging indexes to: crawl10/index // merge the indexes
Adding /user/nutch/crawl10/indexes/part-00000
Adding /user/nutch/crawl10/indexes/part-00001
Adding /user/nutch/crawl10/indexes/part-00002
Adding /user/nutch/crawl10/indexes/part-00003
done merging // merge finished
crawl finished: crawl10 // seed injection, fetch loop, link analysis, indexing, dedup, merge: all done
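The last two steps have standalone forms as well: dedup removes duplicate documents from the part indexes, and merge combines them into the single index directory (depending on the Nutch version, merge may want crawl10/indexes/* rather than the directory itself):

> bin/nutch dedup crawl10/indexes
> bin/nutch merge crawl10/index crawl10/indexes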
The crawler automatically created the crawl10 directory under the user's home directory (/user/nutch); inside it you can see the crawldb, segments, index, indexes, and linkdb directories.
1) crawldb holds the downloaded URLs together with their fetch dates, used when checking whether a page is due for re-fetching.
2) linkdb holds the link relations between URLs; it is built by the analysis step after fetching, and these relations could support a Google-style PageRank function.
3) segments stores the fetched pages. The number of subdirectories matches the number of crawl levels; since I specified -depth 10, there are 10 of them (see the dump commands after this list for a way to inspect them).
Each segment contains six subdirectories:
content: the raw content of the downloaded pages
crawl_fetch: the fetch status of each URL
crawl_generate: the set of URLs due to be fetched, produced by the generate job and extended by ongoing analysis during fetching
crawl_parse: the outlink data used to update the crawldb
parse_data: the outlinks and metadata parsed from each URL
parse_text: the plain-text content extracted from each parsed URL
4) index holds a Lucene-format index, the complete result of merging all the content in indexes. The file names here differ from the ones the Lucene demo produces; that still needs further study.
5) indexes holds the partial indexes from each run, part-00000 through part-00003.
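To dig into these directories, the reader tools can dump each database and segment into plain text; a sketch, where dbdump and segdump are arbitrary output directory names:

> bin/nutch readdb crawl10/crawldb -dump dbdump
> bin/nutch readseg -dump crawl10/segments/20080904102201 segdump

readseg -dump writes out the segment's content, fetch status, and parse data side by side, which makes the six subdirectories above much easier to understand.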
With the crawl flow and the working directories now clear, the next step is to read the Nutch source code and work out the role and format of the files under each directory. To be continued!