Reposted from: http://blog.sina.com.cn/s/blog_6cc084c90100nf39.html
These past few days I have been reading the other teams' blogs, and noticed that everyone's Heritrix crawls take quite a long time, usually several days to finish: a "crawler" in the most literal sense... An important reason it is this slow is that Heritrix generally runs only one thread while crawling. Searching online turned up the cause: by default, Heritrix uses HostnameQueueAssignmentPolicy to generate queue keys, and this policy uses the hostname as the key, so every link under one domain is placed into the same queue and handled by a single thread. If we improve the policy Heritrix uses when assigning URIs, using the ELF hash algorithm to spread the URLs as evenly as possible across the queues, then many threads can fetch pages under the same domain at once, and the crawl speed improves dramatically.
1. Create a new class named ELFHashQueueAssignmentPolicy under org.archive.crawler.frontier; note that it must extend QueueAssignmentPolicy.
package org.archive.crawler.frontier;

import java.util.logging.Logger;

import org.archive.crawler.datamodel.CandidateURI;
import org.archive.crawler.framework.CrawlController;

public class ELFHashQueueAssignmentPolicy extends QueueAssignmentPolicy {

    private static final Logger logger = Logger
            .getLogger(ELFHashQueueAssignmentPolicy.class.getName());

    public String getClassKey(CrawlController controller, CandidateURI cauri) {
        String uri = cauri.getUURI().toString();
        long hash = ELFHash(uri);
        // Use the hash modulo 100 as the queue name, i.e. spread URIs
        // over (at most) 100 queues regardless of hostname.
        return Long.toString(hash % 100);
    }

    public long ELFHash(String str) {
        long hash = 0;
        long x = 0;
        for (int i = 0; i < str.length(); i++) {
            hash = (hash << 4) + str.charAt(i);
            if ((x = hash & 0xF0000000L) != 0) {
                hash ^= (x >> 24);
                hash &= ~x;
            }
        }
        return (hash & 0x7FFFFFFF);
    }
}
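To see the spread, a throwaway main method can be added to the class (the URLs are made-up examples); links from the same host typically no longer share a queue:

public static void main(String[] args) {
    ELFHashQueueAssignmentPolicy p = new ELFHashQueueAssignmentPolicy();
    String[] uris = {
        "http://foo.org/bar/a.html",
        "http://foo.org/bar/b.html",
        "http://foo.org/baz/c.html"
    };
    for (String u : uris) {
        // Same hostname, but the hash spreads them over queues 0..99.
        System.out.println(u + " -> queue " + (p.ELFHash(u) % 100));
    }
}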
2. In org.archive.crawler.frontier.AbstractFrontier, register the new policy in the default list of selectable policies (queueStr supplies the default value of the queue-assignment-policy setting), matching the list configured in heritrix.properties below:

String queueStr = System.getProperty(AbstractFrontier.class.getName() +
    "." + ATTR_QUEUE_ASSIGNMENT_POLICY,
    ELFHashQueueAssignmentPolicy.class.getName() + " " +
    HostnameQueueAssignmentPolicy.class.getName() + " " +
    BucketQueueAssignmentPolicy.class.getName() + " " +
    IPQueueAssignmentPolicy.class.getName() + " " +
    SurtAuthorityQueueAssignmentPolicy.class.getName());
3. Also declare the new class in heritrix.properties so it appears as a queue-assignment-policy choice:

#############################################################################
# F R O N T I E R
#############################################################################
# List here all queue assignment policies you'd have show as a
# queue-assignment-policy choice in AbstractFrontier derived Frontiers
# (e.g. BdbFrontier).
org.archive.crawler.frontier.AbstractFrontier.queue-assignment-policy = \
org.archive.crawler.frontier.ELFHashQueueAssignmentPolicy \
org.archive.crawler.frontier.HostnameQueueAssignmentPolicy \
org.archive.crawler.frontier.IPQueueAssignmentPolicy \
org.archive.crawler.frontier.BucketQueueAssignmentPolicy \
org.archive.crawler.frontier.SurtAuthorityQueueAssignmentPolicy

org.archive.crawler.frontier.BdbFrontier.level = INFO
For reference, the Javadoc of QueueAssignmentPolicy and its getClassKey method states the contract our subclass fulfills:

/**
 * Establishes a mapping from CrawlURIs to String keys (queue names).
 */

/**
 * Get the String key (name) of the queue to which the
 * CrawlURI should be assigned.
 *
 * Note that changes to the CrawlURI, or its associated
 * components (such as CrawlServer), may change its queue
 * assignment.
 */
As the figure above shows, the crawl picks up a number of file types we have no use for, such as PDF and JPEG. So how do we make Heritrix fetch only specific content, for example only HTML pages? Section A.3 of the official "Heritrix User Manual" gives a solution:
1) You would first need to create a job with the single seed http://foo.org/bar/. You'll need to add the MirrorWriterProcessor on the Modules screen and delete the ARCWriterProcessor. This will store your files in a directory structure that matches the crawled URIs, and the files will be stored in the crawl job's mirror directory.
2) Your job should use the DecidingScope with the following set of DecideRules:
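The rule list in the manual's A.3 example is, as best recalled here (the ordering matters, as explained below):

1. RejectDecideRule
2. SurtPrefixedDecideRule
3. TooManyHopsDecideRule
4. PathologicalPathDecideRule
5. TooManyPathSegmentsDecideRule
6. NotMatchesFilePatternDecideRule
7. PrerequisiteAcceptDecideRule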
We are using the NotMatchesFilePatternDecideRule so we can eliminate crawling any URIs that don't end with .html. It's important that this DecideRule be placed immediately before PrerequisiteAcceptDecideRule; otherwise the DNS and robots.txt prerequisites will be rejected since they won't match the regexp.
3) On the Settings screen, you'll want to set the following for the NotMatchesFilePatternDecideRule:
decision: REJECT
use-preset-pattern: CUSTOM
regexp: .*(/|\.html)$
If you also want to keep .htm, .xml, and .asp pages (including .asp URLs that carry query strings), the pattern can be broadened to:

regexp: (.*(/|\.(html|htm|xml|asp))$)|(.*\.asp\?.*)
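A quick standalone check of the broadened pattern (the class name and URLs are made up for illustration):

public class RegexpCheck {
    public static void main(String[] args) {
        String regexp = "(.*(/|\\.(html|htm|xml|asp))$)|(.*\\.asp\\?.*)";
        String[] uris = {
            "http://foo.org/bar/",           // accepted: ends with /
            "http://foo.org/bar/a.html",     // accepted: ends with .html
            "http://foo.org/page.asp?id=1",  // accepted: .asp plus query string
            "http://foo.org/img/logo.jpg"    // rejected: no rule matches
        };
        for (String u : uris) {
            System.out.println(u + " -> " + u.matches(regexp));
        }
    }
}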
robots.txt is a file intended specifically for search-engine crawlers. When building a website, an author who wants the site's content indexed by search engines can place a plain-text file named robots.txt on the site, declaring which parts of the site robots should not visit. That way, part or all of the site can be kept out of search engines, or the engines can be directed to index only the specified content. Because most websites do not actually provide a robots.txt file for crawlers to read, Heritrix spends an excessive amount of time while crawling just checking whether the file exists, which lengthens the crawl. Fortunately, the protocol is merely advisory, so nothing stops us from ignoring it.
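For reference, a typical robots.txt (a made-up example) declares the off-limits paths like this:

User-agent: *
Disallow: /private/
Disallow: /tmp/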
In Heritrix, robots.txt handling is located in the PreconditionEnforcer processor. PreconditionEnforcer is a prefetch processor: before a link is fetched, it always checks whether any preconditions must be satisfied first, and fetching robots.txt is exactly one of them. PreconditionEnforcer contains a private method declared as: private boolean considerRobotsPreconditions(CrawlURI curi). Its meaning is: before the link given as the parameter is crawled, check whether there is a precondition imposed by robots.txt. Returning true means robots.txt still has to be considered; returning false means it does not, and the link can be passed straight on to the processors further down the chain. The simplest modification, then, is to comment out the entire method body and just return false.
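A minimal sketch of that change, in org.archive.crawler.prefetch.PreconditionEnforcer (the original body is simply commented out or deleted):

private boolean considerRobotsPreconditions(CrawlURI curi) {
    // Original robots.txt logic removed: never treat robots.txt as a
    // precondition, so every URI goes straight on to be fetched.
    return false;
}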