Reposted from: http://blog.sina.com.cn/s/blog_6cc084c90100nf39.html
These past few days I have been reading the other teams' blogs, and noticed that everyone's Heritrix crawls take quite a long time, usually several days to finish: a "crawler" in the most literal sense... An important reason it is this slow is that Heritrix generally runs only one thread while crawling. Searching online turned up the cause: by default, Heritrix uses HostnameQueueAssignmentPolicy to generate queue keys, and this policy uses the hostname as the key, so every link under one domain is placed into the same queue and handled by a single thread. If we improve the policy Heritrix uses when assigning URIs, using the ELF hash algorithm to spread the URLs as evenly as possible across the queues, then many threads can fetch pages under the same domain at once, and the crawl speed improves dramatically.
1. Create a new class named ELFHashQueueAssignmentPolicy under org.archive.crawler.frontier; note that it must extend QueueAssignmentPolicy.
package org.archive.crawler.frontier;

import java.util.logging.Logger;

import org.archive.crawler.datamodel.CandidateURI;
import org.archive.crawler.framework.CrawlController;

public class ELFHashQueueAssignmentPolicy extends QueueAssignmentPolicy {

    private static final Logger logger = Logger
            .getLogger(ELFHashQueueAssignmentPolicy.class.getName());

    public String getClassKey(CrawlController controller, CandidateURI cauri) {
        String uri = cauri.getUURI().toString();
        long hash = ELFHash(uri);
        // Use the hash modulo 100 as the queue name, i.e. spread URIs
        // over (at most) 100 queues regardless of hostname.
        return Long.toString(hash % 100);
    }

    public long ELFHash(String str) {
        long hash = 0;
        long x = 0;
        for (int i = 0; i < str.length(); i++) {
            hash = (hash << 4) + str.charAt(i);
            if ((x = hash & 0xF0000000L) != 0) {
                hash ^= (x >> 24);
                hash &= ~x;
            }
        }
        return (hash & 0x7FFFFFFF);
    }
}
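To see the spread, a throwaway main method can be added to the class (the URLs are made-up examples); links from the same host typically no longer share a queue:

public static void main(String[] args) {
    ELFHashQueueAssignmentPolicy p = new ELFHashQueueAssignmentPolicy();
    String[] uris = {
        "http://foo.org/bar/a.html",
        "http://foo.org/bar/b.html",
        "http://foo.org/baz/c.html"
    };
    for (String u : uris) {
        // Same hostname, but the hash spreads them over queues 0..99.
        System.out.println(u + " -> queue " + (p.ELFHash(u) % 100));
    }
}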
2. In org.archive.crawler.frontier.AbstractFrontier, register the new policy in the default list of selectable policies (queueStr supplies the default value of the queue-assignment-policy setting), matching the list configured in heritrix.properties below:

String queueStr = System.getProperty(AbstractFrontier.class.getName() +
    "." + ATTR_QUEUE_ASSIGNMENT_POLICY,
    ELFHashQueueAssignmentPolicy.class.getName() + " " +
    HostnameQueueAssignmentPolicy.class.getName() + " " +
    BucketQueueAssignmentPolicy.class.getName() + " " +
    IPQueueAssignmentPolicy.class.getName() + " " +
    SurtAuthorityQueueAssignmentPolicy.class.getName());
3. Also declare the new class in heritrix.properties so it appears as a queue-assignment-policy choice:

#############################################################################
# F R O N T I E R
#############################################################################
# List here all queue assignment policies you'd have show as a
# queue-assignment-policy choice in AbstractFrontier derived Frontiers
# (e.g. BdbFrontier).
org.archive.crawler.frontier.AbstractFrontier.queue-assignment-policy = \
org.archive.crawler.frontier.ELFHashQueueAssignmentPolicy \
org.archive.crawler.frontier.HostnameQueueAssignmentPolicy \
org.archive.crawler.frontier.IPQueueAssignmentPolicy \
org.archive.crawler.frontier.BucketQueueAssignmentPolicy \
org.archive.crawler.frontier.SurtAuthorityQueueAssignmentPolicy

org.archive.crawler.frontier.BdbFrontier.level = INFO
For reference, the Javadoc of QueueAssignmentPolicy and its getClassKey method states the contract our subclass fulfills:

/**
 * Establishes a mapping from CrawlURIs to String keys (queue names).
 */

/**
 * Get the String key (name) of the queue to which the
 * CrawlURI should be assigned.
 *
 * Note that changes to the CrawlURI, or its associated
 * components (such as CrawlServer), may change its queue
 * assignment.
 */
As the figure above shows, the crawl picks up a number of file types we have no use for, such as PDF and JPEG. So how do we make Heritrix fetch only specific content, for example only HTML pages? Section A.3 of the official "Heritrix User Manual" gives a solution:
1) You would first need to create a job with the single seed http://foo.org/bar/. You'll need to add the MirrorWriterProcessor on the Modules screen and delete the ARCWriterProcessor. This will store your files in a directory structure that matches the crawled URIs, and the files will be stored in the crawl job's mirror directory.
2) Your job should use the DecidingScope with the following set of DecideRules:
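The rule list in the manual's A.3 example is, as best recalled here (the ordering matters, as explained below):

1. RejectDecideRule
2. SurtPrefixedDecideRule
3. TooManyHopsDecideRule
4. PathologicalPathDecideRule
5. TooManyPathSegmentsDecideRule
6. NotMatchesFilePatternDecideRule
7. PrerequisiteAcceptDecideRule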
We are using the NotMatchesFilePatternDecideRule so we can eliminate crawling any URIs that don't end with .html. It's important that this DecideRule be placed immediately before PrerequisiteAcceptDecideRule; otherwise the DNS and robots.txt prerequisites will be rejected since they won't match the regexp.
3) On the Settings screen, you'll want to set the following for the NotMatchesFilePatternDecideRule:
decision: REJECT
use-preset-pattern: CUSTOM
regexp: .*(/|\.html)$
If you also want to keep .htm, .xml, and .asp pages (including .asp URLs that carry query strings), the pattern can be broadened to:

regexp: (.*(/|\.(html|htm|xml|asp))$)|(.*\.asp\?.*)
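A quick standalone check of the broadened pattern (the class name and URLs are made up for illustration):

public class RegexpCheck {
    public static void main(String[] args) {
        String regexp = "(.*(/|\\.(html|htm|xml|asp))$)|(.*\\.asp\\?.*)";
        String[] uris = {
            "http://foo.org/bar/",           // accepted: ends with /
            "http://foo.org/bar/a.html",     // accepted: ends with .html
            "http://foo.org/page.asp?id=1",  // accepted: .asp plus query string
            "http://foo.org/img/logo.jpg"    // rejected: no rule matches
        };
        for (String u : uris) {
            System.out.println(u + " -> " + u.matches(regexp));
        }
    }
}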
robots.txt is a file intended specifically for search-engine crawlers. When building a website, an author who wants the site's content indexed by search engines can place a plain-text file named robots.txt on the site, declaring which parts of the site robots should not visit. That way, part or all of the site can be kept out of search engines, or the engines can be directed to index only the specified content. Because most websites do not actually provide a robots.txt file for crawlers to read, Heritrix spends an excessive amount of time while crawling just checking whether the file exists, which lengthens the crawl. Fortunately, the protocol is merely advisory, so nothing stops us from ignoring it.
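For reference, a typical robots.txt (a made-up example) declares the off-limits paths like this:

User-agent: *
Disallow: /private/
Disallow: /tmp/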
In Heritrix, robots.txt handling is located in the PreconditionEnforcer processor. PreconditionEnforcer is a prefetch processor: before a link is fetched, it always checks whether any preconditions must be satisfied first, and fetching robots.txt is exactly one of them. PreconditionEnforcer contains a private method declared as: private boolean considerRobotsPreconditions(CrawlURI curi). Its meaning is: before the link given as the parameter is crawled, check whether there is a precondition imposed by robots.txt. Returning true means robots.txt still has to be considered; returning false means it does not, and the link can be passed straight on to the processors further down the chain. The simplest modification, then, is to comment out the entire method body and just return false.
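A minimal sketch of that change, in org.archive.crawler.prefetch.PreconditionEnforcer (the original body is simply commented out or deleted):

private boolean considerRobotsPreconditions(CrawlURI curi) {
    // Original robots.txt logic removed: never treat robots.txt as a
    // precondition, so every URI goes straight on to be fetched.
    return false;
}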