Reposted from: http://blog.sina.com.cn/s/blog_6cc084c90100nf39.html
Reading the other groups' blogs these days, I noticed that everyone's Heritrix crawls take quite a long time, usually several days to finish, so the tool truly lives up to the name "crawler". An important reason for the slowness is that Heritrix usually runs only a single active thread during a crawl. Searching online for the cause, I learned that by default Heritrix uses HostnameQueueAssignmentPolicy to generate queue keys, and because that policy uses the hostname as the key, every link under the same domain is placed in the same queue and handled by a single thread. If the policy Heritrix uses to assign URIs is changed, using the ELF hash algorithm to spread the URLs as evenly as possible across the queues, many threads can crawl pages of one domain in parallel and the crawl speed improves dramatically.
1. Create a new class ELFHashQueueAssignmentPolicy under org.archive.crawler.frontier; note that it must extend QueueAssignmentPolicy.
package org.archive.crawler.frontier;

import java.util.logging.Logger;
import org.archive.crawler.datamodel.CandidateURI;
import org.archive.crawler.framework.CrawlController;

public class ELFHashQueueAssignmentPolicy extends QueueAssignmentPolicy {
    private static final Logger logger = Logger
            .getLogger(ELFHashQueueAssignmentPolicy.class.getName());

    // Key on an ELF hash of the whole URI, spreading one host's URIs over up to 100 queues.
    public String getClassKey(CrawlController controller,
            CandidateURI cauri) {
        String uri = cauri.getUURI().toString();
        long hash = ELFHash(uri);
        String a = Long.toString(hash % 100);
        return a;
    }

    // Classic ELF string hash.
    public long ELFHash(String str) {
        long hash = 0;
        long x = 0;
        for (int i = 0; i < str.length(); i++) {
            hash = (hash << 4) + str.charAt(i);
            if ((x = hash & 0xF0000000L) != 0) {
                hash ^= (x >> 24);
                hash &= ~x;
            }
        }
        return (hash & 0x7FFFFFFF);
    }
}
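To see how the keys spread out, here is a small, self-contained demo (not part of the patch; the sample URLs are made up, and the ELFHash copy below exists only so the snippet runs without the Heritrix classpath):

public class ELFHashDemo {
    // Same ELF hash as in the policy above.
    static long elfHash(String str) {
        long hash = 0, x = 0;
        for (int i = 0; i < str.length(); i++) {
            hash = (hash << 4) + str.charAt(i);
            if ((x = hash & 0xF0000000L) != 0) {
                hash ^= (x >> 24);
                hash &= ~x;
            }
        }
        return hash & 0x7FFFFFFF;
    }

    public static void main(String[] args) {
        String[] uris = {
            "http://foo.org/bar/index.html",
            "http://foo.org/bar/news/1.html",
            "http://foo.org/bar/news/2.html",
            "http://foo.org/bar/about.html"
        };
        for (String u : uris) {
            // HostnameQueueAssignmentPolicy would key all four URIs as
            // "foo.org"; the ELF hash spreads them over many queues.
            System.out.println(u + " -> queue " + (elfHash(u) % 100));
        }
    }
}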
2. In the constructor of org.archive.crawler.frontier.AbstractFrontier, find the statement that builds the default list of queue-assignment policies and add the new class to it, for instance as the first entry:

String queueStr = System.getProperty(AbstractFrontier.class.getName() +
        "." + ATTR_QUEUE_ASSIGNMENT_POLICY,
        ELFHashQueueAssignmentPolicy.class.getName() + " " +
        HostnameQueueAssignmentPolicy.class.getName() + " " +
        IPQueueAssignmentPolicy.class.getName() + " " +
        BucketQueueAssignmentPolicy.class.getName() + " " +
        SurtAuthorityQueueAssignmentPolicy.class.getName());

3. Register the new policy in conf/heritrix.properties as well, by adding it to the front of the queue-assignment-policy list:
#############################################################################
# F R O N T I E R
#############################################################################
# List here all queue assignment policies you'd have show as a
# queue-assignment-policy choice in AbstractFrontier derived Frontiers
# (e.g. BdbFrontier).
org.archive.crawler.frontier.AbstractFrontier.queue-assignment-policy = \
org.archive.crawler.frontier.ELFHashQueueAssignmentPolicy \
org.archive.crawler.frontier.HostnameQueueAssignmentPolicy \
org.archive.crawler.frontier.IPQueueAssignmentPolicy \
org.archive.crawler.frontier.BucketQueueAssignmentPolicy \
org.archive.crawler.frontier.SurtAuthorityQueueAssignmentPolicy
org.archive.crawler.frontier.BdbFrontier.level = INFO
For reference, the Javadoc of QueueAssignmentPolicy spells out the contract that getClassKey() has to satisfy:

/**
 * Establishes a mapping from CrawlURIs to String keys (queue names).
 */

/**
 * Get the String key (name) of the queue to which the
 * CrawlURI should be assigned.
 *
 * Note that changes to the CrawlURI, or its associated
 * components (such as CrawlServer), may change its queue
 * assignment.
 */
The figure above shows that the crawled content includes some file types we have no use for, such as pdf, jpeg and so on. How do we make Heritrix fetch only a specific kind of object, for example only HTML pages? Section A.3 of the official "Heritrix User Manual" gives a solution:
1) You would first need to create a job with the single seed http://foo.org/bar/. You'll need to add the MirrorWriterProcessor on the Modules screen and delete the ARCWriterProcessor. This will store your files in a directory structure that matches the crawled URIs, and the files will be stored in the crawl job's mirror directory.
2) Your job should use the DecidingScope with the set of DecideRules listed in the manual.
We are using the NotMatchesFilePatternDecideRule so we can eliminate crawling any URIs that don't end with .html. It's important that this DecideRule be placed immediately before PrerequisiteAcceptDecideRule; otherwise the DNS and robots.txt prerequisites will be rejected since they won't match the regexp.
3) On the Settings screen, you'll want to set the following for the NotMatchesFilePatternDecideRule:
decision: REJECT
use-preset-pattern: CUSTOM
regexp: .*(/|\.html)$
For this particular crawl the pattern was broadened so that .htm, .xml and .asp pages (including .asp URIs that carry a query string) are kept as well:

(.*(/|\.(html|htm|xml|asp))$)|(.*\.asp\?.*)
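As a quick sanity check, the two patterns can be tried against a few sample URIs with java.util.regex (a throwaway snippet; the URIs are made up and nothing here is Heritrix-specific):

import java.util.regex.Pattern;

public class RegexpCheck {
    public static void main(String[] args) {
        // Pattern from the manual: keep directory URIs and .html pages.
        Pattern htmlOnly = Pattern.compile(".*(/|\\.html)$");
        // Broadened pattern: also keep .htm, .xml and .asp pages,
        // including .asp URIs that carry a query string.
        Pattern extended =
            Pattern.compile("(.*(/|\\.(html|htm|xml|asp))$)|(.*\\.asp\\?.*)");

        String[] uris = {
            "http://foo.org/bar/",
            "http://foo.org/bar/page.html",
            "http://foo.org/bar/list.asp?id=3",
            "http://foo.org/bar/logo.jpeg"
        };
        for (String u : uris) {
            System.out.printf("%-40s htmlOnly=%-6b extended=%b%n",
                u, htmlOnly.matcher(u).matches(), extended.matcher(u).matches());
        }
    }
}

Remember that NotMatchesFilePatternDecideRule applies its REJECT decision to URIs that do not match, so a URI that matches the pattern is the one that gets kept.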
robots.txt is a file meant specifically for search-engine crawlers. When building a website, the author can create a plain-text file named robots.txt on the site and declare in it the parts of the site that robots should not visit; in this way part or all of the site's content can be kept out of search engines, or only the specified content gets indexed. Because most websites do not actually provide a robots.txt file for crawlers to read, Heritrix wastes a considerable amount of time while crawling just checking whether the file exists, which lengthens the crawl. Fortunately the protocol is merely an optional convention, and a crawler is free not to honour it.
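For reference, a robots.txt file is nothing more than a plain-text list of rules placed at the root of a site; a typical, made-up example looks like this:

# hypothetical http://foo.org/robots.txt
User-agent: *
Disallow: /private/
Disallow: /tmp/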
In Heritrix, robots.txt is handled inside the PreconditionEnforcer processor. PreconditionEnforcer is a pre-fetch processor: whenever a URI is about to be processed it checks whether any preconditions have to be satisfied first, and fetching robots.txt is exactly one of those preconditions. PreconditionEnforcer contains a private method declared as: private boolean considerRobotsPreconditions(CrawlURI curi). Its purpose is to determine, before the URI passed as the parameter is fetched, whether a precondition imposed by robots.txt still needs to be handled. A return value of true means robots.txt must still be considered; false means it can be ignored and the URI may be passed on to the processors that follow. The simplest modification is therefore to comment out the entire body of this method and simply return false, as sketched below.
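A minimal sketch of that change, assuming Heritrix 1.x where PreconditionEnforcer lives in org.archive.crawler.prefetch (only this one method is touched):

// In org.archive.crawler.prefetch.PreconditionEnforcer:
private boolean considerRobotsPreconditions(CrawlURI curi) {
    // The original body, which may schedule a fetch of robots.txt and
    // apply the robots exclusion policy, is commented out entirely.
    // Returning false reports that no robots precondition exists, so
    // every URI goes straight on to the next processor and robots.txt
    // is never requested or honoured.
    return false;
}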