Nutch 0.9 Study: Whole-web Crawling

Getting Nutch to fetch related links and dynamic content

1. vi conf/crawl-urlfilter.txt

# The default rule that skips query-style URLs, commented out:
#-[?*!@=]

# Added: accept URLs that contain the characters ? = &

# accept URLs containing certain characters as probable queries, etc.
+[?=&]

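## Crawl the application links /apps/application.php?id=, which appear in pages as dynamic relative links.
## The rule is a regular expression, so the literal "?" and "." are escaped.
+^http://www\.test01\.com/apps/application\.php\?id=[0-9]+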
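In these rule files, everything after the leading `+` or `-` is a regular expression, so `?` and `.` are metacharacters unless they are escaped. The sketch below compares an unescaped and an escaped form of the application.php rule. It uses Python's `re` module as a stand-in for the Java regex engine that Nutch uses (an assumption; the two agree on this syntax). The helper `matches` and the sample URL are illustrative only.

```python
import re

# Hypothetical URL on the example host used in the rule.
url = "http://www.test01.com/apps/application.php?id=7"

# Unescaped: "php?" means "ph" followed by an optional "p", so the
# literal "?" in the URL is never matched.
unescaped = re.compile(r"^http://www.test01.com/apps/application.php?id=([0-9])")

# Escaped: "\?" and "\." match the literal characters.
escaped = re.compile(r"^http://www\.test01\.com/apps/application\.php\?id=[0-9]+")


def matches(pattern, candidate):
    """Return True if the pattern is found in the candidate URL."""
    return pattern.search(candidate) is not None


print(matches(unescaped, url))
print(matches(escaped, url))
```

Only the escaped form matches the real query URL; the unescaped form would instead match a string such as `application.phpid=7`.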

2. vi conf/regex-urlfilter.txt

## Add the same rules that were added in step 1


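Note: both files need to be modified. In Nutch 0.9 the one-step crawl command reads crawl-urlfilter.txt, while the individual whole-web commands (inject, generate, fetch, updatedb) read regex-urlfilter.txt.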
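Both files use the same rule format: each rule is a `+` (accept) or `-` (skip) followed by a regular expression, the first matching rule decides, and a URL that matches no rule is ignored. A minimal Python sketch of that evaluation follows. It is not Nutch's Java implementation; `parse_rules`, `accepts`, and the sample rules are hypothetical and only illustrate why rule order matters.

```python
import re


def parse_rules(text):
    """Parse '+regex' / '-regex' lines, skipping blanks and '#' comments."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError(f"rule must start with + or -: {line!r}")
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules, url):
    """First matching rule decides; a URL matching no rule is rejected."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False


# Hypothetical, shortened rule file for illustration.
RULES = r"""
# skip image suffixes
-\.(gif|jpg|png)$
# accept URLs containing query characters
+[?=&]
# skip everything else
-.
"""

rules = parse_rules(RULES)
for candidate in (
    "http://www.test01.com/apps/application.php?id=7",
    "http://www.test01.com/logo.gif",
    "http://www.test01.com/about.html",
):
    print(candidate, accepts(rules, candidate))
```

Because the first match wins, an accept rule placed after a catch-all `-.` line would never be reached.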

3. vi conf/nutch-default.xml or conf/nutch-site.xml

<property>
<name>urlfilter.order</name>
<value>org.apache.nutch.urlfilter.regex.RegexURLFilter</value>
<description>The order by which url filters are applied.
If empty, all available url filters (as dictated by properties
plugin-includes and plugin-excludes above) are loaded and applied in system
defined order. If not empty, only named filters are loaded and applied
in given order. For example, if this property has value:
org.apache.nutch.urlfilter.regex.RegexURLFilter org.apache.nutch.urlfilter.prefix.PrefixURLFilter
then RegexURLFilter is applied first, and PrefixURLFilter second.
Since all filters are AND'ed, filter ordering does not have impact
on end result, but it may have performance implication, depending
on relative expensiveness of filters.
</description>
</property>
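The description states that all filters are AND'ed, so a URL survives only if every filter in the chain accepts it, and the order affects cost but not the result. A small Python sketch of that behaviour, assuming each filter returns the URL to accept it or `None` to reject it; `regex_filter` and `prefix_filter` are hypothetical stand-ins, not the real Nutch plugins.

```python
import re
from typing import Callable, Iterable, Optional

# A filter returns the URL if it accepts it, or None to reject it.
URLFilter = Callable[[str], Optional[str]]


def regex_filter(url: str) -> Optional[str]:
    """Hypothetical stand-in for RegexURLFilter: accept query-style URLs."""
    return url if re.search(r"[?=&]", url) else None


def prefix_filter(url: str) -> Optional[str]:
    """Hypothetical stand-in for PrefixURLFilter: accept one host prefix."""
    return url if url.startswith("http://www.test01.com/") else None


def apply_filters(filters: Iterable[URLFilter], url: str) -> Optional[str]:
    """Run filters in order and stop at the first rejection (logical AND)."""
    current = url
    for url_filter in filters:
        result = url_filter(current)
        if result is None:
            return None
        current = result
    return current


good = "http://www.test01.com/apps/application.php?id=7"
other_host = "http://other.example/page?id=1"

# The outcome is the same in either order; only the work done differs.
print(apply_filters([regex_filter, prefix_filter], good))
print(apply_filters([prefix_filter, regex_filter], good))
print(apply_filters([regex_filter, prefix_filter], other_host))
```

Putting the cheaper filter first lets the chain reject URLs before the more expensive filter runs.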


4. Edit conf/nutch-default.xml so that every outlink on a page is processed

<property>
<name>db.max.outlinks.per.page</name>
<value>-1</value>
<description>The maximum number of outlinks that we'll process for a page.
If this value is nonnegative (>=0), at most db.max.outlinks.per.page outlinks
will be processed for a page; otherwise, all outlinks will be processed.
</description>
</property>
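The rule in the description can be sketched in a few lines of Python: a non-negative value caps the number of outlinks taken from a page, and a negative value such as -1 keeps them all. The function `limit_outlinks` and the sample links are hypothetical and only mirror the wording of the description, not Nutch's actual code.

```python
from typing import List


def limit_outlinks(outlinks: List[str], max_per_page: int) -> List[str]:
    """Apply the db.max.outlinks.per.page rule as described above."""
    if max_per_page >= 0:
        # Non-negative: keep at most max_per_page outlinks.
        return outlinks[:max_per_page]
    # Negative: process all outlinks.
    return list(outlinks)


# Hypothetical outlinks found on one page.
links = [
    "http://www.test01.com/apps/application.php?id=1",
    "http://www.test01.com/apps/application.php?id=2",
    "http://www.test01.com/apps/application.php?id=3",
]

print(len(limit_outlinks(links, 2)))
print(len(limit_outlinks(links, -1)))
```

With a positive cap, links beyond the limit on a page are dropped, which is why pages with many related links may appear to be only partly crawled.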