Nutch-0.9 Study: Whole-web Crawling

Getting Nutch to fetch related links and dynamic content

1. vi conf/crawl-urlfilter.txt

#+[?*!@=]

# Added: accept links that contain the characters ? = &

# accept URLs containing certain characters as probable queries, etc.
+[?=&]

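## Crawl the application links: /apps/application.php?id= appears in the pages as a dynamic, relative link
## (the . and ? are escaped so that they match literally)
+^http://www\.test01\.com/apps/application\.php\?id=([0-9])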
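
Why the escaping of `?` matters can be checked with a small sketch. It assumes the filter file is read as lines of `+regex` / `-regex`, that the first rule found anywhere in the URL decides, and that a URL matching no rule is rejected. The functions `load_rules` and `accepts` are hypothetical helpers written for this illustration, not Nutch code, and Python's `re` stands in for Java's regex engine, which treats these simple patterns the same way.

```python
import re

def load_rules(text):
    """Parse '+regex' / '-regex' lines; skip blank lines and '#' comments."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError(f"rule must start with + or -: {line!r}")
        rules.append((sign == "+", re.compile(pattern)))
    return rules

def accepts(rules, url):
    """First matching rule decides; a URL that matches no rule is rejected."""
    for accept, pattern in rules:
        if pattern.search(url):  # partial match, like Matcher.find() in Java
            return accept
    return False

URL = "http://www.test01.com/apps/application.php?id=7"

# Unescaped: 'p?' means "an optional p", so the literal '?' in the URL never matches.
unescaped = load_rules(r"+^http://www.test01.com/apps/application.php?id=([0-9])")
# Escaped: '\?' and '\.' match the literal characters.
escaped = load_rules(r"+^http://www\.test01\.com/apps/application\.php\?id=([0-9])")

print(accepts(unescaped, URL), accepts(escaped, URL))
```

Under these assumptions the unescaped rule does not match the sample URL on its own. Such a URL is still accepted by the broader `+[?=&]` rule above, because that rule comes first.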

2. vi conf/regex-urlfilter.txt

## Add the same rules that were added in step 1


Note: both files have to be modified, because Nutch loads the rules in the order crawl-urlfilter.txt -> regex-urlfilter.txt

3. vi conf/nutch-default.xml or conf/nutch-site.xml

<property>
<name>urlfilter.order</name>
<value>org.apache.nutch.urlfilter.regex.RegexURLFilter</value>
<description>The order by which url filters are applied.
If empty, all available url filters (as dictated by properties
plugin-includes and plugin-excludes above) are loaded and applied in system
defined order. If not empty, only named filters are loaded and applied
in given order. For example, if this property has value:
org.apache.nutch.urlfilter.regex.RegexURLFilter org.apache.nutch.urlfilter.prefix.PrefixURLFilter
then RegexURLFilter is applied first, and PrefixURLFilter second.
Since all filters are AND'ed, filter ordering does not have impact
on end result, but it may have performance implication, depending
on relative expensiveness of filters.
</description>
</property>
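
The description's remark that the filters are AND'ed can be illustrated with a minimal sketch. The two filter functions are hypothetical stand-ins, not the real RegexURLFilter or PrefixURLFilter. Each one either returns the URL unchanged or returns `None` to reject it, and under that assumption the order only changes how much work is done, not the outcome.

```python
import re
from itertools import permutations

def regex_filter(url):
    """Hypothetical regex-style filter: accept URLs that carry a numeric id."""
    return url if re.search(r"\?id=[0-9]", url) else None

def prefix_filter(url):
    """Hypothetical prefix-style filter: accept URLs from one site only."""
    return url if url.startswith("http://www.test01.com/") else None

def apply_filters(filters, url):
    """Run the filters in the given order; stop at the first rejection."""
    for f in filters:
        url = f(url)
        if url is None:
            return None
    return url

def same_result_in_any_order(filters, url):
    """True if every ordering of the filters gives the same accept/reject result."""
    results = {apply_filters(order, url) for order in permutations(filters)}
    return len(results) == 1

good = "http://www.test01.com/apps/application.php?id=7"
print(apply_filters([regex_filter, prefix_filter], good))
print(same_result_in_any_order([regex_filter, prefix_filter], good))
```

Putting the cheaper filter first lets the chain reject a URL before the more expensive one runs, which is the performance implication the description mentions.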


4. Edit conf/nutch-default.xml

<property>
<name>db.max.outlinks.per.page</name>
<value>-1</value>
<description>The maximum number of outlinks that we'll process for a page.
If this value is nonnegative (>=0), at most db.max.outlinks.per.page outlinks
will be processed for a page; otherwise, all outlinks will be processed.
</description>
</property>
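
The effect of the value -1 follows directly from the description and can be sketched as below. The function `limit_outlinks` is a hypothetical helper, not Nutch code, and the limit of 100 is only an example value.

```python
def limit_outlinks(outlinks, max_outlinks_per_page):
    """Keep at most max_outlinks_per_page links when the limit is >= 0, else keep all."""
    if max_outlinks_per_page >= 0:
        return outlinks[:max_outlinks_per_page]
    return list(outlinks)

# 150 made-up outlinks found on one page
links = [f"http://www.test01.com/apps/application.php?id={i}" for i in range(150)]

print(len(limit_outlinks(links, 100)))  # nonnegative limit: the list is truncated
print(len(limit_outlinks(links, -1)))   # negative limit: every outlink is kept
```

With a nonnegative limit, the application links beyond the cutoff on a link-heavy page would never be processed, which is why the value is set to -1 here.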