Nutch-0.9 Study: Whole-web Crawling

Original

2020-02-22 18:38

Getting Related Links and Dynamic Content with Nutch

1. vi conf/crawl-urlfilter.txt


#+[?*!@=]

# Added: accept links that contain the characters ? = &

# accept URLs containing certain characters as probable queries, etc.
+[?=&]

## Crawl the application links: /apps/application.php?id= appears in the pages as a dynamic, relative link
+^http://www\.test01\.com/apps/application\.php\?id=[0-9]+
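
A quick way to sanity-check a filter line is to try its regex against a sample URL. The sketch below uses Python's `re` module as a stand-in for the Java regex engine that Nutch uses (an assumption; for patterns this simple the two should behave the same), and the sample URL with `id=7` is made up for illustration. It shows that an unescaped `?` makes the preceding `p` optional instead of matching a literal question mark, so a dynamic URL only matches once `?` and `.` are escaped.

```python
import re

# Hypothetical sample URL, built from the host and path used in this post.
url = "http://www.test01.com/apps/application.php?id=7"

# Unescaped: "php?" means "ph" followed by an optional "p", so the literal
# "?" in the URL is never matched and the pattern fails on this URL.
unescaped = re.compile(r"^http://www.test01.com/apps/application.php?id=([0-9])")

# Escaped: "\?" and "\." match the literal characters.
escaped = re.compile(r"^http://www\.test01\.com/apps/application\.php\?id=[0-9]+")

print(bool(unescaped.search(url)))  # False
print(bool(escaped.search(url)))    # True
```

Note that a broad rule such as `+[?=&]` placed earlier in the file would already accept this URL, so the more specific line mostly matters when the broad rule is absent.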

2. vi conf/regex-urlfilter.txt

## Add the same rules as in step 1

Note: both files need to be modified, because Nutch loads the filter rules in the order crawl-urlfilter.txt -> regex-urlfilter.txt
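
To see how the lines in one filter file interact, here is a minimal sketch of the way such a file is commonly evaluated: rules are tried from top to bottom, the first pattern found in the URL decides, `+` accepts, `-` rejects, and a URL that matches no rule is rejected. This is a simplified Python model, not Nutch's Java code; the function names `parse_rules` and `accept` and the sample rules are illustrative assumptions.

```python
import re

def parse_rules(text):
    """Turn filter-file text into a list of (accept, compiled_regex) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError("rule must start with + or -: " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules

def accept(rules, url):
    """The first rule whose pattern is found in the URL decides; default is reject."""
    for is_accept, regex in rules:
        if regex.search(url):
            return is_accept
    return False

# Illustrative rule set, not a copy of any shipped Nutch file.
SAMPLE = """
# skip non-http schemes
-^(file|ftp|mailto):
# accept URLs that look like queries
+[?=&]
# skip everything else
-.
"""

rules = parse_rules(SAMPLE)
print(accept(rules, "http://www.test01.com/apps/application.php?id=7"))  # True
print(accept(rules, "http://www.test01.com/index.html"))                 # False
print(accept(rules, "ftp://www.test01.com/file?x=1"))                    # False
```

Because the first match wins, the position of an accept line relative to the skip lines matters as much as the pattern itself.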

3. vi conf/nutch-default.xml or conf/nutch-site.xml


<property>
  <name>urlfilter.order</name>
  <value>org.apache.nutch.urlfilter.regex.RegexURLFilter</value>
  <description>The order by which url filters are applied.
  If empty, all available url filters (as dictated by properties
  plugin-includes and plugin-excludes above) are loaded and applied in system
  defined order. If not empty, only named filters are loaded and applied
  in given order. For example, if this property has value:
  org.apache.nutch.urlfilter.regex.RegexURLFilter org.apache.nutch.urlfilter.prefix.PrefixURLFilter
  then RegexURLFilter is applied first, and PrefixURLFilter second.
  Since all filters are AND'ed, filter ordering does not have impact
  on end result, but it may have performance implication, depending
  on relative expensiveness of filters.
  </description>
</property>
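
Since this step can be done in either file, it may help to recall the usual convention that a value set in nutch-site.xml takes precedence over the same property in nutch-default.xml. The sketch below models that with two small XML strings parsed by Python's standard `xml.etree.ElementTree`. The file contents and the helper `load_properties` are made up for illustration; Nutch itself resolves this through its Java configuration classes, not through code like this.

```python
import xml.etree.ElementTree as ET

def load_properties(xml_text):
    """Read <property> name/value pairs from a Hadoop-style config document."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.findall("property"):
        name = prop.findtext("name")
        value = prop.findtext("value", default="")
        if name:
            props[name.strip()] = value.strip()
    return props

# Illustrative stand-ins for conf/nutch-default.xml and conf/nutch-site.xml.
DEFAULT_XML = """<configuration>
  <property><name>urlfilter.order</name><value></value></property>
  <property><name>db.max.outlinks.per.page</name><value>100</value></property>
</configuration>"""

SITE_XML = """<configuration>
  <property>
    <name>urlfilter.order</name>
    <value>org.apache.nutch.urlfilter.regex.RegexURLFilter</value>
  </property>
</configuration>"""

effective = load_properties(DEFAULT_XML)
effective.update(load_properties(SITE_XML))  # site values win over defaults

print(effective["urlfilter.order"])
print(effective["db.max.outlinks.per.page"])
```

Keeping local changes in nutch-site.xml leaves nutch-default.xml untouched, which makes it easier to see later what was changed.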

4. Modify conf/nutch-default.xml


<property>
  <name>db.max.outlinks.per.page</name>
  <value>-1</value>
  <description>The maximum number of outlinks that we'll process for a page.
  If this value is nonnegative (>=0), at most db.max.outlinks.per.page outlinks
  will be processed for a page; otherwise, all outlinks will be processed.
  </description>
</property>
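
The effect of this setting, as the description words it, can be sketched in a few lines: a nonnegative value caps the number of outlinks kept per page, and a negative value such as -1 keeps them all. The function `limit_outlinks` and the generated sample links are hypothetical and only mirror the description; they are not Nutch's implementation.

```python
def limit_outlinks(outlinks, max_per_page):
    """Keep at most max_per_page outlinks when the limit is >= 0, else keep all."""
    if max_per_page >= 0:
        return outlinks[:max_per_page]
    return list(outlinks)

# Five made-up outlinks in the style of the URLs used in this post.
links = ["http://www.test01.com/apps/application.php?id=%d" % i for i in range(5)]

print(len(limit_outlinks(links, 3)))   # 3
print(len(limit_outlinks(links, 0)))   # 0
print(len(limit_outlinks(links, -1)))  # 5
```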