Nutch-0.9 Study: Whole-web Crawling

Original post

2020-02-22 18:38

Getting Nutch to fetch related links and dynamic content

1. vi conf/crawl-urlfilter.txt


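# Comment out the default rule that skips URLs containing these characters:
#-[?*!@=]

# Add a rule that accepts URLs containing the characters ? = &

# accept URLs containing certain characters as probable queries, etc.
+[?=&]

## Crawl application links: /apps/application.php?id= appears in pages as a dynamic relative link.
## The characters . and ? are regex metacharacters and must be escaped to match literally.
+^http://www\.test01\.com/apps/application\.php\?id=[0-9]+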
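A rule for a URL with a query string needs `?` and `.` escaped, since both are regex metacharacters. The following is a minimal Python sketch for checking this outside Nutch; it is not Nutch's own code. It assumes the filter semantics described in the stock filter file's comments: each rule is `+` or `-` followed by a regular expression, the first pattern found in the URL decides, and a URL matching no rule is dropped. The helper names `parse_rules` and `accepts` are hypothetical.

```python
import re

def parse_rules(text):
    """Parse '+regex' / '-regex' lines; skip blank lines and '#' comments."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError(f"rule must start with + or -: {line!r}")
        rules.append((sign == "+", re.compile(pattern)))
    return rules

def accepts(rules, url):
    """The first rule whose pattern is found in the URL decides; no match means reject."""
    for allow, pattern in rules:
        if pattern.search(url):
            return allow
    return False

URL = "http://www.test01.com/apps/application.php?id=7"

# Unescaped '?' makes the preceding 'p' optional, so the literal '?' in the URL never matches.
unescaped = parse_rules(r"+^http://www.test01.com/apps/application.php?id=([0-9])")
# Escaped version matches the query string literally.
escaped = parse_rules(r"+^http://www\.test01\.com/apps/application\.php\?id=[0-9]+")

print(accepts(unescaped, URL))
print(accepts(escaped, URL))
```

With only the unescaped rule, the dynamic URL is rejected; with the escaped rule it is accepted. In the full file the broader `+[?=&]` rule would also accept this URL, so the site-specific rule mainly matters when the broad rule is removed.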

2. vi conf/regex-urlfilter.txt

## Add the same rules as in step 1

Note: both files need to be modified, because Nutch loads rules in the order crawl-urlfilter.txt -> regex-urlfilter.txt

3. vi conf/nutch-default.xml or conf/nutch-site.xml


<property>
  <name>urlfilter.order</name>
  <value>org.apache.nutch.urlfilter.regex.RegexURLFilter</value>
  <description>The order by which url filters are applied.
  If empty, all available url filters (as dictated by properties
  plugin-includes and plugin-excludes above) are loaded and applied in system
  defined order. If not empty, only named filters are loaded and applied
  in given order. For example, if this property has value:
  org.apache.nutch.urlfilter.regex.RegexURLFilter org.apache.nutch.urlfilter.prefix.PrefixURLFilter
  then RegexURLFilter is applied first, and PrefixURLFilter second.
  Since all filters are AND'ed, filter ordering does not have impact
  on end result, but it may have performance implication, depending
  on relative expensiveness of filters.
  </description>
</property>
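The property description says the filters are AND'ed, so their order affects cost but not the outcome. A minimal Python sketch of that idea follows. The two filter functions are hypothetical stand-ins, not the real RegexURLFilter or PrefixURLFilter; the sketch only assumes that each filter returns the URL to keep it or nothing to drop it.

```python
import re
from itertools import permutations

def regex_filter(url):
    """Hypothetical stand-in for a regex filter: keep URLs with a numeric id parameter."""
    return url if re.search(r"\?id=[0-9]+", url) else None

def prefix_filter(url):
    """Hypothetical stand-in for a prefix filter: keep URLs under one site."""
    return url if url.startswith("http://www.test01.com/") else None

def apply_filters(filters, url):
    """Pass the URL through each filter in order; stop as soon as one rejects it."""
    for f in filters:
        url = f(url)
        if url is None:
            return None
    return url

urls = [
    "http://www.test01.com/apps/application.php?id=3",
    "http://www.test01.com/about.html",
    "http://other.example/apps/application.php?id=3",
]

# Try both orderings; the surviving URLs are the same either way.
for order in permutations([regex_filter, prefix_filter]):
    kept = [u for u in urls if apply_filters(order, u) is not None]
    print([f.__name__ for f in order], kept)
```

Because a URL survives only if every filter accepts it, placing the cheaper filter first just lets rejected URLs exit earlier.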

4. Edit conf/nutch-default.xml


<property>
  <name>db.max.outlinks.per.page</name>
  <value>-1</value>
  <description>The maximum number of outlinks that we'll process for a page.
  If this value is nonnegative (>=0), at most db.max.outlinks.per.page outlinks
  will be processed for a page; otherwise, all outlinks will be processed.
  </description>
</property>
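The rule in the description (a nonnegative value caps the outlinks, any negative value means no cap) can be sketched in a few lines of Python. This illustrates the documented semantics only, not Nutch's implementation; the function name `limit_outlinks` and the sample links are hypothetical.

```python
def limit_outlinks(outlinks, max_outlinks_per_page):
    """Keep at most max_outlinks_per_page links when the limit is >= 0; otherwise keep all."""
    if max_outlinks_per_page >= 0:
        return outlinks[:max_outlinks_per_page]
    return list(outlinks)

# Five hypothetical outlinks found on one page.
links = [f"http://www.test01.com/apps/application.php?id={i}" for i in range(5)]

print(len(limit_outlinks(links, 2)))   # capped at 2
print(len(limit_outlinks(links, 0)))   # 0 is nonnegative, so nothing is kept
print(len(limit_outlinks(links, -1)))  # negative: all 5 are kept
```

Setting the value to -1, as in step 4, means pages that list many dynamic links lose none of them to the cap.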

lovejuan1314
