Nutch Tutorial Chinese Translation 1 (Official Tutorial, Chinese-English) — Building, Installing, and Running Nutch

This tutorial is a translation of the official Nutch tutorial, done paragraph by paragraph, with my own explanations added.

This article was originally published on the 精簡導航 blog on CSDN and is continually revised and updated; copies appearing on other sites are reprints and may be incomplete, so please refer to the original page.

Although this is a tutorial for Nutch 1.x, the official Nutch 2.x tutorial only explains how to configure the new features; the basics for Nutch 2.x are still covered in this tutorial.


Introduction

Apache Nutch is an open source Web crawler written in Java. By using it, we can find Web page hyperlinks in an automated manner, reduce a lot of maintenance work (for example, checking broken links), and create a copy of all the visited pages for searching over. That is where Apache Solr comes in. Solr is an open source full-text search framework; with Solr we can search the pages visited by Nutch. Luckily, integration between Nutch and Solr is pretty straightforward, as explained below.

Apache Nutch supports Solr out of the box, greatly simplifying Nutch-Solr integration. It also removes the legacy dependence upon both Apache Tomcat for running the old Nutch Web Application and upon Apache Lucene for indexing. Just download a binary release from here.


Introduction (translation)

Apache Nutch is an open-source Java web crawler. Nutch automatically manages hyperlink information for us, greatly reducing maintenance work, for example by detecting broken links and keeping copies of the visited pages to hand over for searching.

Solr is an open-source full-text search framework. Through Solr we can search the pages that Nutch has crawled. Fortunately, integrating Nutch and Solr is very simple.

Apache Nutch supports Solr out of the box, which greatly simplifies the integration of Nutch and Solr. Current releases have removed the legacy modules of older versions that used Tomcat (for the old web application) and Lucene (for indexing).


Unofficial notes:

1. Nutch is a web crawler; in a search engine it is responsible for fetching pages, while automatically maintaining URL information, e.g. de-duplicating identical pages, re-fetching pages on a schedule, and handling redirects.

2. Current versions of Nutch do not include a search feature themselves, but they can automatically submit the crawled pages to a search server. The search server, e.g. Solr, is a separate open-source project that you have to download yourself.

3. Nutch's built-in commands let you control whether crawled pages are submitted to the indexing server.

4. Although Nutch is an excellent distributed crawler framework, its whole design serves search engines. It is developed on Hadoop's MapReduce framework and is not particularly suited to data-extraction workloads. If your task is (fine-grained) data extraction rather than a search engine, Nutch is not necessarily the right choice.



Requirements


Requirements (translation)

  • Unix (Linux), or Windows with Cygwin installed

  • JDK 1.5 or later

  • Apache Ant


Unofficial notes:

         1. Developing Nutch on Linux/Unix is strongly recommended. If you do not have Linux, run a Linux virtual machine on Windows.

         2. Apache Ant is essential. The whole Nutch build is driven by a configuration file called build.xml, which requires Ant to run. The official Nutch source does not ship Eclipse project files, so Eclipse cannot build Nutch directly. You can use Apache Ant to convert the official source into an Eclipse project, but that is not ideal.

         3. Before following the rest of this tutorial, you must install Linux (or Unix/Cygwin), a JDK and Apache Ant; otherwise the steps below will not work. Installing them may take a few hours, but it is necessary.
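A quick way to confirm the prerequisites before moving on is shown below; the exact version strings will differ on your machine, this is only a sanity check:

java -version     # should report 1.5 or later
ant -version      # should report the installed Apache Ant version
echo $JAVA_HOME   # should point at your JDK installation directory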

1. Install Nutch


Option 1: Set up Nutch from a binary distribution

  • Download a binary package (apache-nutch-1.X-bin.zip) from here.

  • Unzip your binary Nutch package. There should be a folder apache-nutch-1.X.

  • cd apache-nutch-1.X/

From now on, we are going to use ${NUTCH_RUNTIME_HOME} to refer to the current directory (apache-nutch-1.X/).


Option 1: Set up Nutch from a binary distribution (translation)

          1. Download the Nutch 1.X binary package.

          2. Unzip the downloaded package. It should contain a folder apache-nutch-1.X.

          3. From the command line, cd into the apache-nutch-1.X folder.

To simplify the description, the rest of this article uses ${NUTCH_RUNTIME_HOME} to refer to this apache-nutch-1.X folder.
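Put together, the binary installation is just a few shell commands. The mirror URL and the 1.X version below are placeholders; substitute the release you actually chose from the Apache download page:

wget https://archive.apache.org/dist/nutch/1.X/apache-nutch-1.X-bin.zip
unzip apache-nutch-1.X-bin.zip
cd apache-nutch-1.X/        # this directory is ${NUTCH_RUNTIME_HOME}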



Option 2: Set up Nutch from a source distribution

Advanced users may also use the source distribution:

  • Download a source package (apache-nutch-1.X-src.zip)

  • Unzip
  • cd apache-nutch-1.X/

  • Run ant in this folder (cf. RunNutchInEclipse)

  • Now there is a directory runtime/local which contains a ready to use Nutch installation.

When the source distribution is used, ${NUTCH_RUNTIME_HOME} refers to apache-nutch-1.X/runtime/local/. Note that

  • config files should be modified in apache-nutch-1.X/runtime/local/conf/

  • ant clean will remove this directory (keep copies of modified config files)

Option 2: Set up Nutch from a source distribution (translation)

Advanced users can also build Nutch from source:

          1. Download the Nutch 1.X source package.

          2. Unzip the downloaded package.

          3. From the command line, cd into the apache-nutch-1.X folder.

          4. In the apache-nutch-1.X folder, run the command: ant

          5. When the command finishes, a runtime folder appears inside apache-nutch-1.X; runtime/local contains the successfully built Nutch.

To simplify the description, the rest of this article uses ${NUTCH_RUNTIME_HOME} to refer to the apache-nutch-1.X/runtime/local folder. Note:

           1. If you need to modify Nutch's configuration files, edit them in apache-nutch-1.X/runtime/local/conf/.

           2. Running ant clean removes the runtime/local directory, so back up any modified configuration files before running ant clean.
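A minimal command sequence for the source build might look like the following; 1.X is again a placeholder for the release you downloaded:

unzip apache-nutch-1.X-src.zip
cd apache-nutch-1.X/
ant                    # runs the default target in build.xml
cd runtime/local/      # ${NUTCH_RUNTIME_HOME} for a source build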


2. Verify your Nutch installation

  • run "bin/nutch" - You can confirm a correct installation if you see something similar to the following:

Usage: nutch COMMAND where command is one of:
crawl             one-step crawler for intranets (DEPRECATED)
readdb            read / dump crawl db
mergedb           merge crawldb-s, with optional filtering
readlinkdb        read / dump link db
inject            inject new urls into the database
generate          generate new segments to fetch from crawl db
freegen           generate new segments to fetch from text files
fetch             fetch a segment's pages
...

Some troubleshooting tips:

  • Run the following command if you are seeing "Permission denied":

chmod +x bin/nutch
  • Setup JAVA_HOME if you are seeing JAVA_HOME not set. On Mac, you can run the following command or add it to ~/.bashrc:

export JAVA_HOME=/System/Library/Frameworks/JavaVM.framework/Versions/1.6/Home

On Debian or Ubuntu, you can run the following command or add it to ~/.bashrc:

export JAVA_HOME=$(readlink -f /usr/bin/java | sed "s:bin/java::")


2. Verify your Nutch installation (translation)

      Run the bin/nutch command in the ${NUTCH_RUNTIME_HOME} folder. If you see output like the following, the installation succeeded:

Usage: nutch COMMAND where command is one of:
crawl             one-step crawler for intranets (DEPRECATED)
readdb            read / dump crawl db
mergedb           merge crawldb-s, with optional filtering
readlinkdb        read / dump link db
inject            inject new urls into the database
generate          generate new segments to fetch from crawl db
freegen           generate new segments to fetch from text files
fetch             fetch a segment's pages
...
If you see a "Permission denied" error, run the following command:

chmod +x bin/nutch

If you see JAVA_HOME not set, set the JAVA_HOME environment variable. On a Mac, run the following command, or add this line to your ~/.bashrc:

export JAVA_HOME=/System/Library/Frameworks/JavaVM.framework/Versions/1.6/Home

On Debian or Ubuntu, run the following command, or add this line to your ~/.bashrc:

export JAVA_HOME=$(readlink -f /usr/bin/java | sed "s:bin/java::")

3. Crawl your first website

Nutch requires two configuration changes before a website can be crawled:

  1. Customize your crawl properties, where at a minimum, you provide a name for your crawler for external servers to recognize
  2. Set a seed list of URLs to crawl

3. Crawl your first website (translation)

    Two configuration changes must be made before a website can be crawled:

     1. Customize the crawl properties; at a minimum, give your crawler a name so external servers can recognize it.

      2. Provide a seed list of URLs to crawl.


3.1 Customize your crawl properties

  • Default crawl properties can be viewed and edited within conf/nutch-default.xml - where most of these can be used without modification

  • The file conf/nutch-site.xml serves as a place to add your own custom crawl properties that overwrite conf/nutch-default.xml. The only required modification for this file is to override the value field of the http.agent.name property.

    • i.e. add your agent name in the value field of the http.agent.name property in conf/nutch-site.xml, for example:

<property>
 <name>http.agent.name</name>
 <value>My Nutch Spider</value>
</property>


3.1 Customize your crawl properties (translation)

        1. The default crawl properties are stored in conf/nutch-default.xml. Most of them can be used without modification.

         2. conf/nutch-site.xml is where you add your own custom configuration; properties set in conf/nutch-site.xml override those in conf/nutch-default.xml. Only one property must be changed: http.agent.name.

         For example, add http.agent.name to conf/nutch-site.xml:

<property>
 <name>http.agent.name</name>
 <value>My Nutch Spider</value>
</property>
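The snippet above is only the property element itself; conf/nutch-site.xml must be a complete configuration file. A minimal sketch of the whole file, written here as a shell here-document, is shown below (the agent name is just the example value from above, and this overwrites the essentially empty nutch-site.xml shipped with the release):

cat > conf/nutch-site.xml <<'EOF'
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>My Nutch Spider</value>
  </property>
</configuration>
EOF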


3.2 Create a URL seed list

  • A URL seed list includes a list of websites, one-per-line, which nutch will look to crawl
  • The file conf/regex-urlfilter.txt will provide Regular Expressions that allow nutch to filter and narrow the types of web resources to crawl and download

Create a URL seed list

  • mkdir -p urls

  • cd urls

  • touch seed.txt to create a text file seed.txt under urls/ with the following content (one URL per line for each site you want Nutch to crawl).

http://nutch.apache.org/

(Optional) Configure Regular Expression Filters

Edit the file conf/regex-urlfilter.txt and replace

# accept anything else
+.

with a regular expression matching the domain you wish to crawl. For example, if you wished to limit the crawl to the nutch.apache.org domain, the line should read:

 +^http://([a-z0-9]*\.)*nutch.apache.org/

This will include any URL in the domain nutch.apache.org.

NOTE: Not specifying any domains to include within regex-urlfilter.txt will lead to all domains linking to your seed URLs file being crawled as well.


3.2 Create a URL seed list (translation)

1. The URL seed list contains the websites to crawl, one URL per line.

2. conf/regex-urlfilter.txt uses regular expressions to restrict the scope of the crawl.

Create a URL seed list

  • mkdir -p urls

  • cd urls

  • touch seed.txt

Edit seed.txt and set its content to the following (a compact command-line alternative is shown after this section):

http://nutch.apache.org/

(Optional) Configure regular expression filters

Edit conf/regex-urlfilter.txt and replace

# accept anything else
+.

with a regular expression matching the domain you want to crawl. For example, to restrict the crawl to the nutch.apache.org domain, replace it with:

+^http://([a-z0-9]*\.)*nutch.apache.org/
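As a compact sketch, the seed list can also be created in one go from ${NUTCH_RUNTIME_HOME}; the example seed URL is the same one used above:

mkdir -p urls
echo "http://nutch.apache.org/" > urls/seed.txt
cat urls/seed.txt    # verify: one URL per line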

3.3 Using the Crawl Command

The crawl command is deprecated. Please see section 3.5 on how to use the crawl script that is intended to replace the crawl command.

Now we are ready to initiate a crawl, use the following parameters:

  • -dir dir names the directory to put the crawl in.

  • -threads threads determines the number of threads that will fetch in parallel.

  • -depth depth indicates the link depth from the root page that should be crawled.

  • -topN N determines the maximum number of pages that will be retrieved at each level up to the depth.

  • Run the following command:

bin/nutch crawl urls -dir crawl -depth 3 -topN 5
  • Now you should be able to see the following directories created:

crawl/crawldb
crawl/linkdb
crawl/segments

NOTE: If you have a Solr core already set up and wish to index to it, you are required to add the -solr <solrUrl> parameter to your crawl command e.g.

bin/nutch crawl urls -solr http://localhost:8983/solr/ -depth 3 -topN 5

If not then please skip to here for how to set up your Solr instance and index your crawl data.

Typically one starts testing one's configuration by crawling at shallow depths, sharply limiting the number of pages fetched at each level (-topN), and watching the output to check that desired pages are fetched and undesirable pages are not. Once one is confident of the configuration, then an appropriate depth for a full crawl is around 10. The number of pages per level (-topN) for a full crawl can be from tens of thousands to millions, depending on your resources.

3.3 Using the crawl command (translation)

The crawl command is deprecated; see the later section on how to use the crawl script that replaces it.

The crawl command takes the following parameters:

  • -dir dir names the directory to put the crawl in.

  • -threads threads determines the number of fetch threads.

  • -depth depth indicates the link depth from the root page that should be crawled.

  • -topN N determines the maximum number of pages retrieved at each level, up to the depth.

  • Run the following command:

bin/nutch crawl urls -dir crawl -depth 3 -topN 5
  • After the command finishes, the following directories have been created:

crawl/crawldb
crawl/linkdb
crawl/segments

Note: if you already have a Solr server set up and want it to index Nutch's crawl results, add the -solr <solrUrl> parameter to the crawl command:

bin/nutch crawl urls -solr http://localhost:8983/solr/ -depth 3 -topN 5

If you do not have a Solr server, see "here" for how to set up a Solr instance and index your crawl data.

Typically, -depth and -topN are kept very small while testing the configuration, so you can check that the desired pages are fetched. Once you are satisfied with the results, a depth of around 10 is appropriate for a full crawl; -topN for a full crawl can range from tens of thousands to millions of pages per level, depending on your resources and the sites you crawl.
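After a crawl finishes, a quick way to sanity-check the result is the readdb command listed in the usage output earlier; the path below assumes the -dir crawl layout used in this section:

bin/nutch readdb crawl/crawldb -stats    # print summary statistics for the crawl database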
