Web Crawling and Data Mining with Apache Nutch (translation + study notes)_01

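Over the next two months I (笨小蔥) plan to finish translating this legendary book, which sells for 418 yuan a copy. My English is poor, so I can only give the general meaning as I understand it. If a passage is translated badly, please point it out and I will correct it promptly. Thanks for bearing with me.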

Preface

Apache Nutch is an open source web crawler used for crawling websites. It is extensible and scalable, and it provides facilities for parsing, indexing, and scoring filters for custom implementations. This book is designed to make you comfortable applying web crawling and data mining in your existing application. It demonstrates real-world problems and gives solutions to them with appropriate use cases.

This book demonstrates all the practical implementations hands-on, so readers can work through the examples on their own and get comfortable with them. It covers numerous practical implementations as well as different types of integrations.

1. Getting Started with Apache Nutch

Apache Nutch is a very robust and scalable tool for web crawling; it can be integrated with the scripting language Python for web crawling. You can use it whenever your application contains a huge amount of data and you want to crawl that data.

This chapter introduces Apache Nutch and its installation, and guides you through crawling, parsing, and creating plugins with Apache Nutch. It starts from the basics of how to install Apache Nutch and then gradually takes you to crawling a website and creating your own plugin.

In this chapter we will cover the following topics:

<1> Introducing Apache Nutch
<2> Installing and configuring Apache Nutch
<3> Verifying your Nutch installation
<4> Crawling your first website
<5> Setting up Apache Solr for search
<6> Integrating Solr with Nutch
<7> Crawling websites using the crawl script
<8> Crawling the web, URL filters, and the CrawlDb
<9> Parsing and parsing filters
<10> Nutch plugins and Nutch plugin architecture

By the end of this chapter, you will be comfortable working with Apache Nutch: you will be able to configure it yourself in your own environment, and you will have a clear understanding of how crawling and parsing take place with Apache Nutch. Additionally, you will be able to create your own Nutch plugin.

<1> Introduction to Apache Nutch

Apache Nutch is open source web crawler software used for crawling websites. If you understand Apache Nutch clearly, you can create your own search engine, like Google. Having your own search engine can improve how your application's pages rank in search and lets you customize your application's searching according to your needs. Nutch is extensible and scalable. It facilitates parsing, indexing, creating your own search engine, customizing search according to your needs, scalability, robustness, and ScoringFilter for custom implementations. ScoringFilter is a Java class that is used while creating an Apache Nutch plugin; it is used for manipulating scoring variables.

We can run Apache Nutch on a single machine as well as in a distributed environment such as Apache Hadoop. It is written in Java. Using Apache Nutch, we can find broken links and create a copy of all the visited pages for searching over, for example, while building indexes. We can also find web page hyperlinks in an automated manner.
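To make "finding hyperlinks in an automated manner" and "finding broken links" concrete, here is a minimal Python sketch. It is only a toy illustration using the standard library, not how Nutch works internally; the function names and the `status_by_url` dictionary (URL to HTTP status code, as a crawler might have recorded it) are made up for this example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect the absolute target of every <a href="..."> on a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the URL of the page itself.
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    """Return all hyperlinks found in the given HTML, in document order."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    parser.close()
    return parser.links


def broken_links(links, status_by_url):
    """Return the links whose recorded HTTP status is 400 or higher.

    status_by_url is a hypothetical dict of URL -> status code; links
    without a recorded status are treated as not yet checked.
    """
    return [url for url in links if status_by_url.get(url, 0) >= 400]


page = '<p><a href="/docs">Docs</a> <a href="http://example.org/gone">Old</a></p>'
links = extract_links(page, "http://example.com/index.html")
print(links)
print(broken_links(links, {"http://example.org/gone": 404}))
```

A real crawler would fetch each link to learn its status; the dictionary stands in for that step so the sketch runs without network access.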

Apache Nutch can be integrated with Apache Solr easily, and we can index all the web pages crawled by Apache Nutch into Apache Solr. We can then use Apache Solr to search the web pages indexed by Apache Nutch. Apache Solr is a search platform built on top of Apache Lucene. It can be used for searching any type of data, for example, web pages.
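Once pages are in Solr, searching them comes down to an HTTP request against Solr's select handler. The sketch below only builds such a request URL and does not contact any server. The base URL (a local Solr on its default port 8983) and the `content` field name are assumptions for illustration; the actual path and field names depend on your Solr version and schema.

```python
from urllib.parse import urlencode


def solr_select_url(base_url, query, rows=10):
    """Build a URL for Solr's select handler that asks for JSON results."""
    params = {"q": query, "rows": rows, "wt": "json"}
    return base_url.rstrip("/") + "/select?" + urlencode(params)


# Assumed base URL and field name; adjust both to your own Solr setup.
url = solr_select_url("http://localhost:8983/solr", "content:nutch", rows=5)
print(url)
```

Opening the printed URL in a browser, or fetching it with any HTTP client, would return the matching documents, provided a Solr instance with that layout is running.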

<2> Installing and configuring Apache Nutch

In this section, we cover the installation and configuration steps of Apache Nutch. We first look at the installation dependencies of Apache Nutch. After that, we go through the steps for installing Apache Nutch. Finally, we test the installation by running a crawl with it.

Installation dependencies

The dependencies are as follows:

• Apache Nutch 2.2.1

• HBase 0.90.4

• Ant

• JDK 1.6
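Before going further, it can help to confirm that the JDK and Ant are actually reachable from the command line. The following is a small sketch under the assumption that they are installed under their usual executable names, `java` and `ant`; it only checks that the programs are on PATH and does not verify their versions.

```python
import shutil


def missing_tools(tools):
    """Return the names from `tools` that cannot be found on PATH."""
    return [name for name in tools if shutil.which(name) is None]


# "java" and "ant" are the usual executable names for the JDK and Ant.
missing = missing_tools(["java", "ant"])
if missing:
    print("Not found on PATH:", ", ".join(missing))
else:
    print("java and ant are both on PATH")
```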

Apache Nutch comes in different branches, for example, 1.x, 2.x, and so on. The key difference between Apache Nutch 1.x and Apache Nutch 2.x is that in the former, we have to type each crawling command manually, step by step, as explained later in this chapter. In the latter, the Apache Nutch developers provide a crawl script that does the crawling for us when we run it, so there is no need to type the commands step by step.

(Translator's note: Nutch 1.x and Nutch 2.x differ mainly in their model.)

There may be more differences, but I have covered just one.

I have used Apache Nutch 2.2.1 because it is the latest version at the time of writing this book. The steps for installation and configuration of Apache Nutch are as follows:

(Translator's note: first install JDK 1.6 or later and Ant. If you are not sure how, see this post of mine: http://blog.csdn.net/sunshine920103/article/details/46777981)

1. Download Apache Nutch from the Apache website. You may download Nutch from http://nutch.apache.org/downloads.html.

(Translator's note: the latest release is now 2.3. With 2.3, the MySQL integration later on runs into some problems, so I suggest sticking with Nutch 2.2.1 for now. Scroll down the download page and follow the link there to find the earlier Nutch releases.)

2. Click on apache-nutch-2.2.1-src.tar.gz under the Mirrors column in the Downloads tab. After downloading, change to the directory that holds the file and extract it by typing the following commands:

# cd $NUTCH_HOME
# tar -zxvf apache-nutch-2.2.1-src.tar.gz

Here $NUTCH_HOME is the directory where your Apache Nutch resides.
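For readers who prefer to script this step, the following sketch does roughly what `tar -zxvf` does, using Python's standard `tarfile` module. The function name is made up for this example, and the demo builds a tiny throwaway archive in a temporary directory so that it runs without the real Nutch download.

```python
import tarfile
import tempfile
from pathlib import Path


def extract_tar_gz(archive_path, dest_dir):
    """Extract a .tar.gz archive into dest_dir and return its top-level names."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as archive:
        names = archive.getnames()
        archive.extractall(dest)
    return sorted({name.split("/")[0] for name in names})


# Demo on a throwaway archive; the "pkg" directory is purely illustrative.
with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "pkg"
    source.mkdir()
    (source / "readme.txt").write_text("hello")
    archive_file = Path(tmp) / "pkg.tar.gz"
    with tarfile.open(archive_file, "w:gz") as tar:
        tar.add(source, arcname="pkg")
    out_dir = Path(tmp) / "out"
    top_level = extract_tar_gz(archive_file, out_dir)
    extracted_text = (out_dir / "pkg" / "readme.txt").read_text()

print(top_level, extracted_text)
```

For the real download, the call would look like `extract_tar_gz("apache-nutch-2.2.1-src.tar.gz", ".")`, run from the directory that holds the archive.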

3. Download HBase. You can get it from http://archive.apache.org/dist/hbase/hbase-0.90.4/.

HBase is the Apache Hadoop database: a distributed, scalable big data store used for storing large amounts of data. You should use Apache HBase when you want real-time read/write access to your big data. It provides modular and linear scalability, and its read and write operations are strictly consistent. Here, we will use Apache HBase to store the data crawled by Apache Nutch. We can then log in to the database and access the data according to our needs.

4. We now need to extract HBase, for example, Hbase.x.x.tar.gz. Open a terminal and change to the directory where your Hbase.x.x.tar.gz resides. Then type the following command to extract it:

tar -zxvf Hbase.x.x.tar.gz

It will extract all the files into the respective folder.

5. Now we need to configure HBase. Open hbase-site.xml, which you will find in <Your HBase home>/conf, and modify it as follows:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>hbase.rootdir</name>
    <value><Your path></value>
    <!-- Create a directory and put its path here. Apache HBase uses
         this directory to store all its relevant information. -->
  </property>
  <property>
    <name>hbase.zookeeper.property.dataDir</name>
    <value><Your path></value>
    <!-- Create a directory and put its path here. Apache HBase uses
         this directory to store all information related to Apache
         ZooKeeper, which comes built in with Apache HBase. Apache
         ZooKeeper is an open source server used for distributed
         coordination. You can learn more about it from
         https://cwiki.apache.org/confluence/display/ZOOKEEPER/Index -->
  </property>
</configuration>

Make sure that the /etc/hosts file contains the loopback address, which is 127.0.0.1 (in some Linux distributions, it might be 127.0.1.1). Otherwise you might face an issue while running Apache HBase.
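The configuration from step 5 can also be produced and checked by a short script. This is a sketch only: the function names are invented for this example, and the two directory paths in the demo are placeholders you would replace with directories you created yourself. The second function checks hosts-file text for a loopback entry without reading the real /etc/hosts.

```python
import xml.etree.ElementTree as ET


def build_hbase_site(root_dir, zookeeper_dir):
    """Return hbase-site.xml content holding the two properties from step 5."""
    configuration = ET.Element("configuration")
    for name, value in (
        ("hbase.rootdir", root_dir),
        ("hbase.zookeeper.property.dataDir", zookeeper_dir),
    ):
        prop = ET.SubElement(configuration, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    body = ET.tostring(configuration, encoding="unicode")
    return '<?xml version="1.0"?>\n' + body + "\n"


def has_loopback_entry(hosts_text):
    """True if a non-comment line starts with 127.0.0.1 or 127.0.1.1."""
    for line in hosts_text.splitlines():
        # Drop anything after '#', then look at the first field (the address).
        fields = line.split("#", 1)[0].split()
        if fields and fields[0] in ("127.0.0.1", "127.0.1.1"):
            return True
    return False


# Both paths are placeholders chosen for this sketch.
xml_text = build_hbase_site("/tmp/hbase-data", "/tmp/zookeeper-data")
print(xml_text)
print(has_loopback_entry("127.0.0.1 localhost\n"))
```

To use it for real, you would write `xml_text` to <Your HBase home>/conf/hbase-site.xml and pass the contents of /etc/hosts to `has_loopback_entry`.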

6. Specify the Gora backend in nutch-site.xml. You will find this file at $NUTCH_HOME/conf.

<property>
  <name>storage.data.store.class</name>
  <value>org.apache.gora.hbase.store.HBaseStore</value>
  <description>Default class for storing data</description>
</property>

The explanation of the preceding configuration is as follows:

◦ The property name selects the data store class that Apache Nutch uses for storing its data:
<name>storage.data.store.class</name>
◦ The value names the Gora store class, here the one backed by HBase, in which all the crawled data will reside:
<value>org.apache.gora.hbase.store.HBaseStore</value>
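A quick way to confirm that the setting from step 6 is in place is to read the property back out of the file. The sketch below parses configuration text of this name/value form with the standard library; the function name and the inline sample are illustrative, and in practice you would pass in the contents of $NUTCH_HOME/conf/nutch-site.xml.

```python
import xml.etree.ElementTree as ET


def get_property(config_xml, name):
    """Return the <value> of the named <property>, or None if it is absent."""
    root = ET.fromstring(config_xml)
    for prop in root.iter("property"):
        if prop.findtext("name") == name:
            return prop.findtext("value")
    return None


# Inline sample standing in for the real nutch-site.xml.
sample = """<configuration>
  <property>
    <name>storage.data.store.class</name>
    <value>org.apache.gora.hbase.store.HBaseStore</value>
    <description>Default class for storing data</description>
  </property>
</configuration>"""

print(get_property(sample, "storage.data.store.class"))
```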

7. Make sure that the gora-hbase dependency is available in ivy.xml. You will find this file in <Your Apache Nutch home>/ivy. Put the following configuration into the ivy.xml file:

<dependency org="org.apache.gora" name="gora-hbase" rev="0.2" conf="*->default" />

This line is commented out by default, so you need to uncomment it.
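Whether the dependency from step 7 is really uncommented can be checked mechanically, because an XML parser skips comments: a dependency that is still inside a comment simply does not show up. The sketch below relies on that behaviour of Python's `xml.etree.ElementTree`. The function name is made up, and the two ivy.xml snippets are heavily simplified stand-ins for the real file.

```python
import xml.etree.ElementTree as ET


def has_active_dependency(ivy_xml, name):
    """True if an uncommented <dependency name="..."> exists in the ivy.xml text."""
    root = ET.fromstring(ivy_xml)
    return any(dep.get("name") == name for dep in root.iter("dependency"))


# Simplified stand-in for ivy.xml with the dependency still commented out.
commented = """<ivy-module version="1.0">
  <dependencies>
    <!-- <dependency org="org.apache.gora" name="gora-hbase" rev="0.2" conf="*->default" /> -->
  </dependencies>
</ivy-module>"""

# Removing the comment markers is what "uncommenting" means here.
active = commented.replace("<!-- ", "").replace(" -->", "")

print(has_active_dependency(commented, "gora-hbase"))
print(has_active_dependency(active, "gora-hbase"))
```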

To be continued...

笨小蔥
