Getting Started with Apache Nutch (v1.14)

Apache Nutch

Apache Nutch grew out of the Apache Lucene project. It is a highly extensible and highly scalable open-source web
crawler. Project home page:

http://nutch.apache.org/

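To support different kinds of underlying data storage, the project is currently developed on two code branches:

   ● Nutch 1.x : A mature, production-ready web crawler. This branch allows fine-grained configuration and relies on
                  Apache Hadoop data structures, which are well suited to batch processing. The latest release on this
                  branch is Nutch 1.14, released on 23 December 2017 and built against Hadoop 2.7.4.

                  The release report for this version is available at:

                  https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=10680&version=12340218

   ● Nutch 2.x : A newer branch derived from 1.x. The key difference is that the concrete data store is abstracted
                  into a storage layer, with Apache Gora as the intermediate layer that handles object-to-persistence
                  mapping. This lets users implement a very flexible data model for storing everything, for example
                  fetch time, status, content, parsed text, outlinks and inlinks, in many kinds of NoSQL storage
                  solutions.

                  The latest release on this branch is Nutch 2.3.1, released on 21 January 2016. The release report
                  for this version is available at:

                  https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=10680&version=12329371

                  The Gora backends recommended for this release are: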

                   □ Apache Avro 1.7.6
                   □ Apache Hadoop 1.2.1 and 2.5.2
                   □ Apache HBase 0.98.8-hadoop2 (although also tested with 1.X)
                   □ Apache Cassandra 2.0.2
                   □ Apache Solr 4.10.3
                   □ MongoDB 2.6.X
                   □ Apache Accumulo 1.5.1
                   □ Apache Spark 1.4.1

                  According to the Nutch board meeting minutes of 2018-01-17 and 2018-04-18, Nutch 2.4 is under
                  development and is planned for release within a few weeks:

                   https://whimsy.apache.org/board/minutes/Nutch.html

                   ...
                   The release of 2.4 is on the agenda.

                   ...
                   We plan to release 2.4 during the next weeks.


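Thanks to its pluggable, modular design, Nutch provides extension interfaces such as Parse, Index and ScoringFilter for
custom implementations. Apache Tika, for example, is used as a Parse implementation. Pluggable indexing modules also
exist for Apache Solr, SolrCloud, Elasticsearch and others.

Nutch can run on a single machine, but to benefit from its scalable distributed processing it has to run on a Hadoop
cluster.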

1. Requirements:
-------------------------------------------------------------------------------------------------------------------------
   ● Linux/Unix environment, or Cygwin on Windows
   ● JDK 1.8/Java 8
   ● Apache Ant (only needed when building from source)

2. Install Nutch:
-------------------------------------------------------------------------------------------------------------------------
Download page:

http://nutch.apache.org/downloads.html

There are two ways to install Nutch locally:

2.1 Installing from a binary distribution
-------------------------------------------------------------------------------------------------------------------------
① Download the Nutch binary distribution:

   Prefer a mirror close to you for faster downloads. The Tsinghua mirror, for example:

   https://mirrors.tuna.tsinghua.edu.cn/apache/nutch/1.14/apache-nutch-1.14-bin.tar.gz

② Unpack the downloaded archive:

       [devalone@nutch ~]$ tar -zxvf apache-nutch-1.14-bin.tar.gz

       [devalone@nutch ~]$ ll
       總用量 243272
       drwxrwxr-x 7 devalone devalone       123 7月 19 15:50 apache-nutch-1.14
       -rw-rw-r-- 1 devalone devalone 249107211 7月 19 15:48 apache-nutch-1.14-bin.tar.gz

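③ Install apache-nutch-1.14

   Move the extracted apache-nutch-1.14 directory to a suitable install location, for example under /opt/nutch/:

       [devalone@nutch ~]$ sudo mv apache-nutch-1.14 /opt/nutch/

   Create symbolic links so that different Nutch versions are easy to manage, for example: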

   [devalone@nutch nutch]$ pwd
   /opt/nutch
   [devalone@nutch nutch]$ ls -l
   總用量 0
   drwxrwxr-x 7 devalone devalone 123 6月 29 16:52 apache-nutch-1.14
   lrwxrwxrwx 1 root     root       8 6月 29 17:01 nutch -> nutch1.x
   lrwxrwxrwx 1 root     root      17 6月 29 17:00 nutch1.x -> apache-nutch-1.14
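
   The listing shows the resulting links but not the commands that created them. A minimal sketch, assuming the
   two-level layout above (nutch -> nutch1.x -> apache-nutch-1.14). The function name link_nutch_version is
   hypothetical, and under /opt/nutch the commands would need root privileges:

```shell
#!/usr/bin/env bash
# Sketch: recreate the two-level symlink layout shown above.
# link_nutch_version is a hypothetical helper, not part of Nutch.
link_nutch_version() {
    local base="$1"      # install root, e.g. /opt/nutch
    local release="$2"   # release directory name, e.g. apache-nutch-1.14
    (
        cd "$base" || exit 1
        # -n replaces an existing link instead of descending into its target
        ln -sfn "$release" nutch1.x
        ln -sfn nutch1.x nutch
    )
}

# Demonstration in a throwaway directory so nothing under /opt is touched.
demo_root="$(mktemp -d)"
mkdir "$demo_root/apache-nutch-1.14"
link_nutch_version "$demo_root" apache-nutch-1.14
ls -l "$demo_root"
```

   With this layout, switching to another 1.x release only requires re-pointing nutch1.x; paths that go through
   /opt/nutch/nutch stay the same.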

   Edit /etc/profile: define the NUTCH_INSTALL environment variable and add nutch/bin to PATH, as shown below:

       JAVA_HOME=/usr/java/default
       PATH=$JAVA_HOME/bin:$PATH

       NUTCH_INSTALL=/opt/nutch/nutch
       PATH=$NUTCH_INSTALL/bin:$PATH

       export PATH JAVA_HOME NUTCH_INSTALL

   Save the file and run source /etc/profile to apply the change immediately.


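   NOTE:
   ---------------------------------------------------------------------------------------------------------------------
   The NUTCH_HOME environment variable is used internally by the nutch script, so avoid setting it in your shell
   environment to prevent unexpected behavior.


   NOTE:
   ---------------------------------------------------------------------------------------------------------------------
   The binary distribution released by Nutch only works in local mode and is not suitable for distributed operation.
   To run Nutch on Hadoop in distributed or pseudo-distributed mode, it must be built from source.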

2.2 Installing from a source distribution
-------------------------------------------------------------------------------------------------------------------------

2.2.1 Setting up the Nutch build environment
-------------------------------------------------------------------------------------------------------------------------
Building Nutch from source requires Ant together with the Ivy dependency library. Set them up as follows:

[devalone@online apache]$ wget http://mirrors.tuna.tsinghua.edu.cn/apache//ant/binaries/apache-ant-1.10.5-bin.tar.gz
[devalone@online apache]$ wget http://mirrors.hust.edu.cn/apache//ant/ivy/2.5.0-rc1/apache-ivy-2.5.0-rc1-bin.tar.gz

   [devalone@online apache]$ ll
   總用量 8352
   -rw-rw-r--. 1 devalone devalone 5829814 7月 13 18:10 apache-ant-1.10.5-bin.tar.gz
   -rw-rw-r--. 1 devalone devalone 2716983 4月 12 23:46 apache-ivy-2.5.0-rc1-bin.tar.gz

[devalone@nutch apache]$ tar -zxvf apache-ant-1.10.5-bin.tar.gz
[devalone@nutch apache]$ tar -zxvf apache-ivy-2.5.0-rc1-bin.tar.gz

   [devalone@nutch apache]$ sudo mv apache-ant-1.10.5 /usr/local/lib/apache/ant/
   [devalone@nutch apache]$ cd /usr/local/lib/apache/ant/
   [devalone@nutch ant]$ sudo ln -s apache-ant-1.10.5 ant

[devalone@nutch ant]$ pwd
/usr/local/lib/apache/ant/ant

Copy the Ivy jar into Ant's lib/ directory:

   [devalone@nutch apache-ivy-2.5.0-rc1]$ cp ivy-2.5.0-rc1.jar /usr/local/lib/apache/ant/ant/lib/

   NOTE:
   ---------------------------------------------------------------------------------------------------------------------
   The Nutch source package already ships ivy-2.4.0.jar in its ivy/ directory. That jar can be used directly instead
   of downloading Ivy separately.

Edit /etc/profile, set the ANT_HOME environment variable, and add ant/bin to PATH:

JAVA_HOME=/usr/java/default
PATH=$JAVA_HOME/bin:$PATH

ANT_HOME=/usr/local/lib/apache/ant/ant
PATH=$ANT_HOME/bin:$PATH

NUTCH_INSTALL=/opt/nutch/nutch
PATH=$NUTCH_INSTALL/bin:$PATH

export PATH JAVA_HOME ANT_HOME NUTCH_INSTALL

Save and exit, then run:

[devalone@nutch ant]$ source /etc/profile

to apply the change immediately.

   [devalone@nutch ant]$ ant -version
   Apache Ant(TM) version 1.10.5 compiled on July 10 2018

The build environment is now ready.

2.2.2 Building and installing Nutch from source
-------------------------------------------------------------------------------------------------------------------------
① Download the Nutch source package, again from the Tsinghua mirror:

https://mirrors.tuna.tsinghua.edu.cn/apache/nutch/1.14/apache-nutch-1.14-src.tar.gz

② Unpack the source package:

[devalone@nutch ~]$ tar -zxvf apache-nutch-1.14-src.tar.gz

   [devalone@nutch ~]$ ll
   總用量 5268
   drwxrwxr-x 3 devalone devalone     109 7月 20 10:12 apache
   drwxrwxr-x 7 devalone devalone     162 7月 20 10:36 apache-nutch-1.14
   -rw-rw-r-- 1 devalone devalone 5390394 7月 20 10:36 apache-nutch-1.14-src.tar.gz

③ Enter the extracted source directory, apache-nutch-1.14:

[devalone@nutch ~]$ cd apache-nutch-1.14

   [devalone@nutch apache-nutch-1.14]$ ll
   總用量 504
   -rw-rw-r-- 1 devalone devalone 54529 12月 19 2017 build.xml
   -rw-rw-r-- 1 devalone devalone 106848 12月 19 2017 CHANGES.txt
   drwxr-xr-x 2 devalone devalone   4096 7月 20 10:36 conf
   -rw-rw-r-- 1 devalone devalone   6297 12月 19 2017 default.properties
   drwxr-xr-x 3 devalone devalone     17 12月 19 2017 docs
   drwxr-xr-x 2 devalone devalone    148 7月 20 10:36 ivy
   drwxr-xr-x 3 devalone devalone     20 12月 19 2017 lib
   -rw-rw-r-- 1 devalone devalone 329066 12月 19 2017 LICENSE.txt
   -rw-rw-r-- 1 devalone devalone    427 12月 19 2017 NOTICE.txt
   drwxr-xr-x 7 devalone devalone     76 12月 19 2017 src

④ Build the source in the current directory:

   ● sonar-ant-task-2.2.jar
   ---------------------------------------------------------------------------------------------------------------------
   The Nutch build depends on the sonar-ant-task-2.2.jar task package, but Ivy cannot load its resource definition
   org/sonar/ant/antlib.xml, so the jar has to be downloaded manually. Otherwise the following error appears:
       [taskdef] Could not load definitions from resource org/sonar/ant/antlib.xml. It could not be found.
       ivy-download:
       [taskdef] Could not load definitions from resource org/sonar/ant/antlib.xml. It could not be found.

   Download location:
       https://download.csdn.net/download/devalone/10554266

   Copy the downloaded and extracted sonar-ant-task-2.2.jar into the ant/lib/ directory:
       [devalone@nutch ant]$ sudo cp ~/sonar-ant-task-2.2.jar ant/lib/


   ● Edit ivy/ivysettings.xml to switch to faster Ivy repositories:
   ---------------------------------------------------------------------------------------------------------------------
   [devalone@nutch apache-nutch-1.14]$ vi ivy/ivysettings.xml

   Nutch's ivysettings.xml configures three Maven repositories:

       <property name="oss.sonatype.org"
       value="http://oss.sonatype.org/content/repositories/releases/"
       override="false"/>
      <property name="repo.maven.org"
       value="http://repo1.maven.org/maven2/"
       override="false"/>
      <property name="repository.apache.org"
       value="https://repository.apache.org/content/repositories/snapshots/"
       override="false"/>


   Replace these three repositories with faster ones, for example:

       http://maven.aliyun.com/nexus/content/groups/public/
       http://maven.aliyun.com/nexus/content/repositories/central/
       http://maven.aliyun.com/nexus/content/repositories/apache-snapshots/


   This example uses a Maven repository set up on the local network:

       <property name="oss.sonatype.org"
       value="http://repo.sansovo.org:8081/nexus/content/groups/public/"
       override="false"/>
      <property name="repo.maven.org"
       value="http://repo.sansovo.org:8081/nexus/content/groups/public/"
       override="false"/>
      <property name="repository.apache.org"
       value="http://repo.sansovo.org:8081/nexus/content/groups/public-snapshot/"
       override="false"/>
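
   The replacement can also be scripted instead of edited by hand. A sketch using sed on the three default URLs listed
   above. The mapping of each default repository to one of the Aliyun URLs and the function name use_mirrors are
   assumptions for illustration; the original file is kept as a .bak copy:

```shell
#!/usr/bin/env bash
# Sketch: rewrite the three default repository URLs in ivysettings.xml.
# use_mirrors is a hypothetical helper; the URL mapping is an assumption.
use_mirrors() {
    local settings="$1"
    cp "$settings" "$settings.bak"     # keep the original for rollback
    sed -e 's|http://oss.sonatype.org/content/repositories/releases/|http://maven.aliyun.com/nexus/content/groups/public/|' \
        -e 's|http://repo1.maven.org/maven2/|http://maven.aliyun.com/nexus/content/repositories/central/|' \
        -e 's|https://repository.apache.org/content/repositories/snapshots/|http://maven.aliyun.com/nexus/content/repositories/apache-snapshots/|' \
        "$settings.bak" > "$settings"
}

# Demonstration on a small stand-in file rather than the real ivy/ivysettings.xml.
demo_file="$(mktemp)"
cat > "$demo_file" <<'EOF'
<property name="repo.maven.org"
 value="http://repo1.maven.org/maven2/"
 override="false"/>
EOF
use_mirrors "$demo_file"
cat "$demo_file"
```

   For the real build the call would be use_mirrors ivy/ivysettings.xml, run from the source directory.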

   ● Run the build:
   ---------------------------------------------------------------------------------------------------------------------
   [devalone@nutch apache-nutch-1.14]$ ant
   Buildfile: /home/devalone/apache-nutch-1.14/build.xml
   Trying to override old definition of task javac

ivy-probe-antlib:

ivy-download:

ivy-download-unchecked:

ivy-init-antlib:

ivy-init:

init:

clean-default-lib:

   resolve-default:
   [ivy:resolve] :: Apache Ivy 2.5.0-rc1 - 20180412005306 :: http://ant.apache.org/ivy/ ::
   [ivy:resolve] :: loading settings :: file = /home/devalone/apache-nutch-1.14/ivy/ivysettings.xml
   [ivy:resolve] downloading http://repo.sansovo.org:8081/nexus/content/groups/public/org/apache/commons
   /commons-compress/1.14/commons-compress-1.14.jar ...
   [ivy:resolve] ........................................................................................
   ......................................... (517kB)
   [ivy:resolve] .. (0kB)
   [ivy:resolve]   [SUCCESSFUL ] org.apache.commons#commons-compress;1.14!commons-compress.jar (6470ms)
   [ivy:resolve] downloading http://repo.sansovo.org:8081/nexus/content/groups/public/org/apache/hadoop/
   hadoop-common/2.7.4/hadoop-common-2.7.4.jar ...
   [ivy:resolve] ........................................................................................
   ..............................................

   ....

BUILD SUCCESSFUL
Total time: 28 minutes 5 seconds

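   NOTE:
   ---------------------------------------------------------------------------------------------------------------------
   Try not to interrupt the first build: a partially downloaded Ivy dependency may be corrupted and cause strange
   build errors. If that does happen, delete the ~/.ivy2/cache/ directory and build again.

   On later builds the ~/.ivy2 cache is already complete, so the build finishes quickly, for example:

   ...
   BUILD SUCCESSFUL
   Total time: 35 seconds

⑤ When the build finishes, a runtime directory is created under the source directory. It contains the built binary
packages: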

   [devalone@nutch apache-nutch-1.14]$ ll runtime/
   總用量 0
   drwxrwxr-x 3 devalone devalone 46 7月 21 11:36 deploy
   drwxrwxr-x 7 devalone devalone 67 7月 21 11:37 local

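The deploy directory is the package for distributed mode, and the local directory is the package for local mode.
Install the local directory to a suitable location following the steps in "Installing from a binary distribution".

   NOTE:
   ---------------------------------------------------------------------------------------------------------------------
   Running ant clean removes the runtime directory, so back up your configuration files first.

2.3 Verifying the Nutch installation
-------------------------------------------------------------------------------------------------------------------------
Whichever installation method was used, run the following command: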

   [devalone@nutch ~]$ nutch
   nutch 1.14
   Usage: nutch COMMAND
   where COMMAND is one of:
      readdb            read / dump crawl db
      mergedb           merge crawldb-s, with optional filtering
      readlinkdb        read / dump link db
      inject            inject new urls into the database
      generate          generate new segments to fetch from crawl db
      freegen           generate new segments to fetch from text files
      fetch             fetch a segment's pages
      parse             parse a segment's pages
      readseg           read / dump segment data
      mergesegs         merge several segments, with optional filtering and slicing
   ...

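If the nutch version number, usage and brief command descriptions are printed, Nutch has been installed successfully
for local mode.

   NOTE:
   ---------------------------------------------------------------------------------------------------------------------
   It is best to map the local host name to its actual IP address to avoid unnecessary errors. The settings on this
   test machine are shown below (a proper DNS entry is even better):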

       [devalone@nutch ~]$ cat /etc/hosts
       127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4 nutch.sansovo.org nutch
       ::1         localhost localhost.localdomain localhost6 localhost6.localdomain6 nutch.sansovo.org nutch
       192.168.1.113   nutch.sansovo.org nutch

3. Crawling web sites
-------------------------------------------------------------------------------------------------------------------------
Before crawling, Nutch requires two pieces of configuration:

① Crawl properties: at minimum, the crawler's agent name.
② A seed list of URLs to crawl.

3.1 Configuring crawl properties
-------------------------------------------------------------------------------------------------------------------------
Default crawl properties are set in $NUTCH_INSTALL/conf/nutch-default.xml. In most cases this file should not be
edited; ideally make it read-only. nutch-default.xml contains every property together with its description, and
reading through them carefully helps in understanding and tuning Nutch.

$NUTCH_INSTALL/conf/nutch-site.xml holds custom crawl properties. Properties set in this file override the defaults in
$NUTCH_INSTALL/conf/nutch-default.xml, so all custom settings belong in $NUTCH_INSTALL/conf/nutch-site.xml.

● Nutch requires at least the http.agent.name property to be set before it runs. Open conf/nutch-site.xml and add
something like the following:

   [devalone@nutch nutch]$ vi conf/nutch-site.xml
   <property>
   <name>http.agent.name</name>
   <value>My Nutch Spider</value>
   </property>

The value can be any reasonable string, as long as it is not empty or whitespace only.
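
The snippet above shows only the property element. In the file itself, properties sit inside the <configuration> root
element, as in other Hadoop-style configuration files. A sketch that writes a minimal nutch-site.xml; the helper name
write_site_config and the target directory are assumptions, and the helper overwrites any existing file:

```shell
#!/usr/bin/env bash
# Sketch: write a minimal nutch-site.xml that sets only http.agent.name.
# write_site_config is a hypothetical helper; it overwrites the target file.
# The agent value is inserted as-is, so it must not contain XML special characters.
write_site_config() {
    local conf_dir="$1"
    local agent="$2"
    cat > "$conf_dir/nutch-site.xml" <<EOF
<?xml version="1.0"?>
<configuration>
   <property>
      <name>http.agent.name</name>
      <value>$agent</value>
   </property>
</configuration>
EOF
}

# Demonstration in a temporary directory; for a real install the
# directory would be $NUTCH_INSTALL/conf.
demo_conf="$(mktemp -d)"
write_site_config "$demo_conf" "My Nutch Spider"
cat "$demo_conf/nutch-site.xml"
```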

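   NOTE:
   ---------------------------------------------------------------------------------------------------------------------
   The choice of http.agent.name does matter. Some sites use the robots protocol or other mechanisms to block web
   crawlers. To deal with robots rules, http.agent.name can be set to the User-Agent request header value of a regular
   browser. There are many ways to obtain such a value, which are not covered here. Example: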

   <property>
      <name>http.agent.name</name>
      <value>Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:61.0) Gecko/20100101 Firefox/61.0</value>
      <description>HTTP 'User-Agent' request header. MUST NOT be empty -
      please set this to a single word uniquely related to your organization.

NOTE: You should also check other related properties:

       http.robots.agents
       http.agent.description
       http.agent.url
       http.agent.email
       http.agent.version

and set their values appropriately.

      </description>
   </property>

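   Or, on Linux:

       <value>Mozilla/5.0 (X11; Fedora; Linux x86_64; rv:60.0) Gecko/20100101 Firefox/60.0</value>

The Baidu resource platform at https://ziyuan.baidu.com/robots can be used to check and analyze the robots
configuration of a site.

● Also make sure the plugin.includes property contains the indexer plugin indexer-solr, which is needed later to build
the Solr index. In the default configuration file conf/nutch-default.xml the property is set to: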

   <property>
      <name>plugin.includes</name>
      <value>protocol-http|urlfilter-(regex|validator)|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|\
      urlnormalizer-(pass|regex|basic)</value>
      <description>Regular expression naming plugin directory names to
      include. Any plugin not matching this expression is excluded.
      In any case you need at least include the nutch-extensionpoints plugin. By
      default Nutch includes crawling just HTML and plain text via HTTP,
      and basic indexing and search plugins. In order to use HTTPS please enable
      protocol-httpclient, but be aware of possible intermittent problems with the
      underlying commons-httpclient library. Set parsefilter-naivebayes for classification based focused crawler.
      </description>
   </property>

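As shown, the indexer-solr plugin is already included, so the default setting can be kept.

   NOTE:
   ---------------------------------------------------------------------------------------------------------------------
   Note that plugin.includes does not contain protocol-httpclient. According to the property description, the current
   configuration should therefore support only http URLs, not https URLs. To enable https, the protocol-httpclient
   plugin would have to be added to the property, but the underlying commons-httpclient library is not very stable and
   may cause problems. For the sake of stability, protocol-httpclient is not enabled for now; it can be configured and
   tested carefully later if needed.

   However, what the plugin.includes description says about protocol-httpclient is not accurate. The test below
   deliberately adds an https URL to check how Nutch behaves and to draw a conclusion.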

3.2 Configuring the URL seed list
-------------------------------------------------------------------------------------------------------------------------
● The URL seed list contains the URLs of the web sites Nutch should crawl, one URL per line.
● The filter file conf/regex-urlfilter.txt uses regular expressions to filter and restrict the resources Nutch is
  allowed to crawl and download.

3.2.1 Creating the URL seed list
-------------------------------------------------------------------------------------------------------------------------
Create a dedicated directory for URL list files and create one or more text files in it; each text file is a seed
list. Edit the files and add the URLs of the sites to crawl, one URL per line.

Example:
   [devalone@nutch nutch]$ pwd
   /opt/nutch/nutch
   [devalone@nutch nutch]$ mkdir -p urls
   [devalone@nutch nutch]$ cd urls

   [devalone@nutch urls]$ vi seed.txt
   [devalone@nutch urls]$ cat seed.txt
   http://news.sohu.com/
   http://mil.sohu.com/
   https://blog.csdn.net/

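To test the https protocol, an https URL was deliberately added to the list: https://blog.csdn.net/

3.2.2 Configuring the regular expression filter (optional)
-------------------------------------------------------------------------------------------------------------------------
The regular expression filter is configured in conf/regex-urlfilter.txt and keeps the crawl within a controlled scope.
Without such limits Nutch could end up crawling the whole web, which would probably take weeks.

Every line that is neither a comment nor blank contains one regular expression.

Each regular expression starts with + or -:

● + : resources Nutch is allowed to crawl.
● - : resources Nutch must not crawl.

The first pattern in the file that matches a URL decides whether the URL is included in the crawl or ignored. If no
pattern in the file matches, the URL is ignored.

This example restricts Nutch to the sites in the seed list and ignores all other sites. Edit conf/regex-urlfilter.txt
and comment out the following rule so that it reads:

# accept anything else
# +.

Then add the following rules: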

+^https?://(news|mil)\.sohu\.com/
+^https?://blog\.csdn\.net/

   NOTE:
   ---------------------------------------------------------------------------------------------------------------------
   If regex-urlfilter.txt does not restrict the domains, every URL linked from the seed pages will be crawled and the
   number of URLs becomes uncontrollable.
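
The first-match rule can be illustrated with a small shell emulation. This is only a sketch of the semantics described
above, not the way Nutch evaluates the file: Nutch uses Java regular expressions, while the sketch uses bash extended
regular expressions, which behave the same for the simple patterns of this example. The function name url_allowed is
hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: emulate the first-match-wins rule of regex-urlfilter.txt.
# url_allowed is a hypothetical helper; it returns 0 when the URL is accepted.
url_allowed() {
    local rules_file="$1" url="$2" line sign pattern
    while IFS= read -r line || [ -n "$line" ]; do
        # Skip blank lines and comments.
        case "$line" in ''|'#'*) continue ;; esac
        sign="${line:0:1}"
        pattern="${line:1}"
        # Ignore lines that do not start with + or -.
        case "$sign" in +|-) ;; *) continue ;; esac
        if [[ "$url" =~ $pattern ]]; then
            # The first matching pattern decides.
            [ "$sign" = "+" ] && return 0
            return 1
        fi
    done < "$rules_file"
    return 1    # no pattern matched: the URL is ignored
}

# Stand-in rules file holding only the lines discussed in this section.
demo_rules="$(mktemp)"
cat > "$demo_rules" <<'EOF'
# accept anything else
# +.
+^https?://(news|mil)\.sohu\.com/
+^https?://blog\.csdn\.net/
EOF

for u in http://news.sohu.com/ https://blog.csdn.net/ http://www.example.com/; do
    if url_allowed "$demo_rules" "$u"; then echo "accept $u"; else echo "reject $u"; fi
done
```

The two seed-site URLs are accepted by the added rules, while http://www.example.com/ matches no pattern and is
therefore ignored.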



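4. Using Individual Commands for Whole-Web Crawling
-------------------------------------------------------------------------------------------------------------------------
Whole-Web crawling is designed to handle very large crawls, running on multiple machines, that may take weeks to
complete. Individual commands give more control over the process by running the crawl step by step. Note that
whole-web crawling does not mean crawling the entire World Wide Web. Filters and similar means can limit the URLs to
a manageable set and ignore uninteresting resources, as described in the previous section.

   NOTE:
   ---------------------------------------------------------------------------------------------------------------------
   Such jobs consume a great deal of resources and time. This section only explains the concepts; the crawl itself
   uses the small URL seed list and the filters configured in the previous section.

4.1 Step-by-Step: Concepts
-------------------------------------------------------------------------------------------------------------------------
Nutch data is composed of the following parts:

① The crawl database, or crawldb: contains information about every URL known to Nutch, including whether it has been
   fetched and, if so, when.

② The link database, or linkdb: contains the list of known links to each URL, including both the source URL and the
   anchor text of the link.

③ A set of segments: each segment is a set of URLs that are fetched as a unit. A segment is a directory with the
   following subdirectories:

   ● crawl_generate   : names a set of URLs to be fetched; file type SequenceFile
   ● crawl_fetch      : contains the status of fetching each URL; file type MapFile
   ● content          : contains the raw content retrieved from each URL; file type MapFile
   ● parse_text       : contains the parsed text of each URL; file type MapFile
   ● parse_data       : contains the outlinks and metadata parsed from each URL; file type MapFile
   ● crawl_parse      : contains the outlink URLs, used to update the crawldb; file type SequenceFile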
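
Which of these subdirectories exist gives a rough idea of how far a segment has been processed. A sketch that lists
them for a segment directory; the function name segment_status is hypothetical, and the demonstration uses an empty
stand-in directory tree rather than real segment data:

```shell
#!/usr/bin/env bash
# Sketch: report which of the six segment subdirectories are present.
# segment_status is a hypothetical helper, not a Nutch command.
segment_status() {
    local segment="$1" part
    for part in crawl_generate crawl_fetch content parse_text parse_data crawl_parse; do
        if [ -d "$segment/$part" ]; then
            echo "$part: present"
        else
            echo "$part: missing"
        fi
    done
}

# Stand-in segment with only three of the six subdirectories created.
demo_segment="$(mktemp -d)/20180723113116"
mkdir -p "$demo_segment/crawl_generate" "$demo_segment/crawl_fetch" "$demo_segment/content"
segment_status "$demo_segment"
```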

4.2 Step-by-Step: Seeding the crawldb with a list of URLs
-------------------------------------------------------------------------------------------------------------------------
This is the first step of the step-by-step process: the inject command adds the URL seed list to the crawldb. The step
is simple but important. For a whole-web crawl, the seed list should be chosen carefully.

DMOZ used to be the worldwide URL directory favored by web crawlers and search engines, but its site
http://www.dmoz.org/ was shut down permanently on 14 March 2017. Many similar, commercially run DMOZ-style sites have
appeared since, for example:

   http://www.dmoztools.net
   http://www.dmozdir.org
   http://www.chinadmoz.org
   http://www.webdmoz.org/
   http://www.cdmoz.com/
   ...

These sites list a large number of URLs from which a seed list suited to your own whole-web crawl can be collected.
Using Nutch itself or a simple script, all URLs listed on such a site can be gathered and used as the initial seed
list for Nutch.

4.2.1 Injecting the seeds:
-------------------------------------------------------------------------------------------------------------------------
Once the seed list is ready, inject it with the nutch inject command.

   NOTE:
   ---------------------------------------------------------------------------------------------------------------------
   Running nutch without any command prints a brief description of all commands. Running a command without arguments
   or options prints the usage of that command.

Brief description:
inject: inject new urls into the database

Usage:
[devalone@nutch nutch]$ bin/nutch inject
Usage: Injector [-D...] <crawldb> <url_dir> [-overwrite|-update] [-noFilter] [-noNormalize] [-filterNormalizeAll]

      <crawldb>     Path to a crawldb directory. If not present, a new one would be created.
      <url_dir>     Path to URL file or directory with URL file(s) containing URLs to be injected.
                   A URL file should have one URL per line, optionally followed by custom metadata.
                   Blank lines or lines starting with a '#' would be ignored. Custom metadata must
                   be of form 'key=value' and separated by tabs.
                   Below are reserved metadata keys:

                           nutch.score: A custom score for a url
                           nutch.fetchInterval: A custom fetch interval for a url
                           nutch.fetchInterval.fixed: A custom fetch interval for a url that is not changed by AdaptiveFetchSchedule

                   Example:
                   http://www.apache.org/
                   http://www.nutch.org/ \t nutch.score=10 \t nutch.fetchInterval=2592000 \t userType=open_source

-overwrite Overwite existing crawldb records by the injected records. Has precedence over 'update'
-update Update existing crawldb records with the injected records. Old metadata is preserved

   -nonormalize   Do not normalize URLs before injecting
   -nofilter      Do not apply URL filters to injected URLs
   -filterNormalizeAll
                   Normalize and filter all URLs including the URLs of existing CrawlDb records

   -D...          set or overwrite configuration property (property=value)
   -Ddb.update.purge.404=true
                   remove URLs with status gone (404) from CrawlDb
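
As the usage text says, custom metadata follows the URL as tab-separated key=value pairs; the \t in the built-in
example stands for a tab character. A sketch that writes such a seed file with printf so that real tabs end up in the
file. The file location and the score and interval values are illustrative assumptions:

```shell
#!/usr/bin/env bash
# Sketch: write a seed file whose second URL carries custom metadata.
# Fields are separated by real tab characters, produced by printf's \t.
demo_urls="$(mktemp -d)"
seed_file="$demo_urls/seed.txt"

printf '%s\n' 'http://news.sohu.com/' > "$seed_file"
# nutch.score and nutch.fetchInterval are reserved keys from the usage text;
# the values 10 and 86400 (seconds, i.e. one day) are illustrative only.
printf '%s\t%s\t%s\n' 'http://mil.sohu.com/' 'nutch.score=10' 'nutch.fetchInterval=86400' >> "$seed_file"

# Show the number of tab-separated fields per line to confirm the separators.
awk -F '\t' '{ print NF " field(s): " $1 }' "$seed_file"
```

A directory prepared this way is passed to inject as <url_dir>, in the same way as the urls directory used below.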

● Inject the URL seeds:

Change to the $NUTCH_INSTALL directory:

   [devalone@nutch nutch]$ pwd
   /opt/nutch/nutch
   [devalone@nutch nutch]$ ll
   drwxr-xr-x 2 devalone devalone     32 6月 29 16:52 bin
   -rw-rw-r-- 1 devalone devalone 106848 12月 19 2017 CHANGES.txt
   drwxr-xr-x 2 devalone devalone   4096 7月 21 16:31 conf
   drwxr-xr-x 3 devalone devalone     17 12月 19 2017 docs
   drwxr-xr-x 3 devalone devalone   8192 6月 29 16:52 lib
   -rw-rw-r-- 1 devalone devalone 329066 12月 19 2017 LICENSE.txt
   drwxrwxr-x 2 devalone devalone     24 7月 20 18:10 logs
   -rw-rw-r-- 1 devalone devalone    427 12月 19 2017 NOTICE.txt
   drwxr-xr-x 71 devalone devalone   4096 12月 19 2017 plugins
   drwxrwxr-x 2 devalone devalone     22 7月 21 16:25 urls

   Run the injection:
   [devalone@nutch nutch]$ bin/nutch inject crawl/crawldb urls
   Injector: starting at 2018-07-23 10:50:25
   Injector: crawlDb: crawl/crawldb
   Injector: urlDir: urls
   Injector: Converting injected urls to crawl db entries.
   Injector: overwrite: false
   Injector: update: false
   Injector: Total urls rejected by filters: 0
   Injector: Total urls injected after normalization and filtering: 3
   Injector: Total urls injected but already in CrawlDb: 0
   Injector: Total new urls injected: 3
   Injector: finished at 2018-07-23 10:50:29, elapsed: 00:00:03

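   □ The crawldb directory is crawl/crawldb under the current directory (Nutch creates it if it does not exist).
   □ The seed directory is the urls directory created earlier under the current directory. It already holds the seed
   list file, seed.txt in this example. The directory may contain several seed files or subdirectories; injection
   adds the seed files in the directory and its subdirectories to the crawldb.

4.2.2 Verifying the injection
-------------------------------------------------------------------------------------------------------------------------
After the injection, check the contents of the current directory: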

   [devalone@nutch nutch]$ ll crawl
   總用量 0
   drwxrwxr-x 4 devalone devalone 32 7月 23 10:50 crawldb

The crawl/crawldb directory has been created.

   [devalone@nutch nutch]$ ll crawl/crawldb/
   總用量 0
   drwxrwxr-x 3 devalone devalone 26 7月 23 10:50 current
   drwxrwxr-x 2 devalone devalone 6 7月 23 10:50 old

Two subdirectories were created under it: current and old. Look further into current:

   [devalone@nutch nutch]$ ll crawl/crawldb/current/
   總用量 0
   drwxrwxr-x 2 devalone devalone 66 7月 23 10:50 part-r-00000
   [devalone@nutch nutch]$ ll crawl/crawldb/current/part-r-00000/
   總用量 8
   -rw-r--r-- 1 devalone devalone 263 7月 23 10:50 data
   -rw-r--r-- 1 devalone devalone 213 7月 23 10:50 index

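The data file (data) and the index file (index) have been created and are not empty, so the data was injected into
the database and is in good shape.

To check the data in the database further, use the readdb command, for example: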

   [devalone@nutch nutch]$ bin/nutch readdb crawl/crawldb -dump crawldb-dump
   CrawlDb dump: starting
   CrawlDb db: crawl/crawldb
   CrawlDb dump: done

   [devalone@nutch nutch]$ ll crawldb-dump/
   總用量 4
   -rw-r--r-- 1 devalone devalone 744 7月 23 11:10 part-00000
   [devalone@nutch nutch]$ cat crawldb-dump/part-00000
   http://mil.sohu.com/    Version: 7
   Status: 1 (db_unfetched)
   Fetch time: Mon Jul 23 10:50:25 CST 2018
   Modified time: Thu Jan 01 08:00:00 CST 1970
   Retries since fetch: 0
   Retry interval: 2592000 seconds (30 days)
   Score: 1.0
   Signature: null
   Metadata:

   http://news.sohu.com/   Version: 7
   Status: 1 (db_unfetched)
   Fetch time: Mon Jul 23 10:50:25 CST 2018
   Modified time: Thu Jan 01 08:00:00 CST 1970
   Retries since fetch: 0
   Retry interval: 2592000 seconds (30 days)
   Score: 1.0
   Signature: null
   Metadata:

   https://blog.csdn.net/ Version: 7
   Status: 1 (db_unfetched)
   Fetch time: Mon Jul 23 10:50:25 CST 2018
   Modified time: Thu Jan 01 08:00:00 CST 1970
   Retries since fetch: 0
   Retry interval: 2592000 seconds (30 days)
   Score: 1.0
   Signature: null
   Metadata:

The 3 initial seed URLs are in the database, in good state.
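
For larger dumps, reading the file by eye is impractical. A sketch that tallies the records per status from a dump
file; the helper name count_status is hypothetical, and it relies on the "Status:" lines having the format shown in
the dump above:

```shell
#!/usr/bin/env bash
# Sketch: tally records per status in a readdb -dump text file.
# count_status is a hypothetical helper; it relies on the "Status:" lines
# having the format shown in the dump above.
count_status() {
    awk '$1 == "Status:" { n[$2 " " $3]++ }
         END { for (s in n) print s ": " n[s] }' "$1" | sort
}

# Stand-in dump with the same line format as crawldb-dump/part-00000.
demo_dump="$(mktemp)"
cat > "$demo_dump" <<'EOF'
http://mil.sohu.com/    Version: 7
Status: 1 (db_unfetched)
Score: 1.0

http://news.sohu.com/   Version: 7
Status: 1 (db_unfetched)
Score: 1.0
EOF
count_status "$demo_dump"
```

For the dump created above, the call would be count_status crawldb-dump/part-00000.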

4.3 Step-by-Step: Fetching
-------------------------------------------------------------------------------------------------------------------------

4.3.1 Generating the fetch list
-------------------------------------------------------------------------------------------------------------------------
To fetch, first generate a fetch list from the crawl database with the nutch generate command.

Brief description:
generate: generate new segments to fetch from crawl db

Usage:
Usage: Generator <crawldb> <segments_dir> [-force] [-topN N] [-numFetchers numFetchers] [-expr <expr>] \
[-adddays <numDays>] [-noFilter] [-noNorm] [-maxNumSegments <num>]

Generate the fetch list:
   [devalone@nutch nutch]$ bin/nutch generate crawl/crawldb crawl/segments
   Generator: starting at 2018-07-23 11:31:13
   Generator: Selecting best-scoring urls due for fetch.
   Generator: filtering: true
   Generator: normalizing: true
   Generator: running in local mode, generating exactly one partition.
   Generator: Partitioning selected urls for politeness.
   Generator: segment: crawl/segments/20180723113116
   Generator: finished at 2018-07-23 11:31:17, elapsed: 00:00:04

   [devalone@nutch nutch]$ ll crawl
   總用量 0
   drwxrwxr-x 4 devalone devalone 32 7月 23 11:31 crawldb
   drwxrwxr-x 3 devalone devalone 28 7月 23 11:31 segments

   [devalone@nutch nutch]$ ll crawl/segments/
   總用量 0
   drwxrwxr-x 3 devalone devalone 28 7月 23 11:31 20180723113116

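A fetch list has been generated for all pages due to be fetched. It is placed in a newly created segment directory.
Segment directories are named by creation time; save the segment name in the shell variable s1 for later use: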

   [devalone@nutch nutch]$ s1=`ls -d crawl/segments/2* | tail -1`
   [devalone@nutch nutch]$ echo $s1
   crawl/segments/20180723113116
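
The ls | tail pipeline works because segment names are timestamps and therefore sort chronologically. A slightly more
defensive sketch that uses a shell glob instead of parsing ls output and fails when no segment exists; the function
name latest_segment is hypothetical:

```shell
#!/usr/bin/env bash
# Sketch: print the newest segment directory under a segments directory.
# latest_segment is a hypothetical helper. Globs expand in sorted order and
# segment names are timestamps, so the last match is the newest one.
latest_segment() {
    local segments_dir="$1" dir newest=""
    for dir in "$segments_dir"/2*/; do
        [ -d "$dir" ] && newest="${dir%/}"
    done
    [ -n "$newest" ] || return 1
    echo "$newest"
}

# Stand-in directories instead of real segments.
demo_segments="$(mktemp -d)"
mkdir "$demo_segments/20180723113116" "$demo_segments/20180723125501"
s1="$(latest_segment "$demo_segments")"
echo "$s1"
```

In the crawl directory of this example the call would be s1="$(latest_segment crawl/segments)".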

4.3.2 Running the fetch
-------------------------------------------------------------------------------------------------------------------------
Pages are fetched with the nutch fetch command.

Brief description:
fetch: fetch a segment's pages

Usage:
Usage: Fetcher <segment> [-threads n]

Run the fetch: pass the segment name stored in the shell variable to the fetch command, using the default number of
threads
   [devalone@nutch nutch]$ bin/nutch fetch $s1
   Fetcher: starting at 2018-07-23 11:53:50
   Fetcher: segment: crawl/segments/20180723113116
   Fetcher: threads: 10
   Fetcher: time-out divisor: 2
   ...

Fetcher: finished at 2018-07-23 11:53:54, elapsed: 00:00:03

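Fetching is complete. The output shows that the default number of fetcher threads is 10.

4.3.3 Running the parse
-------------------------------------------------------------------------------------------------------------------------
Fetched pages have to be parsed before they can be stored in the database. Parsing is done with the nutch parse
command.

Brief description:
parse: parse a segment's pages

Usage:
Usage: ParseSegment segment [-noFilter] [-noNormalize]

Run the parse: pass the segment name to the parse command: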
   [devalone@nutch nutch]$ nutch parse $s1
   ParseSegment: starting at 2018-07-23 12:06:29
   ParseSegment: segment: crawl/segments/20180723113116
   Parsed (234ms):http://mil.sohu.com/
   Parsed (46ms):http://news.sohu.com/
   Parsed (35ms):https://blog.csdn.net/
   ParseSegment: finished at 2018-07-23 12:06:31, elapsed: 00:00:01

The 3 fetched pages have been parsed.

4.3.4 Updating the crawldb
-------------------------------------------------------------------------------------------------------------------------
When parsing is done, the fetch results are written back to the crawldb. This is the last step of a crawl cycle. The
command for updating the database is nutch updatedb.

Brief description:
updatedb: update crawl db from segments after fetching

Usage:
       Usage: CrawlDb <crawldb> (-dir <segments> | <seg1> <seg2> ...) [-force] [-normalize] [-filter] [-noAdditions]
        crawldb CrawlDb to update
        -dir segments   parent directory containing all segments to update from
        seg1 seg2 ...   list of segment names to update from
        -force force update even if CrawlDb appears to be locked (CAUTION advised)
        -normalize      use URLNormalizer on urls in CrawlDb and segment (usually not needed)
        -filter use URLFilters on urls in CrawlDb and segment
        -noAdditions    only update already existing URLs, don't add any newly discovered URLs

Update the crawldb: pass the crawldb directory and the segment name to the updatedb command

   [devalone@nutch nutch]$ nutch updatedb crawl/crawldb $s1
   CrawlDb update: starting at 2018-07-23 12:15:17
   CrawlDb update: db: crawl/crawldb
   CrawlDb update: segments: [crawl/segments/20180723113116]
   CrawlDb update: additions allowed: true
   CrawlDb update: URL normalizing: false
   CrawlDb update: URL filtering: false
   CrawlDb update: 404 purging: false
   CrawlDb update: Merging segment data into db.
   CrawlDb update: finished at 2018-07-23 12:15:20, elapsed: 00:00:02

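The update is complete. The crawldb now holds updated entries for all pages of the initial seed list as well as
entries for the links newly discovered on those pages. Use readdb to take a quick look at the database. If the output
directory already exists, it must be deleted first, otherwise the command fails; this is common to Hadoop-based
projects: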

   [devalone@nutch nutch]$ rm -rf crawldb-dump/
   [devalone@nutch nutch]$ bin/nutch readdb crawl/crawldb -dump crawldb-dump
   CrawlDb dump: starting
   CrawlDb db: crawl/crawldb
   CrawlDb dump: done

   [devalone@nutch nutch]$ ll crawldb-dump/
   總用量 16
   -rw-r--r-- 1 devalone devalone 15862 7月 23 12:22 part-00000

   [devalone@nutch nutch]$ cat crawldb-dump/part-00000
   http://mil.sohu.com/    Version: 7
   Status: 2 (db_fetched)
   Fetch time: Wed Aug 22 11:53:53 CST 2018
   Modified time: Mon Jul 23 11:53:53 CST 2018
   Retries since fetch: 0
   Retry interval: 2592000 seconds (30 days)
   Score: 1.75
   Signature: 89f36893e295bc0b08f1b1e36bb0c49d
   Metadata:
           _pst_=success(1), lastModified=0
           _rs_=16
           Content-Type=text/html
           nutch.protocol.code=200

   http://mil.sohu.com/1468        Version: 7
   Status: 1 (db_unfetched)
   Fetch time: Mon Jul 23 12:15:19 CST 2018
   Modified time: Thu Jan 01 08:00:00 CST 1970
   Retries since fetch: 0
   Retry interval: 2592000 seconds (30 days)
   Score: 0.16666667
   Signature: null
   Metadata:

   http://mil.sohu.com/1469        Version: 7
   Status: 1 (db_unfetched)
   Fetch time: Mon Jul 23 12:15:19 CST 2018
   Modified time: Thu Jan 01 08:00:00 CST 1970
   Retries since fetch: 0
   Retry interval: 2592000 seconds (30 days)
   Score: 0.16666667
   Signature: null
   Metadata:

...
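
The dump is plain text, so its entries can be tallied with standard tools. The following is a minimal sketch, assuming the
layout shown above (one "Status: <code> (<name>)" line per entry). The function name dump_status_counts is made up for
this example and is not part of nutch.

```shell
# Count crawldb dump entries per status.
# Assumption: the text layout shown above, i.e. each entry has one line "Status: <code> (<name>)".
dump_status_counts() {
  awk '$1 == "Status:" {
         name = $3
         gsub(/[()]/, "", name)      # "(db_fetched)" -> "db_fetched"
         count[name]++
       }
       END { for (n in count) print n, count[n] }' "$1" | sort
}
```

It could be called as: dump_status_counts crawldb-dump/part-00000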

The database now holds far more than the 3 records it had before. The -stats option of readdb prints summary statistics:

   [devalone@nutch nutch]$ bin/nutch readdb crawl/crawldb -stats
   CrawlDb statistics start: crawl/crawldb
   Statistics for CrawlDb: crawl/crawldb
   TOTAL urls:     57
   shortest fetch interval:        30 days, 00:00:00
   avg fetch interval:     30 days, 00:00:00
   longest fetch interval: 30 days, 00:00:00
   earliest fetch time:    Mon Jul 23 12:15:00 CST 2018
   avg of fetch times:     Wed Jul 25 02:07:00 CST 2018
   latest fetch time:      Wed Aug 22 11:53:00 CST 2018
   retry 0:        57
   score quantile 0.01:    0.010204081423580647
   score quantile 0.05:    0.010204081423580647
   score quantile 0.1:     0.010204081423580647
   score quantile 0.2:     0.010204081423580647
   score quantile 0.25:    0.010204081423580647
   score quantile 0.3:     0.016326530277729012
   score quantile 0.4:     0.020408162847161293
   score quantile 0.5:     0.020408162847161293
   score quantile 0.6:     0.030612245202064514
   score quantile 0.7:     0.030612245202064514
   score quantile 0.75:    0.030612245202064514
   score quantile 0.8:     0.0346938773989678
   score quantile 0.9:     0.1666666716337204
   score quantile 0.95:    0.750765299797057
   score quantile 0.99:    1.7033333361148832
   min score:      0.010204081423580647
   avg score:      0.10526320808812191
   max score:      1.75
   status 1 (db_unfetched):        54
   status 2 (db_fetched): 3
   CrawlDb statistics: done

The crawldb now holds 57 URLs. Three of them have status 2 (db_fetched): the 3 seed URLs have been fetched. The other 54
have status 1 (db_unfetched): they are links newly discovered on the 3 seed pages and have not been fetched yet.
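
When the same check is repeated after every round, it helps to reduce the -stats output to just the status counters. This
is a small sketch, assuming the "status N (name): count" lines look as shown above; stats_summary is a hypothetical helper
name.

```shell
# Reduce `readdb -stats` output (read from stdin) to "name=count" lines.
# Assumption: status lines have the form "status <code> (<name>): <count>".
stats_summary() {
  awk '/^[ \t]*status [0-9]+ \(/ {
         name = $3
         gsub(/[():]/, "", name)     # "(db_unfetched):" -> "db_unfetched"
         print name "=" $NF          # the count is the last field
       }'
}
```

Usage, under the same assumption: bin/nutch readdb crawl/crawldb -stats | stats_summary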

4.3.5 Running the second fetch round
-------------------------------------------------------------------------------------------------------------------------
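To gather enough test data, run the fetch cycle again, this time generating and fetching a segment with the 1000
top-scoring pages. The crawldb currently has only 54 unfetched pages, so -topN 1000 does not actually limit anything in
this round, but it does no harm.

   [devalone@nutch nutch]$ nutch generate crawl/crawldb crawl/segments -topN 1000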

   [devalone@nutch nutch]$ s2=`ls -d crawl/segments/2* | tail -1`
   [devalone@nutch nutch]$ echo $s2
   crawl/segments/20180723124830

   [devalone@nutch nutch]$ nutch fetch $s2
   Fetcher: starting at 2018-07-23 12:50:27
   Fetcher: segment: crawl/segments/20180723124830
   Fetcher: threads: 10

   ...
   Fetcher: finished at 2018-07-23 12:54:37, elapsed: 00:04:10

   Elapsed time: 00:04:10

   [devalone@nutch nutch]$ nutch parse $s2
   [devalone@nutch nutch]$ nutch updatedb crawl/crawldb $s2
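
The four commands of one round (generate, fetch, parse, updatedb) can be wrapped in a function so that further rounds are
a single call. This is only a sketch of the manual steps above, not the crawl script shipped with nutch. The function name
crawl_round and the NUTCH variable are assumptions made for this example.

```shell
# Path to the nutch launcher; "nutch" assumes it is on PATH as in the transcript above.
NUTCH="${NUTCH:-nutch}"

# Run one generate -> fetch -> parse -> updatedb round.
# usage: crawl_round <crawldb> <segments_dir> <topN>
crawl_round() {
  crawldb="$1"; segments="$2"; topn="$3"
  "$NUTCH" generate "$crawldb" "$segments" -topN "$topn" || return 1
  # Segment names are timestamps, so the last one in sorted order is the newest.
  seg=$(ls -d "$segments"/2* | sort | tail -n 1)
  "$NUTCH" fetch "$seg" &&
  "$NUTCH" parse "$seg" &&
  "$NUTCH" updatedb "$crawldb" "$seg"
}
```

For example: crawl_round crawl/crawldb crawl/segments 1000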

Check the database status:
   [devalone@nutch nutch]$ bin/nutch readdb crawl/crawldb -stats
   CrawlDb statistics start: crawl/crawldb
   Statistics for CrawlDb: crawl/crawldb
   TOTAL urls:     1361
   shortest fetch interval:        30 days, 00:00:00
   avg fetch interval:     30 days, 00:15:52
   longest fetch interval: 45 days, 00:00:00
   earliest fetch time:    Mon Jul 23 12:57:00 CST 2018
   avg of fetch times:     Tue Jul 24 18:51:00 CST 2018
   latest fetch time:      Thu Sep 06 12:50:00 CST 2018
   retry 0:        1360
   retry 1:        1
   score quantile 0.01:    1.6222300565069806E-4
   score quantile 0.05:    2.0923737523844465E-4
   score quantile 0.1:     2.170994079538754E-4
   score quantile 0.2:     3.1386775372084226E-4
   score quantile 0.25:    3.2318823286914267E-4
   score quantile 0.3:     3.3286939081000655E-4
   score quantile 0.4:     4.436204064404592E-4
   score quantile 0.5:     6.01460276577141E-4
   score quantile 0.6:     6.462093988934962E-4
   score quantile 0.7:     8.13645781966409E-4
   score quantile 0.75:    9.000660493791871E-4
   score quantile 0.8:     9.7666619066149E-4
   score quantile 0.9:     0.0013502709909944855
   score quantile 0.95:    0.0023547878954559565
   score quantile 0.99:    0.03270121983119422
   min score:      1.0855405707843602E-4
   avg score:      0.005894269950244464
   max score:      1.8964647054672241
   status 1 (db_unfetched):        1305
   status 2 (db_fetched): 55
   status 3 (db_gone):     1
   CrawlDb statistics: done

The total has grown to 1361 URLs: 55 fetched, 1305 newly discovered and not yet fetched, and 1 with status 3 (db_gone).

4.3.6 Running the third fetch round
-------------------------------------------------------------------------------------------------------------------------
The crawldb holds 1305 unfetched URLs, so this time -topN 1000 takes effect and limits the generated segment to the 1000
top-scoring URLs.

   [devalone@nutch nutch]$ nutch generate crawl/crawldb crawl/segments -topN 1000
   [devalone@nutch nutch]$ ll crawl/segments/
   總用量 0
   drwxrwxr-x 8 devalone devalone 117 7月 23 12:06 20180723113116
   drwxrwxr-x 8 devalone devalone 117 7月 23 12:56 20180723124830
   drwxrwxr-x 3 devalone devalone 28 7月 23 13:04 20180723130401
   [devalone@nutch nutch]$ s3=`ls -d crawl/segments/2* | tail -1`
   [devalone@nutch nutch]$ echo $s3
   crawl/segments/20180723130401

   [devalone@nutch nutch]$ nutch fetch $s3
   Fetcher: starting at 2018-07-23 13:06:07
   Fetcher: segment: crawl/segments/20180723130401
   Fetcher: threads: 10
   ...

   Fetcher: finished at 2018-07-23 14:32:43, elapsed: 01:26:36

This fetch took 01:26:36.

   NOTE:
   ---------------------------------------------------------------------------------------------------------------------
   For a long-running step like this, the command can be run under nohup, for example:

       [devalone@nutch nutch]$ nohup nutch fetch $s3 &

   [devalone@nutch nutch]$ nutch parse $s3
   [devalone@nutch nutch]$ nutch updatedb crawl/crawldb $s3

   [devalone@nutch nutch]$ nutch readdb crawl/crawldb -stats
   CrawlDb statistics start: crawl/crawldb
   Statistics for CrawlDb: crawl/crawldb
   TOTAL urls:     18442
   shortest fetch interval:        30 days, 00:00:00
   avg fetch interval:     30 days, 00:01:10
   longest fetch interval: 45 days, 00:00:00
   earliest fetch time:    Mon Jul 23 12:57:00 CST 2018
   avg of fetch times:     Wed Jul 25 07:35:00 CST 2018
   latest fetch time:      Thu Sep 06 12:50:00 CST 2018
   retry 0:        18433
   retry 1:        9
   score quantile 0.01:    3.363983068993548E-6
   score quantile 0.05:    4.0650803027780635E-6
   score quantile 0.1:     5.252996839943921E-6
   score quantile 0.2:     8.687948599238534E-6
   score quantile 0.25:    1.0351363643525419E-5
   score quantile 0.3:     1.1638136404260567E-5
   score quantile 0.4:     1.5087516350721173E-5
   score quantile 0.5:     1.956822216838666E-5
   score quantile 0.6:     2.4080784436340625E-5
   score quantile 0.7:     2.9829692724346616E-5
   score quantile 0.75:    3.42503072351974E-5
   score quantile 0.8:     4.092971121440248E-5
   score quantile 0.9:     8.750581586759325E-5
   score quantile 0.95:    4.3720556372452434E-4
   score quantile 0.99:    0.001597540448419741
   min score:      0.0
   avg score:      4.8001176325052655E-4
   max score:      1.8964647054672241
   status 1 (db_unfetched):        17394
   status 2 (db_fetched): 1039
   status 3 (db_gone):     1
   status 4 (db_redir_temp):       2
   status 5 (db_redir_perm):       6
   CrawlDb statistics: done

The total has grown to 18442 URLs: 1039 fetched, 17394 newly discovered and not yet fetched, 1 with status 3 (db_gone),
and 8 redirects (2 temporary, 6 permanent).

■ Verifying that nutch fetched https URLs
-------------------------------------------------------------------------------------------------------------------------
Dump the URLs whose status is db_fetched:

   [devalone@nutch nutch]$ nutch readdb crawl/crawldb/ -dump crawldb-dump -status db_fetched
   CrawlDb dump: starting
   CrawlDb db: crawl/crawldb/
   CrawlDb dump: done

   [devalone@nutch nutch]$ ll crawldb-dump/part-00000
   -rw-r--r-- 1 devalone devalone 437326 7月 24 16:12 crawldb-dump/part-00000

Inspect the dumped contents:
   [devalone@nutch nutch]$ cat crawldb-dump/part-00000
   https://blog.csdn.net/10km/article/details/81147355     Version: 7
   Status: 2 (db_fetched)
   Fetch time: Wed Aug 22 13:32:49 CST 2018
   Modified time: Mon Jul 23 13:32:49 CST 2018
   Retries since fetch: 0
   Retry interval: 2592000 seconds (30 days)
   Score: 6.664876E-4
   Signature: b59a0e8cd511f1090489db04f3518a2b
   Metadata:
           _pst_=success(1), lastModified=1531722549000
           _rs_=147
           Content-Type=text/html
           nutch.protocol.code=200

   https://blog.csdn.net/A_B_C_D_E______/article/details/81142260 Version: 7
   Status: 2 (db_fetched)
   Fetch time: Wed Aug 22 13:11:52 CST 2018
   Modified time: Mon Jul 23 13:11:52 CST 2018
   Retries since fetch: 0
   Retry interval: 2592000 seconds (30 days)
   Score: 3.4539928E-4
   Signature: 97a758b6c2f999204168cbc817dd1a88
   Metadata:
           _pst_=success(1), lastModified=1531722549000
           _rs_=156
           Content-Type=text/html
           nutch.protocol.code=200

   https://blog.csdn.net/Ancony_/article/details/81141073 Version: 7
   Status: 2 (db_fetched)
   Fetch time: Wed Aug 22 14:17:03 CST 2018
   Modified time: Mon Jul 23 14:17:03 CST 2018
   Retries since fetch: 0
   Retry interval: 2592000 seconds (30 days)
   Score: 3.4383315E-4
   Signature: f9050a1c2ff299c7313d052114a4115d
   Metadata:
           _pst_=success(1), lastModified=1531722549000
           _rs_=175
           Content-Type=text/html
           nutch.protocol.code=200

...

The database contains a good number of fetched https URLs, which shows that the default plugin.includes setting handles
both http and https URLs; there is no need to add protocol-httpclient.

   ● NOTE:
   ---------------------------------------------------------------------------------------------------------------------
   The value of plugin.includes is a regular expression. The default configuration file nutch-default.xml sets it to:

   <property>
      <name>plugin.includes</name>
      <value>protocol-http|urlfilter-(regex|validator)|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|\
      urlnormalizer-(pass|regex|basic)</value>
      <description>Regular expression naming plugin directory names to
      include. Any plugin not matching this expression is excluded.
      In any case you need at least include the nutch-extensionpoints plugin. By
      default Nutch includes crawling just HTML and plain text via HTTP,
      and basic indexing and search plugins. In order to use HTTPS please enable
      protocol-httpclient, but be aware of possible intermittent problems with the
      underlying commons-httpclient library. Set parsefilter-naivebayes for classification based focused crawler.
      </description>
   </property>


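   Note that the protocol-http alternative does not pull in protocol-httpclient: Nutch appears to match each plugin id
   against the whole expression (a full match, not a substring search), so only the protocol-http plugin is loaded here.
   The https pages above were therefore fetched by protocol-http itself, which in this version evidently handles https
   as well. The remark about protocol-httpclient in the <description> element seems to be left over from earlier
   versions. It is confusing but does not affect how nutch runs, so it can be ignored.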
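
Whether an expression such as the default plugin.includes value covers a given plugin id depends on how it is matched.
The sketch below assumes a full match of the plugin id against the whole expression (the behaviour of Java's
Matcher.matches()); grep -x is used here to imitate that. plugin_included is a hypothetical helper, not a nutch command.

```shell
# Return success if plugin id $2 is matched in full by the plugin.includes expression $1.
# Assumption: Nutch compares the whole plugin id with the expression (full match).
plugin_included() {
  printf '%s\n' "$2" | grep -Eqx -- "$1"
}

# Default value from nutch-default.xml as quoted above, joined into one line.
includes='protocol-http|urlfilter-(regex|validator)|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)'

for id in protocol-http protocol-httpclient parse-tika; do
  if plugin_included "$includes" "$id"; then echo "$id: included"; else echo "$id: excluded"; fi
done
```

Under a full match, protocol-http and parse-tika are included while protocol-httpclient is not; a substring search would
give a different answer for protocol-httpclient.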

4.4 Step-by-Step: Deleting Duplicates
-------------------------------------------------------------------------------------------------------------------------
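After fetching, duplicate entries need to be dealt with so that each document is indexed only once. Duplicates are
detected by content signature, so two different URLs that return the same content count as duplicates.

Duplicates are marked with the nutch dedup command.

Command summary: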

   dedup：deduplicate entries in the crawldb and give them a special status

Detailed description:
-------------------------------------------------------------------------------------------------------------------------
The dedup command is an alias for the class org.apache.nutch.crawl.DeduplicationJob and is available since Nutch 1.8 (Not
yet in Nutch 2.x as of May 2014).

This command takes a path to a crawldb as parameter and finds duplicates based on the signature. If several entries share
the same signature, the one with the highest score is kept. If the scores are the same, then the fetch time is used to
determine which one to keep with the most recent one being kept. If their fetch times are the same we keep the one with the
shortest URL. The entries which are not kept have their status changed to STATUS_DB_DUPLICATE, this is then used by the
Cleaning and Indexing jobs to delete the corresponding documents in the backends (SOLR, Elasticsearch).

The dedup command replaces the SOLR dedup command which was limited to SOLR and was a lot less efficient.

Since Nutch 1.8 the dedup command replaces the SOLR dedup command, and its usage differs. Run it before indexing into
Solr so that duplicate documents do not end up in the Solr index.
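
The selection rule quoted above (highest score, then most recent fetch time, then shortest URL) can be illustrated on a
few lines of text. This is a sketch of the rule only, not the DeduplicationJob code; the input format and the name
dedup_keep are invented for the example.

```shell
# stdin:  one entry per line: <signature> <score> <fetch_time_epoch> <url>
# stdout: "<signature> <url kept>" for each signature, sorted
dedup_keep() {
  awk '{
         sig = $1; score = $2 + 0; ft = $3 + 0; url = $4
         better = !(sig in keep) ||
                  score > s[sig] ||
                  (score == s[sig] && ft > t[sig]) ||
                  (score == s[sig] && ft == t[sig] && length(url) < length(keep[sig]))
         if (better) { keep[sig] = url; s[sig] = score; t[sig] = ft }
       }
       END { for (sig in keep) print sig, keep[sig] }' | sort
}
```

Every entry that is not printed for its signature corresponds to one that dedup would mark as db_duplicate.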

Usage:
   Usage: DeduplicationJob <crawldb> [-group <none|host|domain>] [-compareOrder <score>,<fetchTime>,<urlLength>]


Mark the duplicates:

   [devalone@nutch nutch]$ nutch dedup crawl/crawldb
   DeduplicationJob: starting at 2018-07-24 12:10:16
   Deduplication: 1 documents marked as duplicates
   Deduplication: Updating status of duplicate urls into crawl db.
   Deduplication finished at 2018-07-24 12:10:22, elapsed: 00:00:05

Check the crawldb status after deduplication:
   [devalone@nutch nutch]$ nutch readdb crawl/crawldb -stats
   CrawlDb statistics start: crawl/crawldb
   Statistics for CrawlDb: crawl/crawldb
   TOTAL urls:     18442
   shortest fetch interval:        30 days, 00:00:00
   avg fetch interval:     30 days, 00:01:10
   longest fetch interval: 45 days, 00:00:00
   earliest fetch time:    Mon Jul 23 12:57:00 CST 2018
   avg of fetch times:     Wed Jul 25 07:35:00 CST 2018
   latest fetch time:      Thu Sep 06 12:50:00 CST 2018
   retry 0:        18433
   retry 1:        9
   score quantile 0.01:    3.363983068993548E-6
   score quantile 0.05:    4.0650803027780635E-6
   score quantile 0.1:     5.252996839943921E-6
   score quantile 0.2:     8.687948599238534E-6
   score quantile 0.25:    1.0351363643525419E-5
   score quantile 0.3:     1.1638136404260567E-5
   score quantile 0.4:     1.5087516350721173E-5
   score quantile 0.5:     1.956822216838666E-5
   score quantile 0.6:     2.4080784436340625E-5
   score quantile 0.7:     2.9829692724346616E-5
   score quantile 0.75:    3.42503072351974E-5
   score quantile 0.8:     4.092971121440248E-5
   score quantile 0.9:     8.750581586759325E-5
   score quantile 0.95:    4.3720556372452434E-4
   score quantile 0.99:    0.001597540448419741
   min score:      0.0
   avg score:      4.8001176325052655E-4
   max score:      1.8964647054672241
   status 1 (db_unfetched):        17394
   status 2 (db_fetched): 1038
   status 3 (db_gone):     1
   status 4 (db_redir_temp):       2
   status 5 (db_redir_perm):       6
   status 7 (db_duplicate):        1
   CrawlDb statistics: done

The number of fetched URLs has dropped to 1038, and 1 entry now has status 7 (db_duplicate).

4.5 Step-by-Step: Invertlinks
-------------------------------------------------------------------------------------------------------------------------
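Before indexing, all links must be inverted so that the anchor text of incoming links can be indexed with each page.
Link inversion uses the invertlinks command.

Command summary:
   invertlinks: create a linkdb from parsed segments

Usage: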
       Usage: LinkDb <linkdb> (-dir <segmentsDir> | <seg1> <seg2> ...) [-force] [-noNormalize] [-noFilter]
        linkdb output LinkDb to create or update
        -dir segmentsDir        parent directory of several segments, OR
        seg1 seg2 ...    list of segment directories
        -force force update even if LinkDb appears to be locked (CAUTION advised)
        -noNormalize    don't normalize link URLs
        -noFilter       don't apply URLFilters to link URLs

Invert the links, with crawl/linkdb as the link database and crawl/segments as the segments directory:

   [devalone@nutch nutch]$ nutch invertlinks crawl/linkdb -dir crawl/segments
   LinkDb: starting at 2018-07-23 15:19:10
   LinkDb: linkdb: crawl/linkdb
   LinkDb: URL normalize: true
   LinkDb: URL filter: true
   LinkDb: internal links will be ignored.
   LinkDb: adding segment: file:/opt/nutch/apache-nutch-1.14/crawl/segments/20180723113116
   LinkDb: adding segment: file:/opt/nutch/apache-nutch-1.14/crawl/segments/20180723124830
   LinkDb: adding segment: file:/opt/nutch/apache-nutch-1.14/crawl/segments/20180723130401
   LinkDb: finished at 2018-07-23 15:19:13, elapsed: 00:00:02

4.6 Step-by-Step: Indexing into Apache Solr
-------------------------------------------------------------------------------------------------------------------------
Solr must be installed before running this step.

   NOTE:
   ---------------------------------------------------------------------------------------------------------------------
   Section 6 below covers installing Apache Solr; for now assume it is already installed.

Indexing into Apache Solr uses the index command.

Command summary:
   index: run the plugin-based indexer on parsed segments and linkdb

Usage:
   Usage: Indexer <crawldb> [-linkdb <linkdb>] [-params k1=v1&k2=v2...] (<segment> ... | -dir <segments>) [-noCommit]\
   [-deleteGone] [-filter] [-normalize] [-addBinaryContent] [-base64]
   Active IndexWriters :
   SOLRIndexWriter
           solr.server.url : URL of the SOLR instance
           solr.zookeeper.hosts : URL of the Zookeeper quorum
           solr.commit.size : buffer size when sending to SOLR (default 1000)
           solr.mapping.file : name of the mapping file for fields (default solrindex-mapping.xml)
           solr.auth : use authentication (default false)
           solr.auth.username : username for authentication
           solr.auth.password : password for authentication

Index the crawl, passing the Solr server URL to the index command as a Java property in -D key=value form:

-Dsolr.server.url=http://localhost:8983/solr/nutch

The run looks like this:
-------------------------------------------------------------------------------------------------------------------------
[devalone@nutch nutch]$ bin/nutch index -Dsolr.server.url=http://localhost:8983/solr/nutch crawl/crawldb -linkdb \
crawl/linkdb -dir crawl/segments/ -filter -normalize -deleteGone

   Segment dir is complete: file:/opt/nutch/apache-nutch-1.14/crawl/segments/20180723113116.
   Segment dir is complete: file:/opt/nutch/apache-nutch-1.14/crawl/segments/20180723124830.
   Segment dir is complete: file:/opt/nutch/apache-nutch-1.14/crawl/segments/20180723130401.
   Indexer: starting at 2018-07-23 17:56:53
   Indexer: deleting gone documents: true
   Indexer: URL filtering: true
   Indexer: URL normalizing: true
   Active IndexWriters :
   SOLRIndexWriter
           solr.server.url : URL of the SOLR instance
           solr.zookeeper.hosts : URL of the Zookeeper quorum
           solr.commit.size : buffer size when sending to SOLR (default 1000)
           solr.mapping.file : name of the mapping file for fields (default solrindex-mapping.xml)
           solr.auth : use authentication (default false)
           solr.auth.username : username for authentication
           solr.auth.password : password for authentication

   Indexing 250/250 documents
   Deleting 0 documents
   SolrIndexer: deleting 1/1 documents
   Indexing 250/500 documents
   Deleting 0 documents
   Indexing 250/750 documents
   Deleting 0 documents
   SolrIndexer: deleting 4/5 documents
   Indexing 250/1000 documents
   Deleting 0 documents
   SolrIndexer: deleting 4/9 documents
   Indexing 27/1027 documents
   Deleting 0 documents
   Indexer: number of documents indexed, deleted, or skipped:
   Indexer:      1 deleted (gone)
   Indexer:      8 deleted (redirects)
   Indexer:   1027 indexed (add/update)
   Indexer: finished at 2018-07-23 17:57:14, elapsed: 00:00:21

Once indexing has finished, searches can be run at:

http://localhost:8983/solr/#/nutch/query

Or by host name:
http://nutch.sansovo.org:8983/solr/#/nutch/query

4.7 Step-by-Step: Cleaning Solr
-------------------------------------------------------------------------------------------------------------------------
The last of the individual steps cleans up the Solr index against the crawldb. It is also commonly run after each
crawldb update. Cleaning uses the nutch clean command.

Command summary:

   clean: remove HTTP 301 and 404 documents and duplicates from indexing backends configured via plugins

Usage:
   Usage: CleaningJob <crawldb> [-noCommit]
   Active IndexWriters :
   SOLRIndexWriter
           solr.server.url : URL of the SOLR instance
           solr.zookeeper.hosts : URL of the Zookeeper quorum
           solr.commit.size : buffer size when sending to SOLR (default 1000)
           solr.mapping.file : name of the mapping file for fields (default solrindex-mapping.xml)
           solr.auth : use authentication (default false)
           solr.auth.username : username for authentication
           solr.auth.password : password for authentication

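The clean command is implemented by the Java class CleaningJob. It scans the whole crawldb for entries marked as gone
(DB_GONE, for example pages that returned 404), together with the redirects and duplicates named in the summary above,
and sends delete requests for them to Solr. Solr then removes those documents, which keeps the index free of stale
entries.

Clean Solr: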
   [devalone@nutch nutch]$ nutch clean crawl/crawldb -Dsolr.server.url=http://nutch.sansovo.org:8983/solr/nutch
   SolrIndexer: deleting 2/2 documents
   SolrIndexer: deleting 2/2 documents
   ...

5. Using the crawl script
-------------------------------------------------------------------------------------------------------------------------
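The crawl script bundles the individual steps above into a single executable script. It injects the seed URLs once and
then repeats the following cycle for the requested number of rounds:

generate->fetch->parse->updatedb

As the log in section 5.2 shows, with -i it also indexes the results and cleans the index at the end.

For small crawls that do not cover the whole web, the crawl command can do the entire job.

Usage: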
   Usage: crawl [-i|--index] [-D "key=value"] [-w|--wait] [-s <Seed Dir>] <Crawl Dir> <Num Rounds>
           -i|--index      Indexes crawl results into a configured indexer
           -D              A Java property to pass to Nutch calls
           -w|--wait       NUMBER[SUFFIX] Time to wait before generating a new segment when no URLs
                           are scheduled for fetching. Suffix can be: s for second,
                           m for minute, h for hour and d for day. If no suffix is
                           specified second is used by default.
           -s Seed Dir     Path to seeds file(s)
           Crawl Dir       Directory where the crawl/link/segments dirs are saved
           Num Rounds      The number of rounds to run this crawl for

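The last argument, <Num Rounds>, is the number of times the crawl cycle is repeated. Each additional round fetches more
pages and reaches deeper and wider, and the time and resources needed grow very quickly (in the step-by-step run above
the crawldb went from 3 to 57 to 1361 to 18442 URLs over three rounds). In a test environment keep it small; 2 or 3
rounds are enough.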
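
To get a feel for how fast a crawl grows per round, the "TOTAL urls" figures from the step-by-step run in section 4.3
(3 seeds, then 57, 1361 and 18442) can be turned into growth factors. This is a small illustrative calculation;
growth_factors is a hypothetical helper name.

```shell
# Print the ratio between each pair of consecutive URL totals.
# args: total URL counts after successive rounds
growth_factors() {
  echo "$@" | awk '{
    for (i = 2; i <= NF; i++) printf "%s%.1f", (i > 2 ? " " : ""), $i / $(i - 1)
    print ""
  }'
}

growth_factors 3 57 1361 18442    # prints: 19.0 23.9 13.6
```

With the crawldb growing by a factor of more than ten per round on this site, a few extra rounds quickly reach hundreds
of thousands of known URLs.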

5.1 Preparing the test environment
-------------------------------------------------------------------------------------------------------------------------
To start from a clean environment, use a new directory, testcrawl, as <Crawl Dir>, and create a separate Solr core named
crawl:

   [devalone@nutch solr]$ cd server/solr/configsets/
   [devalone@nutch configsets]$ cp -r nutch crawl
   [devalone@nutch solr]$ ll
   總用量 12
   drwxr-xr-x 7 devalone devalone 122 7月 24 14:05 configsets
   drwxrwxr-x 2 devalone devalone    6 7月 24 14:03 crawl
   drwxrwxr-x 4 devalone devalone   53 7月 23 17:19 nutch
   -rw-r--r-- 1 devalone devalone 3037 5月 18 17:59 README.txt
   -rw-r--r-- 1 devalone devalone 2117 2月 22 05:53 solr.xml
   -rw-r--r-- 1 devalone devalone 975 2月 22 05:53 zoo.cfg

   [devalone@nutch solr]$ cd $SOLR_INSTALL

Delete the core once to clean up any earlier state:
[devalone@nutch solr]$ solr delete -c crawl

Deleting core 'crawl' using command:
http://localhost:8983/solr/admin/cores?action=UNLOAD&core=crawl&deleteIndex=true&deleteDataDir=true&deleteInstanceDir=true

   {"responseHeader":{
       "status":0,
       "QTime":9}}

Create the crawl core:
[devalone@nutch solr]$ bin/solr create -c crawl -d server/solr/configsets/crawl/conf/

Copying configuration to new core instance directory:
/opt/solr/solr/server/solr/crawl

Creating new core 'crawl' using command:
http://localhost:8983/solr/admin/cores?action=CREATE&name=crawl&instanceDir=crawl

   {
      "responseHeader":{
       "status":0,
       "QTime":387},
      "core":"crawl"}

Run a query at http://nutch.sansovo.org:8983/solr/#/crawl/query; it should return nothing, because the newly created
crawl core has not been indexed yet.

This gives the Solr server URL:

   -Dsolr.server.url=http://localhost:8983/solr/crawl/
   or:
   -Dsolr.server.url=http://nutch.sansovo.org:8983/solr/crawl/

5.2 Running the crawl script
-------------------------------------------------------------------------------------------------------------------------
Use the seed file urls/seed.txt defined earlier and set <Num Rounds> to 3.

Because the run may take a long time, start the crawl script under nohup:

[devalone@nutch nutch]$ nohup crawl -i -D solr.server.url=http://nutch.sansovo.org:8983/solr/crawl/ -s urls/ testcrawl 3 &
[1] 16512
[devalone@nutch nutch]$ nohup: 忽略輸入並把輸出追加到"nohup.out"

The job runs in the background (because of the trailing &), and nohup appends its console output to nohup.out in the
current directory:

   [devalone@nutch nutch]$ ll nohup.out
   -rw------- 1 devalone devalone 21470 7月 24 14:36 nohup.out
   [devalone@nutch nutch]$ ll nohup.out
   -rw------- 1 devalone devalone 65054 7月 24 14:41 nohup.out

The nohup.out log keeps growing.

Show the latest output:
   [devalone@nutch nutch]$ tail nohup.out
   -activeThreads=50, spinWaiting=50, fetchQueues.totalSize=815, fetchQueues.getQueueCount=1
   FetcherThread 48 fetching https://blog.csdn.net/X8i0Bev/article/details/81091490 (queue crawl delay=5000ms)
   -activeThreads=50, spinWaiting=50, fetchQueues.totalSize=814, fetchQueues.getQueueCount=1
   -activeThreads=50, spinWaiting=50, fetchQueues.totalSize=814, fetchQueues.getQueueCount=1
   -activeThreads=50, spinWaiting=50, fetchQueues.totalSize=814, fetchQueues.getQueueCount=1
   -activeThreads=50, spinWaiting=50, fetchQueues.totalSize=814, fetchQueues.getQueueCount=1
   -activeThreads=50, spinWaiting=50, fetchQueues.totalSize=814, fetchQueues.getQueueCount=1
   FetcherThread 48 fetching https://blog.csdn.net/h8y0bdjvukwe1lbozle/article/details/81117214 (queue crawl delay=5000ms)
   -activeThreads=50, spinWaiting=50, fetchQueues.totalSize=813, fetchQueues.getQueueCount=1
   -activeThreads=50, spinWaiting=50, fetchQueues.totalSize=813, fetchQueues.getQueueCount=1
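
The fetcher status lines carry a fetchQueues.totalSize counter, which gives a rough idea of how many URLs are still
waiting in the current segment. A minimal sketch that pulls the latest value out of the log, assuming the line format
shown above; remaining_in_queue is a made-up helper name.

```shell
# Print the most recent fetchQueues.totalSize value found in a fetcher log.
# Assumption: status lines contain "fetchQueues.totalSize=<number>" as in the output above.
remaining_in_queue() {
  grep -o 'fetchQueues\.totalSize=[0-9]*' "$1" | tail -n 1 | cut -d= -f2
}
```

For example: remaining_in_queue nohup.out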

List the crawl directory:

   [devalone@nutch nutch]$ ll testcrawl/
   總用量 0
   drwxrwxr-x 4 devalone devalone 32 7月 24 14:40 crawldb
   drwxrwxr-x 3 devalone devalone 21 7月 24 14:40 linkdb
   drwxrwxr-x 5 devalone devalone 72 7月 24 14:40 segments

The testcrawl directory was created automatically, and inside it nutch created its three data directories: crawldb,
linkdb and segments.

Check the background nohup crawl job:
[devalone@nutch nutch]$ jobs
[1]+ 運行中 nohup crawl -i -D solr.server.url=http://nutch.sansovo.org:8983/solr/crawl/ -s urls/ testcrawl 3 &

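It is still running...

Even while the script is still running, open the Solr link: http://nutch.sansovo.org:8983/solr/#/crawl/query

Running a search there already returns results, which shows that the crawl script has completed at least one round.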

...

When the run has finished:
[devalone@nutch nutch]$ jobs -l
[1]+ 16512 完成 nohup crawl -i -D solr.server.url=http://nutch.sansovo.org:8983/solr/crawl/ -s urls/ testcrawl 3 &

   [devalone@nutch nutch]$ head nohup.out
   Injecting seed URLs
   /opt/nutch/nutch/bin/nutch inject testcrawl//crawldb urls/
   Injector: starting at 2018-07-24 14:35:11
   Injector: crawlDb: testcrawl/crawldb
   Injector: urlDir: urls

   [devalone@nutch nutch]$ tail nohup.out
   Indexer: number of documents indexed, deleted, or skipped:
   Indexer:   1057 indexed (add/update)
   Indexer: finished at 2018-07-24 16:15:54, elapsed: 00:00:08
   Cleaning up index if possible
   /opt/nutch/nutch/bin/nutch clean -Dsolr.server.url=http://nutch.sansovo.org:8983/solr/crawl/ testcrawl/crawldb
   SolrIndexer: deleting 3/3 documents
   2018年 07月 24日星期二 16:15:57 CST : Finished loop with 3 iterations


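Comparing the start and end times, the whole run took about 1 hour 40 minutes: 1:40:43 from the Injector start
(14:35:11) to the end of the last indexing step (16:15:54), or 1:40:46 up to the final log line (16:15:57). 1057 URLs
were indexed into Solr, and the last step ran the Solr cleanup.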
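
The elapsed time can be computed from the timestamps in the log rather than by hand. A minimal sketch for two HH:MM:SS
times on the same day; the helper names to_seconds and elapsed_hms are made up for this example.

```shell
# Convert HH:MM:SS to seconds (awk treats "08" and "09" as decimal numbers).
to_seconds() {
  echo "$1" | awk -F: '{ print $1 * 3600 + $2 * 60 + $3 }'
}

# Print the difference between two same-day times as HH:MM:SS.
# usage: elapsed_hms <start> <end>
elapsed_hms() {
  d=$(( $(to_seconds "$2") - $(to_seconds "$1") ))
  printf '%02d:%02d:%02d\n' $(( d / 3600 )) $(( d % 3600 / 60 )) $(( d % 60 ))
}

elapsed_hms 14:35:11 16:15:54    # Injector start to Indexer finish: 01:40:43
elapsed_hms 14:35:11 16:15:57    # Injector start to last log line:  01:40:46
```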

Open the Solr link http://nutch.sansovo.org:8983/solr/#/crawl/query and run a search; it now returns more results.

6. Setup Solr for search (local installation)
-------------------------------------------------------------------------------------------------------------------------
Each Nutch release is developed and tested against a specific Solr version, but a nearby version can also be tried. The
table below maps Nutch versions to Solr versions:

   +-------+-------+
   | Nutch | Solr  |
   +-------+-------+
   | 1.14  | 6.6.0 |
   +-------+-------+
   | 1.13  | 5.5.0 |
   +-------+-------+
   | 1.12  | 5.4.1 |
   +-------+-------+
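
In an install script, the table can be encoded as a small lookup so the matching Solr version is chosen from the Nutch
version. This sketch covers only the three rows listed above; solr_for_nutch is a hypothetical helper name.

```shell
# Print the Solr version a given Nutch release was tested against (rows from the table above only).
solr_for_nutch() {
  case "$1" in
    1.14) echo 6.6.0 ;;
    1.13) echo 5.5.0 ;;
    1.12) echo 5.4.1 ;;
    *)    echo "unknown Nutch version: $1" >&2; return 1 ;;
  esac
}
```

For example, solr_for_nutch 1.14 prints 6.6.0; unknown versions return a non-zero status.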

6.1 Installing Solr in local mode
-------------------------------------------------------------------------------------------------------------------------
① Download the Solr binary release

Download the binary release of the required version from http://archive.apache.org/dist/lucene/solr/. This example uses
version 6.6.5 from the Tsinghua mirror:

[devalone@nutch solr]$ wget https://mirrors.tuna.tsinghua.edu.cn/apache/lucene/solr/6.6.5/solr-6.6.5.tgz

② Unpack it into a suitable installation directory, /opt/solr/ in this example:

   [devalone@nutch solr]$ tar -zxvf solr-6.6.5.tgz
   [devalone@nutch solr]$ pwd
   /opt/solr
   [devalone@nutch solr]$ ll
   總用量 0
   drwxrwxr-x 9 devalone devalone 201 7月 23 16:30 solr-6.6.5

   [devalone@nutch solr]$ ll solr-6.6.5/
   總用量 1412
   drwxr-xr-x 3 devalone devalone    147 6月 30 13:28 bin
   -rw-r--r-- 1 devalone devalone 693983 6月 30 13:28 CHANGES.txt
   drwxr-xr-x 12 devalone devalone    202 6月 30 14:22 contrib
   drwxrwxr-x 4 devalone devalone   4096 7月 23 16:30 dist
   drwxrwxr-x 3 devalone devalone     38 7月 23 16:30 docs
   drwxr-xr-x 7 devalone devalone    105 7月 23 16:30 example
   drwxr-xr-x 2 devalone devalone 24576 7月 23 16:30 licenses
   -rw-r--r-- 1 devalone devalone 12646 2月 22 05:53 LICENSE.txt
   -rw-r--r-- 1 devalone devalone 648213 6月 30 13:28 LUCENE_CHANGES.txt
   -rw-r--r-- 1 devalone devalone 25549 5月 18 17:59 NOTICE.txt
   -rw-r--r-- 1 devalone devalone   7271 5月 18 17:59 README.txt
   drwxr-xr-x 10 devalone devalone    157 7月 23 16:30 server

Create a symbolic link for convenience:

   [devalone@nutch solr]$ sudo ln -s solr-6.6.5 solr
   [devalone@nutch solr]$ ll
   總用量 0
   lrwxrwxrwx 1 root     root      10 7月 23 16:34 solr -> solr-6.6.5
   drwxrwxr-x 9 devalone devalone 201 7月 23 16:30 solr-6.6.5

   [devalone@nutch solr]$ cd solr
   [devalone@nutch solr]$ pwd
   /opt/solr/solr

Edit /etc/profile to define the environment variable SOLR_INSTALL and add $SOLR_INSTALL/bin to PATH:

[devalone@nutch solr]$ sudo vi /etc/profile

SOLR_INSTALL=/opt/solr/solr
PATH=$SOLR_INSTALL/bin:$PATH

   export PATH JAVA_HOME ANT_HOME NUTCH_INSTALL SOLR_INSTALL


   ● NOTE:
   ---------------------------------------------------------------------------------------------------------------------
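   Solr uses the SOLR_HOME environment variable for its own purposes, so do not use SOLR_HOME to point at the Solr
   installation directory; otherwise starting Solr fails with an error like this: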

       [devalone@nutch solr]$ bin/solr start

       Solr home directory /opt/solr-6.6.0 must contain a solr.xml file!

   This example uses SOLR_INSTALL for the installation directory, which works without problems.


Save the file, exit, and run:

   [devalone@nutch solr]$ source /etc/profile

so that the change takes effect immediately.

[devalone@nutch solr]$ solr -version
6.6.5
OK.

③ Create the resources for a new nutch Solr core:

   [devalone@nutch solr]$ cd server/solr/configsets/
   [devalone@nutch configsets]$ cp -r basic_configs nutch

④ Copy the nutch schema.xml into ${SOLR_INSTALL}/server/solr/configsets/nutch/conf:

   [devalone@nutch solr]$ cp $NUTCH_INSTALL/conf/schema.xml $SOLR_INSTALL/server/solr/configsets/nutch/conf

   [devalone@nutch solr]$ ll $SOLR_INSTALL/server/solr/configsets/nutch/conf/
   總用量 168
   -rw-r--r-- 1 devalone devalone 3974 7月 23 16:49 currency.xml
   -rw-r--r-- 1 devalone devalone 1412 7月 23 16:49 elevate.xml
   drwxr-xr-x 2 devalone devalone 4096 7月 23 16:49 lang
   -rw-r--r-- 1 devalone devalone 58074 7月 23 16:49 managed-schema
   -rw-r--r-- 1 devalone devalone   308 7月 23 16:49 params.json
   -rw-r--r-- 1 devalone devalone   873 7月 23 16:49 protwords.txt
   -rw-rw-r-- 1 devalone devalone 22956 7月 23 16:55 schema.xml
   -rw-r--r-- 1 devalone devalone 55039 7月 23 16:49 solrconfig.xml
   -rw-r--r-- 1 devalone devalone   781 7月 23 16:49 stopwords.txt
   -rw-r--r-- 1 devalone devalone 1124 7月 23 16:49 synonyms.txt

⑤ Make sure $SOLR_INSTALL/server/solr/configsets/nutch/conf/ contains no managed-schema file:

[devalone@nutch solr]$ rm $SOLR_INSTALL/server/solr/configsets/nutch/conf/managed-schema

⑥ Start the Solr server:

[devalone@nutch solr]$ solr start
NOTE: Please install lsof as this script needs it to determine if Solr is listening on port 8983.

Started Solr server on port 8983 (pid=5425). Happy searching!

⑦ Create the nutch core:

[devalone@nutch solr]$ bin/solr create -c nutch -d server/solr/configsets/nutch/conf/

Copying configuration to new core instance directory:
/opt/solr/solr/server/solr/nutch

Creating new core 'nutch' using command:
http://localhost:8983/solr/admin/cores?action=CREATE&name=nutch&instanceDir=nutch

   {
      "responseHeader":{
       "status":0,
       "QTime":1399},
      "core":"nutch"}

⑧ Append the core name to the Solr server URL:

   -Dsolr.server.url=http://localhost:8983/solr/nutch

   or:
   -Dsolr.server.url=http://nutch.sansovo.org:8983/solr/nutch/

   ● NOTE:
   ---------------------------------------------------------------------------------------------------------------------
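   This string is how nutch commands are told which Solr server to use. The nutch index command, for example, takes it
   as a parameter and indexes into the Solr core named nutch. Only once the index exists can searches be served from
   http://localhost:8983/solr/nutch, as described in section 4.6.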



6.2 Verifying the Solr installation
-------------------------------------------------------------------------------------------------------------------------
Once Solr is installed and started, it should be reachable at:

   http://localhost:8983/solr/#/

   or:
   http://nutch.sansovo.org:8983/solr/#/

From there, navigate to the nutch core and inspect managed-schema and so on.

-------------------------------------------------------------------------------------------------------------------------
References:

https://wiki.apache.org/nutch/NutchTutorial
