Apache Nutch
Apache Nutch, which originated in the Apache Lucene project, is a highly extensible and highly scalable open-source web crawler. Project home page:
http://nutch.apache.org/
To support a variety of underlying data stores, the project is currently developed on two code branches:
● Nutch 1.x: the mature, production-grade web crawler. Through fine-tuned configuration, this branch takes full advantage of Apache
Hadoop's very powerful batch-processing data structures. The latest release on this branch is Nutch 1.14, released on December 23, 2017 and built against Hadoop 2.7.4.
The release report for this version is available at:
https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=10680&version=12340218
● Nutch 2.x: an emerging version forked from 1.x. The most important difference is that the concrete underlying data storage is
abstracted into a storage layer, with Apache Gora serving as a data middle layer that handles object persistence mapping. Users can
thus implement very flexible data models and store any data, for example fetch time, status, content, parsed text, outlinks, and
inlinks, in various kinds of NoSQL storage back ends.
The latest release on this branch is Nutch 2.3.1, released on January 21, 2016. The release report is available at:
https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=10680&version=12329371
The recommended Gora back-end data stores for this release are:
□ Apache Avro 1.7.6
□ Apache Hadoop 1.2.1 and 2.5.2
□ Apache HBase 0.98.8-hadoop2 (although also tested with 1.X)
□ Apache Cassandra 2.0.2
□ Apache Solr 4.10.3
□ MongoDB 2.6.X
□ Apache Accumulo 1.5.1
□ Apache Spark 1.4.1
According to the Nutch Board meeting minutes of 2018-01-17 and 2018-04-18, Nutch 2.4 is under development and was planned for
release within a few weeks:
https://whimsy.apache.org/board/minutes/Nutch.html
...
The release of 2.4 is on the agenda.
...
We plan to release 2.4 during the next weeks.
Through its pluggable, modular design, Nutch exposes extensible interfaces such as Parse, Index, and ScoringFilter for custom
implementations; Apache Tika, for example, is used to implement Parse. There are also pluggable indexing modules for Apache Solr, SolrCloud, Elastic Search, and others.
Nutch can run on a single machine, but it gains its real strength in extensibility and scalability from distributed processing on a Hadoop cluster.
1. Requirements:
-------------------------------------------------------------------------------------------------------------------------
● A Linux/Unix environment, or Windows with Cygwin
● JDK 1.8/Java 8
● Apache Ant (needed only when building from source)
2. Install Nutch:
-------------------------------------------------------------------------------------------------------------------------
Download page:
http://nutch.apache.org/downloads.html
There are two options for installing Nutch locally:
2.1 Installing from a binary distribution
-------------------------------------------------------------------------------------------------------------------------
① Download the Nutch binary distribution:
Prefer a mirror site in your region for faster downloads; here the Tsinghua mirror is used:
https://mirrors.tuna.tsinghua.edu.cn/apache/nutch/1.14/apache-nutch-1.14-bin.tar.gz
② Unpack the downloaded archive:
[devalone@nutch ~]$ tar -zxvf apache-nutch-1.14-bin.tar.gz
[devalone@nutch ~]$ ll
總用量 243272
drwxrwxr-x 7 devalone devalone 123 7月 19 15:50 apache-nutch-1.14
-rw-rw-r-- 1 devalone devalone 249107211 7月 19 15:48 apache-nutch-1.14-bin.tar.gz
③ Install apache-nutch-1.14
Move the unpacked apache-nutch-1.14 directory to a suitable install location, for example under /opt/nutch/:
[devalone@nutch ~]$ sudo mv apache-nutch-1.14 /opt/nutch/
Create symbolic links to ease management of multiple Nutch versions, for example:
[devalone@nutch nutch]$ pwd
/opt/nutch
[devalone@nutch nutch]$ ls -l
總用量 0
drwxrwxr-x 7 devalone devalone 123 6月 29 16:52 apache-nutch-1.14
lrwxrwxrwx 1 root root 8 6月 29 17:01 nutch -> nutch1.x
lrwxrwxrwx 1 root root 17 6月 29 17:00 nutch1.x -> apache-nutch-1.14
Edit /etc/profile to create a NUTCH_INSTALL environment variable and add nutch/bin to the PATH, as follows:
JAVA_HOME=/usr/java/default
PATH=$JAVA_HOME/bin:$PATH
NUTCH_INSTALL=/opt/nutch/nutch
PATH=$NUTCH_INSTALL/bin:$PATH
export PATH JAVA_HOME NUTCH_INSTALL
Save the file and run source /etc/profile so the changes take effect immediately.
NOTE:
---------------------------------------------------------------------------------------------------------------------
The NUTCH_HOME environment variable is used internally by the nutch script, so avoid setting it in your shell environment to prevent unnecessary trouble.
NOTE:
---------------------------------------------------------------------------------------------------------------------
The released binary distribution only works in local (standalone) mode; it cannot run in a distributed environment. To run Nutch on
Hadoop in distributed or pseudo-distributed mode, you must build from source.
2.2 Installing Nutch from a source distribution
-------------------------------------------------------------------------------------------------------------------------
2.2.1 Setting up the build environment
-------------------------------------------------------------------------------------------------------------------------
Building Nutch from source requires Ant together with the Ivy dependency libraries. Set up the build environment as follows:
[devalone@online apache]$ wget http://mirrors.tuna.tsinghua.edu.cn/apache//ant/binaries/apache-ant-1.10.5-bin.tar.gz
[devalone@online apache]$ wget http://mirrors.hust.edu.cn/apache//ant/ivy/2.5.0-rc1/apache-ivy-2.5.0-rc1-bin.tar.gz
[devalone@online apache]$ ll
總用量 8352
-rw-rw-r--. 1 devalone devalone 5829814 7月 13 18:10 apache-ant-1.10.5-bin.tar.gz
-rw-rw-r--. 1 devalone devalone 2716983 4月 12 23:46 apache-ivy-2.5.0-rc1-bin.tar.gz
[devalone@nutch apache]$ tar -zxvf apache-ant-1.10.5-bin.tar.gz
[devalone@nutch apache]$ tar -zxvf apache-ivy-2.5.0-rc1-bin.tar.gz
[devalone@nutch apache]$ sudo mv apache-ant-1.10.5 /usr/local/lib/apache/ant/
[devalone@nutch apache]$ cd /usr/local/lib/apache/ant/
[devalone@nutch ant]$ sudo ln -s apache-ant-1.10.5 ant
[devalone@nutch ant]$ pwd
/usr/local/lib/apache/ant/ant
Copy the Ivy jar into Ant's lib/ directory:
[devalone@nutch apache-ivy-2.5.0-rc1]$ cp ivy-2.5.0-rc1.jar /usr/local/lib/apache/ant/ant/lib/
NOTE:
---------------------------------------------------------------------------------------------------------------------
The Nutch source package actually ships an ivy-2.4.0.jar in its ivy/ directory; that jar can be used directly instead of downloading
Ivy separately.
Edit /etc/profile to set the ANT_HOME environment variable and add ant/bin to the PATH:
JAVA_HOME=/usr/java/default
PATH=$JAVA_HOME/bin:$PATH
ANT_HOME=/usr/local/lib/apache/ant/ant
PATH=$ANT_HOME/bin:$PATH
NUTCH_INSTALL=/opt/nutch/nutch
PATH=$NUTCH_INSTALL/bin:$PATH
export PATH JAVA_HOME ANT_HOME NUTCH_INSTALL
Save and exit, then run:
[devalone@nutch ant]$ source /etc/profile
to make the changes take effect immediately.
[devalone@nutch ant]$ ant -version
Apache Ant(TM) version 1.10.5 compiled on July 10 2018
The build environment is ready.
2.2.2 Building Nutch from source
-------------------------------------------------------------------------------------------------------------------------
① Download the Nutch source package, again from the Tsinghua mirror:
https://mirrors.tuna.tsinghua.edu.cn/apache/nutch/1.14/apache-nutch-1.14-src.tar.gz
② Unpack the source package:
[devalone@nutch ~]$ tar -zxvf apache-nutch-1.14-src.tar.gz
[devalone@nutch ~]$ ll
總用量 5268
drwxrwxr-x 3 devalone devalone 109 7月 20 10:12 apache
drwxrwxr-x 7 devalone devalone 162 7月 20 10:36 apache-nutch-1.14
-rw-rw-r-- 1 devalone devalone 5390394 7月 20 10:36 apache-nutch-1.14-src.tar.gz
③ Enter the unpacked source directory apache-nutch-1.14:
[devalone@nutch ~]$ cd apache-nutch-1.14
[devalone@nutch apache-nutch-1.14]$ ll
總用量 504
-rw-rw-r-- 1 devalone devalone 54529 12月 19 2017 build.xml
-rw-rw-r-- 1 devalone devalone 106848 12月 19 2017 CHANGES.txt
drwxr-xr-x 2 devalone devalone 4096 7月 20 10:36 conf
-rw-rw-r-- 1 devalone devalone 6297 12月 19 2017 default.properties
drwxr-xr-x 3 devalone devalone 17 12月 19 2017 docs
drwxr-xr-x 2 devalone devalone 148 7月 20 10:36 ivy
drwxr-xr-x 3 devalone devalone 20 12月 19 2017 lib
-rw-rw-r-- 1 devalone devalone 329066 12月 19 2017 LICENSE.txt
-rw-rw-r-- 1 devalone devalone 427 12月 19 2017 NOTICE.txt
drwxr-xr-x 7 devalone devalone 76 12月 19 2017 src
④ Run the source build from this directory:
● sonar-ant-task-2.2.jar
---------------------------------------------------------------------------------------------------------------------
The Nutch build depends on the sonar-ant-task-2.2.jar task package, but Ivy cannot load that package's resource definition
org/sonar/ant/antlib.xml, so the jar must be downloaded manually; otherwise the following error occurs:
[taskdef] Could not load definitions from resource org/sonar/ant/antlib.xml. It could not be found.
ivy-download:
[taskdef] Could not load definitions from resource org/sonar/ant/antlib.xml. It could not be found.
Download address:
https://download.csdn.net/download/devalone/10554266
Copy the downloaded sonar-ant-task-2.2.jar into the ant/lib/ directory:
[devalone@nutch ant]$ sudo cp ~/sonar-ant-task-2.2.jar ant/lib/
● Edit the ivy/ivysettings.xml file and substitute faster Ivy repositories:
---------------------------------------------------------------------------------------------------------------------
[devalone@nutch apache-nutch-1.14]$ vi ivy/ivysettings.xml
Nutch's ivysettings.xml configures three Maven module repositories:
<property name="oss.sonatype.org"
value="http://oss.sonatype.org/content/repositories/releases/"
override="false"/>
<property name="repo.maven.org"
value="http://repo1.maven.org/maven2/"
override="false"/>
<property name="repository.apache.org"
value="https://repository.apache.org/content/repositories/snapshots/"
override="false"/>
Replace these three repositories with faster ones, for example:
http://maven.aliyun.com/nexus/content/groups/public/
http://maven.aliyun.com/nexus/content/repositories/central/
http://maven.aliyun.com/nexus/content/repositories/apache-snapshots/
This example uses a Maven repository already set up on the local network:
<property name="oss.sonatype.org"
value="http://repo.sansovo.org:8081/nexus/content/groups/public/"
override="false"/>
<property name="repo.maven.org"
value="http://repo.sansovo.org:8081/nexus/content/groups/public/"
override="false"/>
<property name="repository.apache.org"
value="http://repo.sansovo.org:8081/nexus/content/groups/public-snapshot/"
override="false"/>
● Run the build:
---------------------------------------------------------------------------------------------------------------------
[devalone@nutch apache-nutch-1.14]$ ant
Buildfile: /home/devalone/apache-nutch-1.14/build.xml
Trying to override old definition of task javac
ivy-probe-antlib:
ivy-download:
ivy-download-unchecked:
ivy-init-antlib:
ivy-init:
init:
clean-default-lib:
resolve-default:
[ivy:resolve] :: Apache Ivy 2.5.0-rc1 - 20180412005306 :: http://ant.apache.org/ivy/ ::
[ivy:resolve] :: loading settings :: file = /home/devalone/apache-nutch-1.14/ivy/ivysettings.xml
[ivy:resolve] downloading http://repo.sansovo.org:8081/nexus/content/groups/public/org/apache/commons
/commons-compress/1.14/commons-compress-1.14.jar ...
[ivy:resolve] ........................................................................................
......................................... (517kB)
[ivy:resolve] .. (0kB)
[ivy:resolve] [SUCCESSFUL ] org.apache.commons#commons-compress;1.14!commons-compress.jar (6470ms)
[ivy:resolve] downloading http://repo.sansovo.org:8081/nexus/content/groups/public/org/apache/hadoop/
hadoop-common/2.7.4/hadoop-common-2.7.4.jar ...
[ivy:resolve] ........................................................................................
..............................................
....
BUILD SUCCESSFUL
Total time: 28 minutes 5 seconds
NOTE:
---------------------------------------------------------------------------------------------------------------------
It is best not to interrupt the first build; a corrupted Ivy dependency download can cause strange compile errors. If errors do occur,
delete the ~/.ivy2/cache/ directory and rebuild.
On later builds, with the ~/.ivy2 cache fully populated, the build finishes quickly, e.g.:
...
BUILD SUCCESSFUL
Total time: 35 seconds
⑤ After the build, a runtime directory appears in the source tree; it contains the built binary executable packages:
[devalone@nutch apache-nutch-1.14]$ ll runtime/
總用量 0
drwxrwxr-x 3 devalone devalone 46 7月 21 11:36 deploy
drwxrwxr-x 7 devalone devalone 67 7月 21 11:37 local
The deploy directory is the distributed (Hadoop) executable package, and the local directory is the local-mode package. Install the
local directory to a suitable location using the steps given in "Installing from a binary distribution" above.
NOTE:
---------------------------------------------------------------------------------------------------------------------
Running ant clean removes the runtime directory, so back up any configuration files first.
2.3 Verifying the Nutch installation
-------------------------------------------------------------------------------------------------------------------------
Whichever installation method was used, run:
[devalone@nutch ~]$ nutch
nutch 1.14
Usage: nutch COMMAND
where COMMAND is one of:
readdb read / dump crawl db
mergedb merge crawldb-s, with optional filtering
readlinkdb read / dump link db
inject inject new urls into the database
generate generate new segments to fetch from crawl db
freegen generate new segments to fetch from text files
fetch fetch a segment's pages
parse parse a segment's pages
readseg read / dump segment data
mergesegs merge several segments, with optional filtering and slicing
...
Nutch prints its version number, usage, and a brief description of each command, which confirms that Nutch is successfully installed for local-mode operation.
NOTE:
---------------------------------------------------------------------------------------------------------------------
It is best to map the local hostname to the correct IP address to avoid unnecessary errors. Below is this test machine's setup (a real DNS configuration is even better):
[devalone@nutch ~]$ cat /etc/hosts
127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4 nutch.sansovo.org nutch
::1 localhost localhost.localdomain localhost6 localhost6.localdomain6 nutch.sansovo.org nutch
192.168.1.113 nutch.sansovo.org nutch
3. Crawling a web site
-------------------------------------------------------------------------------------------------------------------------
Before crawling a web site, Nutch requires two pieces of configuration:
① Crawl properties; at minimum, the crawler name must be configured.
② The seed list of URLs to crawl.
3.1 Configuring crawl properties
-------------------------------------------------------------------------------------------------------------------------
The default crawl properties are set in $NUTCH_INSTALL/conf/nutch-default.xml. Do not change settings in this file; in most cases it
should stay untouched, ideally marked read-only. nutch-default.xml lists every property together with its description, and it is worth
taking the time to study those descriptions carefully; they help with understanding and with advanced Nutch configuration.
The $NUTCH_INSTALL/conf/nutch-site.xml file holds custom crawl properties, and its settings override the defaults from
$NUTCH_INSTALL/conf/nutch-default.xml. All custom configuration therefore belongs in $NUTCH_INSTALL/conf/nutch-site.xml.
● Nutch requires at least the http.agent.name property to be set before it will run, so open conf/nutch-site.xml and add something like:
[devalone@nutch nutch]$ vi conf/nutch-site.xml
<property>
<name>http.agent.name</name>
<value>My Nutch Spider</value>
</property>
Any reasonable string works for this value, as long as it is not empty or whitespace-only.
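For reference, the same minimal file can also be produced non-interactively from the shell. This is a sketch only: it overwrites any existing conf/nutch-site.xml, and the agent value is just the placeholder from above.

```shell
# Write a minimal conf/nutch-site.xml containing only http.agent.name.
# CAUTION: this overwrites any existing nutch-site.xml customizations.
mkdir -p conf
cat > conf/nutch-site.xml <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>My Nutch Spider</value>
  </property>
</configuration>
EOF
```

Run it from the $NUTCH_INSTALL directory so the file lands next to nutch-default.xml.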
NOTE:
---------------------------------------------------------------------------------------------------------------------
The http.agent.name setting actually deserves some care. Some sites use the robots protocol or other mechanisms to block crawlers.
For robots-based blocking, http.agent.name can be set to the User-Agent request-header value of a normal browser; there are many ways to obtain one, not covered here. Example:
<property>
<name>http.agent.name</name>
<value>Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:61.0) Gecko/20100101 Firefox/61.0</value>
<description>HTTP 'User-Agent' request header. MUST NOT be empty -
please set this to a single word uniquely related to your organization.
NOTE: You should also check other related properties:
http.robots.agents
http.agent.description
http.agent.url
http.agent.email
http.agent.version
and set their values appropriately.
</description>
</property>
or, on Linux:
<value>Mozilla/5.0 (X11; Fedora; Linux x86_64; rv:60.0) Gecko/20100101 Firefox/60.0</value>
The Baidu webmaster tool at https://ziyuan.baidu.com/robots can be used to inspect and analyze a site's robots configuration.
● Also confirm that the plugin.includes property contains the indexer plugin indexer-solr, which will be used later to build the
Solr index. In the default configuration file conf/nutch-default.xml it is set to:
<property>
<name>plugin.includes</name>
<value>protocol-http|urlfilter-(regex|validator)|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
<description>Regular expression naming plugin directory names to
include. Any plugin not matching this expression is excluded.
In any case you need at least include the nutch-extensionpoints plugin. By
default Nutch includes crawling just HTML and plain text via HTTP,
and basic indexing and search plugins. In order to use HTTPS please enable
protocol-httpclient, but be aware of possible intermittent problems with the
underlying commons-httpclient library. Set parsefilter-naivebayes for classification based focused crawler.
</description>
</property>
As shown, the indexer-solr plugin is already included in the configuration, so the default can be kept.
NOTE:
---------------------------------------------------------------------------------------------------------------------
Note that plugin.includes does not include protocol-httpclient, so according to the property description the current configuration
should support only http URLs, not https. To make Nutch support https, the protocol-httpclient plugin would have to be added to this
property. The underlying commons-httpclient library is not very stable, however, and may cause problems, so for stability's sake
protocol-httpclient is left unconfigured for now; it can be set up carefully and stability-tested later when needed.
The property description's claim about protocol-httpclient turns out to be inaccurate, though: the tests below deliberately include
an https URL to observe what Nutch actually does and draw the corresponding conclusion.
3.2 Configuring the URL seed list
-------------------------------------------------------------------------------------------------------------------------
● The URL seed list contains the web site URLs Nutch should crawl, one URL per line.
● The conf/regex-urlfilter.txt filter file uses regular expressions to restrict the resource types Nutch is allowed to crawl and download.
3.2.1 Creating the URL seed list
-------------------------------------------------------------------------------------------------------------------------
Create a dedicated directory for URL list files and create one or more text files in it; each text file is one seed list. Edit a file
and add the URLs of the web sites to crawl, one URL per line.
Example:
[devalone@nutch nutch]$ pwd
/opt/nutch/nutch
[devalone@nutch nutch]$ mkdir -p urls
[devalone@nutch nutch]$ cd urls
[devalone@nutch urls]$ vi seed.txt
[devalone@nutch urls]$ cat seed.txt
http://news.sohu.com/
http://mil.sohu.com/
https://blog.csdn.net/
To test the https protocol, an https URL was deliberately added to the list: https://blog.csdn.net/
3.2.2 Configuring the regex filter (optional)
-------------------------------------------------------------------------------------------------------------------------
The regular-expression filter, configured in conf/regex-urlfilter.txt, keeps Nutch's crawl within a controlled scope. Left
unrestricted, Nutch could end up crawling web resources across the entire internet, which might take weeks.
Every non-comment, non-blank line in the file contains one regular expression.
Each regular expression starts with + or -:
● + : resources Nutch is allowed to crawl.
● - : resources Nutch is forbidden to crawl.
The first pattern in the file that matches a URL decides whether that URL is included in the crawl or ignored. If no pattern in the file matches, the URL is ignored.
This example restricts Nutch to the sites listed in the seed list and ignores all others, so edit conf/regex-urlfilter.txt and
comment out the following regular expression:
# accept anything else
# +.
Then add these regular expressions:
+^https?://(news|mil)\.sohu\.com/
+^https?://blog\.csdn\.net/
NOTE:
---------------------------------------------------------------------------------------------------------------------
If no domains are specified in regex-urlfilter.txt, every URL linked from the seed pages will be crawled, and the URL count becomes uncontrollable.
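The first-match rule can be illustrated with a small shell emulation. This is a sketch only: the real urlfilter-regex plugin uses Java regex semantics, for which grep -E is merely an approximation.

```shell
# filter_url FILTERFILE URL -> prints "accept" or "reject".
# Emulates regex-urlfilter.txt semantics: the first +/- pattern that
# matches decides; if nothing matches, the URL is ignored (rejected).
filter_url() {
  filterfile=$1 url=$2
  while IFS= read -r line; do
    case $line in
      '#'*|'') continue ;;   # skip comments and blank lines
      +*) printf '%s\n' "$url" | grep -Eq -- "${line#+}" && { echo accept; return 0; } ;;
      -*) printf '%s\n' "$url" | grep -Eq -- "${line#-}" && { echo reject; return 0; } ;;
    esac
  done < "$filterfile"
  echo reject               # no pattern matched
}
```

With only the two + patterns above in the file, http://news.sohu.com/ is accepted, while a URL on any other host falls through every pattern and is rejected.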
4. Using Individual Commands for Whole-Web Crawling
-------------------------------------------------------------------------------------------------------------------------
Whole-web crawling is designed to run on many machines and handle very large crawl jobs that may take weeks to complete. The
individual-command approach allows more control over the crawl by executing it step by step. Note that "whole-web" crawling does not
mean crawling the entire global web: filters and similar mechanisms can restrict the URL list to a controlled scope and skip
uninteresting resources, as described in the previous section.
NOTE:
---------------------------------------------------------------------------------------------------------------------
Jobs of this kind consume enormous time and resources, so this section only discusses the concepts; the actual crawl uses the small-scope seed list and filter configured in the previous section.
4.1 Step-by-Step: Concepts
-------------------------------------------------------------------------------------------------------------------------
The Nutch data consists of:
① The crawl database, or crawldb: information about every URL Nutch knows of, including whether it has been fetched and, if so,
when it was fetched.
② The link database, or linkdb: all known link URLs, including the source URL and anchor text of each link.
③ A set of segments: each segment is a set of URLs that is fetched as a unit. A segment is a directory containing these
subdirectories:
● crawl_generate: the set of URLs to be fetched, stored as a SequenceFile
● crawl_fetch: the fetch status of each URL, stored as a MapFile
● content: the raw content retrieved from each URL, stored as a MapFile
● parse_text: the parsed text of each URL, stored as a MapFile
● parse_data: the outlinks and metadata parsed from each URL, stored as a MapFile
● crawl_parse: the outlink URLs used to update the crawldb, stored as a SequenceFile
4.2 Step-by-Step: Seeding the crawldb with a list of URLs
-------------------------------------------------------------------------------------------------------------------------
The first step is to add the URL seed list to the crawldb with the injector command. It is simple to run, but very important: the
seed URL list for a whole-web crawl should be chosen with care.
The traditional DMOZ directory was long the global URL list favored by crawlers and search engines, but its site http://www.dmoz.org/
closed permanently on March 14, 2017. Many similar commercial DMOZ-style sites have since appeared, for example:
http://www.dmoztools.net
http://www.dmozdir.org
http://www.chinadmoz.org
http://www.webdmoz.org/
http://www.cdmoz.com/
...
These sites index huge numbers of URLs; a URL list suited to your own whole-web crawl can be found and harvested from them, either
with Nutch itself or with a simple script that collects all the URLs a DMOZ-style site lists, to serve as Nutch's initial seed list.
4.2.1 Bootstrapping the injection:
-------------------------------------------------------------------------------------------------------------------------
With the URL seed list prepared, the crawl can be bootstrapped with the nutch inject command.
NOTE:
---------------------------------------------------------------------------------------------------------------------
Running nutch without any command prints a brief description of every Nutch command; running a command with no arguments or options prints that command's usage.
Command summary:
inject: inject new urls into the database
Command usage:
[devalone@nutch nutch]$ bin/nutch inject
Usage: Injector [-D...] <crawldb> <url_dir> [-overwrite|-update] [-noFilter] [-noNormalize] [-filterNormalizeAll]
<crawldb> Path to a crawldb directory. If not present, a new one would be created.
<url_dir> Path to URL file or directory with URL file(s) containing URLs to be injected.
A URL file should have one URL per line, optionally followed by custom metadata.
Blank lines or lines starting with a '#' would be ignored. Custom metadata must
be of form 'key=value' and separated by tabs.
Below are reserved metadata keys:
nutch.score: A custom score for a url
nutch.fetchInterval: A custom fetch interval for a url
nutch.fetchInterval.fixed: A custom fetch interval for a url that is not changed by AdaptiveFetchSchedule
Example:
http://www.apache.org/
http://www.nutch.org/ \t nutch.score=10 \t nutch.fetchInterval=2592000 \t userType=open_source
-overwrite Overwite existing crawldb records by the injected records. Has precedence over 'update'
-update Update existing crawldb records with the injected records. Old metadata is preserved
-nonormalize Do not normalize URLs before injecting
-nofilter Do not apply URL filters to injected URLs
-filterNormalizeAll
Normalize and filter all URLs including the URLs of existing CrawlDb records
-D... set or overwrite configuration property (property=value)
-Ddb.update.purge.404=true
remove URLs with status gone (404) from CrawlDb
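For illustration, a seed file using the tab-separated metadata form described in the usage above could be built like this. It is a sketch: the URLs and values come from the usage example, the metadata keys are the reserved ones listed there, and the file name seed-meta.txt is made up for this demo.

```shell
# Build a seed file where the second entry carries custom metadata
# (tab-separated key=value pairs after the URL).
mkdir -p urls
printf 'http://www.apache.org/\n' > urls/seed-meta.txt
printf 'http://www.nutch.org/\tnutch.score=10\tnutch.fetchInterval=2592000\n' >> urls/seed-meta.txt
cat urls/seed-meta.txt
```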
● Run the seed injection:
Enter the $NUTCH_INSTALL directory:
[devalone@nutch nutch]$ pwd
/opt/nutch/nutch
[devalone@nutch nutch]$ ll
drwxr-xr-x 2 devalone devalone 32 6月 29 16:52 bin
-rw-rw-r-- 1 devalone devalone 106848 12月 19 2017 CHANGES.txt
drwxr-xr-x 2 devalone devalone 4096 7月 21 16:31 conf
drwxr-xr-x 3 devalone devalone 17 12月 19 2017 docs
drwxr-xr-x 3 devalone devalone 8192 6月 29 16:52 lib
-rw-rw-r-- 1 devalone devalone 329066 12月 19 2017 LICENSE.txt
drwxrwxr-x 2 devalone devalone 24 7月 20 18:10 logs
-rw-rw-r-- 1 devalone devalone 427 12月 19 2017 NOTICE.txt
drwxr-xr-x 71 devalone devalone 4096 12月 19 2017 plugins
drwxrwxr-x 2 devalone devalone 22 7月 21 16:25 urls
Run the injection:
[devalone@nutch nutch]$ bin/nutch inject crawl/crawldb urls
Injector: starting at 2018-07-23 10:50:25
Injector: crawlDb: crawl/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: overwrite: false
Injector: update: false
Injector: Total urls rejected by filters: 0
Injector: Total urls injected after normalization and filtering: 3
Injector: Total urls injected but already in CrawlDb: 0
Injector: Total new urls injected: 3
Injector: finished at 2018-07-23 10:50:29, elapsed: 00:00:03
□ The crawldb directory is crawl/crawldb under the current directory (Nutch creates it automatically if it does not exist).
□ The seed directory is the previously created urls directory (it already contains the seed list file, seed.txt in this example;
the directory may hold multiple seed files or subdirectories, and injection loads all seed files in the directory tree into the crawldb).
4.2.2 Verifying the injection
-------------------------------------------------------------------------------------------------------------------------
After the injection finishes, look at the current directory:
[devalone@nutch nutch]$ ll crawl
總用量 0
drwxrwxr-x 4 devalone devalone 32 7月 23 10:50 crawldb
The crawl/crawldb directory has been created.
[devalone@nutch nutch]$ ll crawl/crawldb/
總用量 0
drwxrwxr-x 3 devalone devalone 26 7月 23 10:50 current
drwxrwxr-x 2 devalone devalone 6 7月 23 10:50 old
Two subdirectories were created under it: current and old. Looking further into current:
[devalone@nutch nutch]$ ll crawl/crawldb/current/
總用量 0
drwxrwxr-x 2 devalone devalone 66 7月 23 10:50 part-r-00000
[devalone@nutch nutch]$ ll crawl/crawldb/current/part-r-00000/
總用量 8
-rw-r--r-- 1 devalone devalone 263 7月 23 10:50 data
-rw-r--r-- 1 devalone devalone 213 7月 23 10:50 index
The actual data file data and the index file index have been created and are non-empty, showing that the data was injected into the database and everything is in good shape.
To verify the database contents further, use the readdb command, for example:
[devalone@nutch nutch]$ bin/nutch readdb crawl/crawldb -dump crawldb-dump
CrawlDb dump: starting
CrawlDb db: crawl/crawldb
CrawlDb dump: done
[devalone@nutch nutch]$ ll crawldb-dump/
總用量 4
-rw-r--r-- 1 devalone devalone 744 7月 23 11:10 part-00000
[devalone@nutch nutch]$ cat crawldb-dump/part-00000
http://mil.sohu.com/ Version: 7
Status: 1 (db_unfetched)
Fetch time: Mon Jul 23 10:50:25 CST 2018
Modified time: Thu Jan 01 08:00:00 CST 1970
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 1.0
Signature: null
Metadata:
http://news.sohu.com/ Version: 7
Status: 1 (db_unfetched)
Fetch time: Mon Jul 23 10:50:25 CST 2018
Modified time: Thu Jan 01 08:00:00 CST 1970
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 1.0
Signature: null
Metadata:
https://blog.csdn.net/ Version: 7
Status: 1 (db_unfetched)
Fetch time: Mon Jul 23 10:50:25 CST 2018
Modified time: Thu Jan 01 08:00:00 CST 1970
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 1.0
Signature: null
Metadata:
The 3 initial seed URLs are present in the database, in good shape.
4.3 Step-by-Step: Fetching
-------------------------------------------------------------------------------------------------------------------------
4.3.1 Generating the fetch list
-------------------------------------------------------------------------------------------------------------------------
To run a fetch, first generate a fetch list from the crawl database with the nutch generate command.
Command summary:
generate: generate new segments to fetch from crawl db
Command usage:
Usage: Generator <crawldb> <segments_dir> [-force] [-topN N] [-numFetchers numFetchers] [-expr <expr>] \
[-adddays <numDays>] [-noFilter] [-noNorm] [-maxNumSegments <num>]
Generate the fetch list:
[devalone@nutch nutch]$ bin/nutch generate crawl/crawldb crawl/segments
Generator: starting at 2018-07-23 11:31:13
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: running in local mode, generating exactly one partition.
Generator: Partitioning selected urls for politeness.
Generator: segment: crawl/segments/20180723113116
Generator: finished at 2018-07-23 11:31:17, elapsed: 00:00:04
[devalone@nutch nutch]$ ll crawl
總用量 0
drwxrwxr-x 4 devalone devalone 32 7月 23 11:31 crawldb
drwxrwxr-x 3 devalone devalone 28 7月 23 11:31 segments
[devalone@nutch nutch]$ ll crawl/segments/
總用量 0
drwxrwxr-x 3 devalone devalone 28 7月 23 11:31 20180723113116
This generates a fetch list for all the pages due to be fetched. The fetch list is placed in a newly created segment directory, named
after its creation time. Save the segment name in the shell variable s1 for later use:
[devalone@nutch nutch]$ s1=`ls -d crawl/segments/2* | tail -1`
[devalone@nutch nutch]$ echo $s1
crawl/segments/20180723113116
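The capture above works because segment names are timestamps and therefore sort lexicographically. It can be wrapped in a tiny helper; a sketch, not part of Nutch itself:

```shell
# latest_segment SEGMENTS_DIR -> path of the newest segment directory.
# Relies on timestamp-named directories (yyyyMMddHHmmss) sorting
# lexicographically in ls output.
latest_segment() {
  ls -d "$1"/2* | tail -1
}
```

Usage: s1=$(latest_segment crawl/segments), equivalent to the command above.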
4.3.2 Running the fetch
-------------------------------------------------------------------------------------------------------------------------
Pages are fetched with the nutch fetch command.
Command summary:
fetch: fetch a segment's pages
Command usage:
Usage: Fetcher <segment> [-threads n]
Run the fetch, passing the segment name stored in the shell variable to the fetch command and keeping the default thread count:
[devalone@nutch nutch]$ bin/nutch fetch $s1
Fetcher: starting at 2018-07-23 11:53:50
Fetcher: segment: crawl/segments/20180723113116
Fetcher: threads: 10
Fetcher: time-out divisor: 2
...
Fetcher: finished at 2018-07-23 11:53:54, elapsed: 00:00:03
The fetch is done. The output shows that the default number of fetch threads is 10.
4.3.3 Running the parse
-------------------------------------------------------------------------------------------------------------------------
Fetched pages must be parsed before they can be stored in the database; the command is nutch parse.
Command summary:
parse: parse a segment's pages
Command usage:
Usage: ParseSegment segment [-noFilter] [-noNormalize]
Run the parse, passing the segment name to the parse command:
[devalone@nutch nutch]$ nutch parse $s1
ParseSegment: starting at 2018-07-23 12:06:29
ParseSegment: segment: crawl/segments/20180723113116
Parsed (234ms):http://mil.sohu.com/
Parsed (46ms):http://news.sohu.com/
Parsed (35ms):https://blog.csdn.net/
ParseSegment: finished at 2018-07-23 12:06:31, elapsed: 00:00:01
All 3 fetched pages are parsed.
4.3.4 Updating the crawldb
-------------------------------------------------------------------------------------------------------------------------
After the parse completes, the fetch results are written back into the crawldb; this is the last step of the crawl cycle. The command is nutch updatedb.
Command summary:
updatedb: update crawl db from segments after fetching
Command usage:
Usage: CrawlDb <crawldb> (-dir <segments> | <seg1> <seg2> ...) [-force] [-normalize] [-filter] [-noAdditions]
crawldb CrawlDb to update
-dir segments parent directory containing all segments to update from
seg1 seg2 ... list of segment names to update from
-force force update even if CrawlDb appears to be locked (CAUTION advised)
-normalize use URLNormalizer on urls in CrawlDb and segment (usually not needed)
-filter use URLFilters on urls in CrawlDb and segment
-noAdditions only update already existing URLs, don't add any newly discovered URLs
Update the crawldb, passing the crawldb directory and the segment name to the updatedb command:
[devalone@nutch nutch]$ nutch updatedb crawl/crawldb $s1
CrawlDb update: starting at 2018-07-23 12:15:17
CrawlDb update: db: crawl/crawldb
CrawlDb update: segments: [crawl/segments/20180723113116]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: false
CrawlDb update: URL filtering: false
CrawlDb update: 404 purging: false
CrawlDb update: Merging segment data into db.
CrawlDb update: finished at 2018-07-23 12:15:20, elapsed: 00:00:02
The update is complete. The crawldb now contains the updated information for all the original seed URL pages, plus the links newly
discovered on those pages. Take a quick look at the database contents with the readdb command (if an existing directory is used as the
output directory it must be deleted first, otherwise the command fails; this behavior is common to all Hadoop-related projects):
[devalone@nutch nutch]$ rm -rf crawldb-dump/
[devalone@nutch nutch]$ bin/nutch readdb crawl/crawldb -dump crawldb-dump
CrawlDb dump: starting
CrawlDb db: crawl/crawldb
CrawlDb dump: done
[devalone@nutch nutch]$ ll
[devalone@nutch nutch]$ ll crawldb-dump/
總用量 16
-rw-r--r-- 1 devalone devalone 15862 7月 23 12:22 part-00000
[devalone@nutch nutch]$ cat crawldb-dump/part-00000
http://mil.sohu.com/ Version: 7
Status: 2 (db_fetched)
Fetch time: Wed Aug 22 11:53:53 CST 2018
Modified time: Mon Jul 23 11:53:53 CST 2018
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 1.75
Signature: 89f36893e295bc0b08f1b1e36bb0c49d
Metadata:
_pst_=success(1), lastModified=0
_rs_=16
Content-Type=text/html
nutch.protocol.code=200
http://mil.sohu.com/1468 Version: 7
Status: 1 (db_unfetched)
Fetch time: Mon Jul 23 12:15:19 CST 2018
Modified time: Thu Jan 01 08:00:00 CST 1970
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 0.16666667
Signature: null
Metadata:
http://mil.sohu.com/1469 Version: 7
Status: 1 (db_unfetched)
Fetch time: Mon Jul 23 12:15:19 CST 2018
Modified time: Thu Jan 01 08:00:00 CST 1970
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 0.16666667
Signature: null
Metadata:
...
The database clearly holds far more than the earlier 3 records. Use readdb's -stats option to view database statistics:
[devalone@nutch nutch]$ bin/nutch readdb crawl/crawldb -stats
CrawlDb statistics start: crawl/crawldb
Statistics for CrawlDb: crawl/crawldb
TOTAL urls: 57
shortest fetch interval: 30 days, 00:00:00
avg fetch interval: 30 days, 00:00:00
longest fetch interval: 30 days, 00:00:00
earliest fetch time: Mon Jul 23 12:15:00 CST 2018
avg of fetch times: Wed Jul 25 02:07:00 CST 2018
latest fetch time: Wed Aug 22 11:53:00 CST 2018
retry 0: 57
score quantile 0.01: 0.010204081423580647
score quantile 0.05: 0.010204081423580647
score quantile 0.1: 0.010204081423580647
score quantile 0.2: 0.010204081423580647
score quantile 0.25: 0.010204081423580647
score quantile 0.3: 0.016326530277729012
score quantile 0.4: 0.020408162847161293
score quantile 0.5: 0.020408162847161293
score quantile 0.6: 0.030612245202064514
score quantile 0.7: 0.030612245202064514
score quantile 0.75: 0.030612245202064514
score quantile 0.8: 0.0346938773989678
score quantile 0.9: 0.1666666716337204
score quantile 0.95: 0.750765299797057
score quantile 0.99: 1.7033333361148832
min score: 0.010204081423580647
avg score: 0.10526320808812191
max score: 1.75
status 1 (db_unfetched): 54
status 2 (db_fetched): 3
CrawlDb statistics: done
The crawldb now holds 57 URLs in total. 3 of them have status 2 (db_fetched): the 3 URLs from the original seed list have been
fetched. The other 54 have status 1 (db_unfetched): links newly discovered on the original 3 pages that have not yet been fetched.
4.3.5 Running a second fetch cycle
-------------------------------------------------------------------------------------------------------------------------
To gather enough test data, run the fetch cycle again, this time generating and fetching the top-scoring 1000 pages. The database
actually holds only 54 unfetched pages at this point, so the option has no effect on segment generation here, but it does no harm to include it.
[devalone@nutch nutch]$ nutch generate crawl/crawldb crawl/segments -topN 1000
[devalone@nutch nutch]$ s2=`ls -d crawl/segments/2* | tail -1`
[devalone@nutch nutch]$ echo $s2
crawl/segments/20180723124830
[devalone@nutch nutch]$ nutch fetch $s2
Fetcher: starting at 2018-07-23 12:50:27
Fetcher: segment: crawl/segments/20180723124830
Fetcher: threads: 10
...
Fetcher: finished at 2018-07-23 12:54:37, elapsed: 00:04:10
Elapsed time: 00:04:10.
[devalone@nutch nutch]$ nutch parse $s2
[devalone@nutch nutch]$ nutch updatedb crawl/crawldb $s2
Check the database status:
[devalone@nutch nutch]$ bin/nutch readdb crawl/crawldb -stats
CrawlDb statistics start: crawl/crawldb
Statistics for CrawlDb: crawl/crawldb
TOTAL urls: 1361
shortest fetch interval: 30 days, 00:00:00
avg fetch interval: 30 days, 00:15:52
longest fetch interval: 45 days, 00:00:00
earliest fetch time: Mon Jul 23 12:57:00 CST 2018
avg of fetch times: Tue Jul 24 18:51:00 CST 2018
latest fetch time: Thu Sep 06 12:50:00 CST 2018
retry 0: 1360
retry 1: 1
score quantile 0.01: 1.6222300565069806E-4
score quantile 0.05: 2.0923737523844465E-4
score quantile 0.1: 2.170994079538754E-4
score quantile 0.2: 3.1386775372084226E-4
score quantile 0.25: 3.2318823286914267E-4
score quantile 0.3: 3.3286939081000655E-4
score quantile 0.4: 4.436204064404592E-4
score quantile 0.5: 6.01460276577141E-4
score quantile 0.6: 6.462093988934962E-4
score quantile 0.7: 8.13645781966409E-4
score quantile 0.75: 9.000660493791871E-4
score quantile 0.8: 9.7666619066149E-4
score quantile 0.9: 0.0013502709909944855
score quantile 0.95: 0.0023547878954559565
score quantile 0.99: 0.03270121983119422
min score: 1.0855405707843602E-4
avg score: 0.005894269950244464
max score: 1.8964647054672241
status 1 (db_unfetched): 1305
status 2 (db_fetched): 55
status 3 (db_gone): 1
CrawlDb statistics: done
The URL total has grown to 1361: 55 fetched, 1305 newly discovered and not yet fetched, and 1 in status 3 (db_gone).
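The fetched/unfetched split can be pulled out of the readdb -stats report mechanically. A small sketch, assuming the stats are captured as text; the sample below hard-codes the counts from the run above, but in practice you would pipe `nutch readdb crawl/crawldb -stats` into the awk filter instead.

```shell
# Summarize a "nutch readdb ... -stats" report: fetched count vs. total.
stats='TOTAL urls: 1361
status 1 (db_unfetched): 1305
status 2 (db_fetched): 55
status 3 (db_gone): 1'

summary=$(echo "$stats" | awk -F':' '
  /TOTAL urls/ { total   = $2 }
  /db_fetched/ { fetched = $2 }
  END { printf "fetched %d of %d (%.1f%%)", fetched, total, 100 * fetched / total }')
echo "$summary"
```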
4.3.6 Running the Third Crawl Cycle
-------------------------------------------------------------------------------------------------------------------------
The database now holds 1305 unfetched URLs, so -topN 1000 takes effect here, limiting the generated segment to the
top-scoring 1000 URLs.
[devalone@nutch nutch]$ nutch generate crawl/crawldb crawl/segments -topN 1000
[devalone@nutch nutch]$ ll crawl/segments/
總用量 0
drwxrwxr-x 8 devalone devalone 117 7月 23 12:06 20180723113116
drwxrwxr-x 8 devalone devalone 117 7月 23 12:56 20180723124830
drwxrwxr-x 3 devalone devalone 28 7月 23 13:04 20180723130401
[devalone@nutch nutch]$ s3=`ls -d crawl/segments/2* | tail -1`
[devalone@nutch nutch]$ echo $s3
crawl/segments/20180723130401
[devalone@nutch nutch]$ nutch fetch $s3
Fetcher: starting at 2018-07-23 13:06:07
Fetcher: segment: crawl/segments/20180723130401
Fetcher: threads: 10
...
Fetcher: finished at 2018-07-23 14:32:43, elapsed: 01:26:36
This fetch took 01:26:36.
NOTE:
---------------------------------------------------------------------------------------------------------------------
Long-running operations like this can be combined with the nohup command, for example:
[devalone@nutch nutch]$ nohup nutch fetch $s3 &
[devalone@nutch nutch]$ nutch parse $s3
[devalone@nutch nutch]$ nutch updatedb crawl/crawldb $s3
[devalone@nutch nutch]$ nutch readdb crawl/crawldb -stats
CrawlDb statistics start: crawl/crawldb
Statistics for CrawlDb: crawl/crawldb
TOTAL urls: 18442
shortest fetch interval: 30 days, 00:00:00
avg fetch interval: 30 days, 00:01:10
longest fetch interval: 45 days, 00:00:00
earliest fetch time: Mon Jul 23 12:57:00 CST 2018
avg of fetch times: Wed Jul 25 07:35:00 CST 2018
latest fetch time: Thu Sep 06 12:50:00 CST 2018
retry 0: 18433
retry 1: 9
score quantile 0.01: 3.363983068993548E-6
score quantile 0.05: 4.0650803027780635E-6
score quantile 0.1: 5.252996839943921E-6
score quantile 0.2: 8.687948599238534E-6
score quantile 0.25: 1.0351363643525419E-5
score quantile 0.3: 1.1638136404260567E-5
score quantile 0.4: 1.5087516350721173E-5
score quantile 0.5: 1.956822216838666E-5
score quantile 0.6: 2.4080784436340625E-5
score quantile 0.7: 2.9829692724346616E-5
score quantile 0.75: 3.42503072351974E-5
score quantile 0.8: 4.092971121440248E-5
score quantile 0.9: 8.750581586759325E-5
score quantile 0.95: 4.3720556372452434E-4
score quantile 0.99: 0.001597540448419741
min score: 0.0
avg score: 4.8001176325052655E-4
max score: 1.8964647054672241
status 1 (db_unfetched): 17394
status 2 (db_fetched): 1039
status 3 (db_gone): 1
status 4 (db_redir_temp): 2
status 5 (db_redir_perm): 6
CrawlDb statistics: done
The URL total has grown to 18442: 1039 fetched, 17394 newly discovered and not yet fetched, 1 in status 3 (db_gone), plus
2 temporary and 6 permanent redirects.
■ Verifying that nutch fetched URL pages over the https protocol
-------------------------------------------------------------------------------------------------------------------------
Dump the URLs whose status is db_fetched:
[devalone@nutch nutch]$ nutch readdb crawl/crawldb/ -dump crawldb-dump -status db_fetched
CrawlDb dump: starting
CrawlDb db: crawl/crawldb/
CrawlDb dump: done
[devalone@nutch nutch]$ ll crawldb-dump/part-00000
-rw-r--r-- 1 devalone devalone 437326 7月 24 16:12 crawldb-dump/part-00000
Inspect the dumped database content:
[devalone@nutch nutch]$ cat crawldb-dump/part-00000
https://blog.csdn.net/10km/article/details/81147355 Version: 7
Status: 2 (db_fetched)
Fetch time: Wed Aug 22 13:32:49 CST 2018
Modified time: Mon Jul 23 13:32:49 CST 2018
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 6.664876E-4
Signature: b59a0e8cd511f1090489db04f3518a2b
Metadata:
_pst_=success(1), lastModified=1531722549000
_rs_=147
Content-Type=text/html
nutch.protocol.code=200
https://blog.csdn.net/A_B_C_D_E______/article/details/81142260 Version: 7
Status: 2 (db_fetched)
Fetch time: Wed Aug 22 13:11:52 CST 2018
Modified time: Mon Jul 23 13:11:52 CST 2018
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 3.4539928E-4
Signature: 97a758b6c2f999204168cbc817dd1a88
Metadata:
_pst_=success(1), lastModified=1531722549000
_rs_=156
Content-Type=text/html
nutch.protocol.code=200
https://blog.csdn.net/Ancony_/article/details/81141073 Version: 7
Status: 2 (db_fetched)
Fetch time: Wed Aug 22 14:17:03 CST 2018
Modified time: Mon Jul 23 14:17:03 CST 2018
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 3.4383315E-4
Signature: f9050a1c2ff299c7313d052114a4115d
Metadata:
_pst_=success(1), lastModified=1531722549000
_rs_=175
Content-Type=text/html
nutch.protocol.code=200
...
As shown, the database contains a good number of fetched https URLs, which means the default plugin.includes setting handles
both http and https URLs; there is no need to additionally include protocol-httpclient.
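Counting the fetched URLs per protocol makes the same point numerically. A sketch; the three sample lines stand in for the crawldb-dump/part-00000 file produced above.

```shell
# Tally dumped crawldb entries by URL scheme (http vs https).
# In practice, replace the sample with:  grep 'Version:' crawldb-dump/part-00000
sample='http://example.com/a	Version: 7
https://blog.csdn.net/x	Version: 7
https://blog.csdn.net/y	Version: 7'

echo "$sample" | grep -oE '^https?' | sort | uniq -c | sort -rn
```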
● NOTE:
---------------------------------------------------------------------------------------------------------------------
In fact, the value of plugin.includes is a regular expression; the default configuration file nutch-default.xml sets it to:
<property>
<name>plugin.includes</name>
<value>protocol-http|urlfilter-(regex|validator)|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|\
urlnormalizer-(pass|regex|basic)</value>
<description>Regular expression naming plugin directory names to
include. Any plugin not matching this expression is excluded.
In any case you need at least include the nutch-extensionpoints plugin. By
default Nutch includes crawling just HTML and plain text via HTTP,
and basic indexing and search plugins. In order to use HTTPS please enable
protocol-httpclient, but be aware of possible intermittent problems with the
underlying commons-httpclient library. Set parsefilter-naivebayes for classification based focused crawler.
</description>
</property>
The protocol-http| part of the pattern already matches the protocol-httpclient string. So by default nutch can load and use
the protocol-httpclient plugin to handle the https protocol; only the property's description (the <description> element),
which says protocol-httpclient must be enabled for HTTPS, is confusing. That text appears to be a documentation bug left over
from earlier versions; beyond confusing users it does not affect how nutch runs, so it can be ignored.
4.4 Step-by-Step: Deleting Duplicates
-------------------------------------------------------------------------------------------------------------------------
After the fetching is done, duplicate URLs must be dealt with to ensure that URLs in the crawldb are unique, which in turn
guarantees URL uniqueness at indexing time. Duplicates are removed with the nutch dedup command.
Command summary:
dedup: deduplicate entries in the crawldb and give them a special status
Detailed description:
-------------------------------------------------------------------------------------------------------------------------
The dedup command is an alias for the class org.apache.nutch.crawl.DeduplicationJob and is available since Nutch 1.8 (not
yet in Nutch 2.x as of May 2014).
This command takes a path to a crawldb as parameter and finds duplicates based on the signature. If several entries share
the same signature, the one with the highest score is kept. If the scores are the same, then the fetch time is used to
determine which one to keep with the most recent one being kept. If their fetch times are the same we keep the one with the
shortest URL. The entries which are not kept have their status changed to STATUS_DB_DUPLICATE, this is then used by the
Cleaning and Indexing jobs to delete the corresponding documents in the backends (SOLR, Elasticsearch).
The dedup command replaces the SOLR dedup command which was limited to SOLR and was a lot less efficient.
Starting with Nutch 1.8, the dedup command replaces the SOLR dedup command, with somewhat different usage. It should be run
before building the Solr index, to guarantee URL uniqueness both in the crawldb and in the Solr index.
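The keep rule described above (same signature → highest score wins, ties broken by most recent fetch time, then shortest URL) can be illustrated with a toy awk filter. This is only an illustration of the rule, not Nutch code; the signature/score/time/URL rows are made up.

```shell
# Toy data: signature, score, fetch time, URL (one candidate per line).
entries='sig1 0.5 100 http://a.example/long/path
sig1 0.5 200 http://a.example/b
sig1 0.9 50 http://a.example/c
sig2 0.1 10 http://b.example/'

# For each signature keep the winner; every other entry would be
# marked STATUS_DB_DUPLICATE by the real DeduplicationJob.
winners=$(echo "$entries" | awk '{
  sig = $1; score = $2 + 0; t = $3 + 0; len = length($4)
  keep = !(sig in s) || score > s[sig] ||
         (score == s[sig] && (t > ft[sig] || (t == ft[sig] && len < ul[sig])))
  if (keep) { s[sig] = score; ft[sig] = t; ul[sig] = len; url[sig] = $4 }
}
END { for (sig in url) print sig, url[sig] }' | sort)
echo "$winners"
```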
Command usage:
Usage: DeduplicationJob <crawldb> [-group <none|host|domain>] [-compareOrder <score>,<fetchTime>,<urlLength>]
Remove the duplicates:
[devalone@nutch nutch]$ nutch dedup crawl/crawldb
DeduplicationJob: starting at 2018-07-24 12:10:16
Deduplication: 1 documents marked as duplicates
Deduplication: Updating status of duplicate urls into crawl db.
Deduplication finished at 2018-07-24 12:10:22, elapsed: 00:00:05
Check the state of the crawldb afterwards:
[devalone@nutch nutch]$ nutch readdb crawl/crawldb -stats
CrawlDb statistics start: crawl/crawldb
Statistics for CrawlDb: crawl/crawldb
TOTAL urls: 18442
shortest fetch interval: 30 days, 00:00:00
avg fetch interval: 30 days, 00:01:10
longest fetch interval: 45 days, 00:00:00
earliest fetch time: Mon Jul 23 12:57:00 CST 2018
avg of fetch times: Wed Jul 25 07:35:00 CST 2018
latest fetch time: Thu Sep 06 12:50:00 CST 2018
retry 0: 18433
retry 1: 9
score quantile 0.01: 3.363983068993548E-6
score quantile 0.05: 4.0650803027780635E-6
score quantile 0.1: 5.252996839943921E-6
score quantile 0.2: 8.687948599238534E-6
score quantile 0.25: 1.0351363643525419E-5
score quantile 0.3: 1.1638136404260567E-5
score quantile 0.4: 1.5087516350721173E-5
score quantile 0.5: 1.956822216838666E-5
score quantile 0.6: 2.4080784436340625E-5
score quantile 0.7: 2.9829692724346616E-5
score quantile 0.75: 3.42503072351974E-5
score quantile 0.8: 4.092971121440248E-5
score quantile 0.9: 8.750581586759325E-5
score quantile 0.95: 4.3720556372452434E-4
score quantile 0.99: 0.001597540448419741
min score: 0.0
avg score: 4.8001176325052655E-4
max score: 1.8964647054672241
status 1 (db_unfetched): 17394
status 2 (db_fetched): 1038
status 3 (db_gone): 1
status 4 (db_redir_temp): 2
status 5 (db_redir_perm): 6
status 7 (db_duplicate): 1
CrawlDb statistics: done
The fetched URL count has dropped to 1038, and 1 duplicate URL now appears as status 7 (db_duplicate).
4.5 Step-by-Step: Invertlinks
-------------------------------------------------------------------------------------------------------------------------
Before indexing, all links must first be inverted so that anchor text can be indexed with the pages it points to. Links are
inverted with the invertlinks command.
Command summary:
invertlinks:create a linkdb from parsed segments
Command usage:
Usage: LinkDb <linkdb> (-dir <segmentsDir> | <seg1> <seg2> ...) [-force] [-noNormalize] [-noFilter]
linkdb output LinkDb to create or update
-dir segmentsDir parent directory of several segments, OR
seg1 seg2 ... list of segment directories
-force force update even if LinkDb appears to be locked (CAUTION advised)
-noNormalize don't normalize link URLs
-noFilter don't apply URLFilters to link URLs
Invert the links, specifying crawl/linkdb as the link database and crawl/segments as the segments directory:
[devalone@nutch nutch]$ nutch invertlinks crawl/linkdb -dir crawl/segments
LinkDb: starting at 2018-07-23 15:19:10
LinkDb: linkdb: crawl/linkdb
LinkDb: URL normalize: true
LinkDb: URL filter: true
LinkDb: internal links will be ignored.
LinkDb: adding segment: file:/opt/nutch/apache-nutch-1.14/crawl/segments/20180723113116
LinkDb: adding segment: file:/opt/nutch/apache-nutch-1.14/crawl/segments/20180723124830
LinkDb: adding segment: file:/opt/nutch/apache-nutch-1.14/crawl/segments/20180723130401
LinkDb: finished at 2018-07-23 15:19:13, elapsed: 00:00:02
4.6 Step-by-Step: Indexing into Apache Solr
-------------------------------------------------------------------------------------------------------------------------
Solr should already be installed before performing this step.
NOTE:
---------------------------------------------------------------------------------------------------------------------
Section 6 below walks through installing Apache Solr; for now, assume it is already installed.
Indexing into Apache Solr uses the index command.
Command summary:
index: run the plugin-based indexer on parsed segments and linkdb
Command usage:
Usage: Indexer <crawldb> [-linkdb <linkdb>] [-params k1=v1&k2=v2...] (<segment> ... | -dir <segments>) [-noCommit]\
[-deleteGone] [-filter] [-normalize] [-addBinaryContent] [-base64]
Active IndexWriters :
SOLRIndexWriter
solr.server.url : URL of the SOLR instance
solr.zookeeper.hosts : URL of the Zookeeper quorum
solr.commit.size : buffer size when sending to SOLR (default 1000)
solr.mapping.file : name of the mapping file for fields (default solrindex-mapping.xml)
solr.auth : use authentication (default false)
solr.auth.username : username for authentication
solr.auth.password : password for authentication
Build the index, passing the Solr server URL to the index command as a Java property (-D key=value):
-Dsolr.server.url=http://localhost:8983/solr/nutch
The run looks like this:
-------------------------------------------------------------------------------------------------------------------------
[devalone@nutch nutch]$ bin/nutch index -Dsolr.server.url=http://localhost:8983/solr/nutch crawl/crawldb -linkdb \
crawl/linkdb -dir crawl/segments/ -filter -normalize -deleteGone
Segment dir is complete: file:/opt/nutch/apache-nutch-1.14/crawl/segments/20180723113116.
Segment dir is complete: file:/opt/nutch/apache-nutch-1.14/crawl/segments/20180723124830.
Segment dir is complete: file:/opt/nutch/apache-nutch-1.14/crawl/segments/20180723130401.
Indexer: starting at 2018-07-23 17:56:53
Indexer: deleting gone documents: true
Indexer: URL filtering: true
Indexer: URL normalizing: true
Active IndexWriters :
SOLRIndexWriter
solr.server.url : URL of the SOLR instance
solr.zookeeper.hosts : URL of the Zookeeper quorum
solr.commit.size : buffer size when sending to SOLR (default 1000)
solr.mapping.file : name of the mapping file for fields (default solrindex-mapping.xml)
solr.auth : use authentication (default false)
solr.auth.username : username for authentication
solr.auth.password : password for authentication
Indexing 250/250 documents
Deleting 0 documents
SolrIndexer: deleting 1/1 documents
Indexing 250/500 documents
Deleting 0 documents
Indexing 250/750 documents
Deleting 0 documents
SolrIndexer: deleting 4/5 documents
Indexing 250/1000 documents
Deleting 0 documents
SolrIndexer: deleting 4/9 documents
Indexing 27/1027 documents
Deleting 0 documents
Indexer: number of documents indexed, deleted, or skipped:
Indexer: 1 deleted (gone)
Indexer: 8 deleted (redirects)
Indexer: 1027 indexed (add/update)
Indexer: finished at 2018-07-23 17:57:14, elapsed: 00:00:21
Once the index is built, searches can be run at:
http://localhost:8983/solr/#/nutch/query
or via the host name:
http://nutch.sansovo.org:8983/solr/#/nutch/query
4.7 Step-by-Step: Cleaning Solr
-------------------------------------------------------------------------------------------------------------------------
The last of the step-by-step tasks is cleaning the crawldb and the Solr index. This is also a routine operation after
updating the crawldb. Cleaning uses the nutch clean command.
Command summary:
clean :remove HTTP 301 and 404 documents and duplicates from indexing backends configured via plugins
Command usage:
Usage: CleaningJob <crawldb> [-noCommit]
Active IndexWriters :
SOLRIndexWriter
solr.server.url : URL of the SOLR instance
solr.zookeeper.hosts : URL of the Zookeeper quorum
solr.commit.size : buffer size when sending to SOLR (default 1000)
solr.mapping.file : name of the mapping file for fields (default solrindex-mapping.xml)
solr.auth : use authentication (default false)
solr.auth.username : username for authentication
solr.auth.password : password for authentication
The Java class behind the clean command is CleaningJob. It scans the crawldb for URL entries whose status is DB_GONE
(404/301) and sends delete requests to Solr; Solr removes those documents promptly, which keeps the index healthy.
Clean Solr:
[devalone@nutch nutch]$ nutch clean crawl/crawldb -Dsolr.server.url=http://nutch.sansovo.org:8983/solr/nutch
SolrIndexer: deleting 2/2 documents
SolrIndexer: deleting 2/2 documents
...
5. Using the crawl script
-------------------------------------------------------------------------------------------------------------------------
The crawl script bundles the individual steps above into one executable script, repeatedly running the sequence:
inject->generate->fetch->parse->updatedb
For small-scale crawls (anything short of the whole web), the crawl command alone can handle the entire job.
Command usage:
Usage: crawl [-i|--index] [-D "key=value"] [-w|--wait] [-s <Seed Dir>] <Crawl Dir> <Num Rounds>
-i|--index Indexes crawl results into a configured indexer
-D A Java property to pass to Nutch calls
-w|--wait NUMBER[SUFFIX] Time to wait before generating a new segment when no URLs
are scheduled for fetching. Suffix can be: s for second,
m for minute, h for hour and d for day. If no suffix is
specified second is used by default.
-s Seed Dir Path to seeds file(s)
Crawl Dir Directory where the crawl/link/segments dirs are saved
Num Rounds The number of rounds to run this crawl for
The final argument, <Num Rounds>, is the number of crawl iterations to run. The larger it is, the more rounds are executed,
the more URL pages are fetched, and the greater the crawl depth and breadth, with resource and time costs growing steeply
from round to round. In a test environment it should therefore be kept small; 2-3 rounds are enough.
5.1 Preparing the Test Environment
-------------------------------------------------------------------------------------------------------------------------
For a clean test environment, use a fresh testcrawl directory as <Crawl Dir>, and create a separate Solr core named crawl:
[devalone@nutch solr]$ cd server/solr/configsets/
[devalone@nutch configsets]$ cp -r nutch crawl
[devalone@nutch solr]$ ll
總用量 12
drwxr-xr-x 7 devalone devalone 122 7月 24 14:05 configsets
drwxrwxr-x 2 devalone devalone 6 7月 24 14:03 crawl
drwxrwxr-x 4 devalone devalone 53 7月 23 17:19 nutch
-rw-r--r-- 1 devalone devalone 3037 5月 18 17:59 README.txt
-rw-r--r-- 1 devalone devalone 2117 2月 22 05:53 solr.xml
-rw-r--r-- 1 devalone devalone 975 2月 22 05:53 zoo.cfg
[devalone@nutch solr]$ cd $SOLR_INSTALL
Run a delete first to clean up the environment:
[devalone@nutch solr]$ solr delete -c crawl
Deleting core 'crawl' using command:
http://localhost:8983/solr/admin/cores?action=UNLOAD&core=crawl&deleteIndex=true&deleteDataDir=true&deleteInstanceDir=true
{"responseHeader":{
"status":0,
"QTime":9}}
Create the crawl core:
[devalone@nutch solr]$ bin/solr create -c crawl -d server/solr/configsets/crawl/conf/
Copying configuration to new core instance directory:
/opt/solr/solr/server/solr/crawl
Creating new core 'crawl' using command:
http://localhost:8983/solr/admin/cores?action=CREATE&name=crawl&instanceDir=crawl
{
"responseHeader":{
"status":0,
"QTime":387},
"core":"crawl"}
Visiting http://nutch.sansovo.org:8983/solr/#/crawl/query and running a query should return nothing, since the newly created
crawl core has no index yet.
The Solr server URL is then:
-Dsolr.server.url=http://localhost:8983/solr/crawl/
or:
-Dsolr.server.url=http://nutch.sansovo.org:8983/solr/crawl/
5.2 Running the crawl Script
-------------------------------------------------------------------------------------------------------------------------
Still using the seed file urls/seed.txt defined earlier, set <Num Rounds> to 3.
Since the run may take quite a while, launch the crawl script under nohup:
[devalone@nutch nutch]$ nohup crawl -i -D solr.server.url=http://nutch.sansovo.org:8983/solr/crawl/ -s urls/ testcrawl 3 &
[1] 16512
[devalone@nutch nutch]$ nohup: 忽略輸入並把輸出追加到"nohup.out"
nohup moves the job to the background and appends its output to nohup.out in the current directory:
[devalone@nutch nutch]$ ll nohup.out
-rw------- 1 devalone devalone 21470 7月 24 14:36 nohup.out
[devalone@nutch nutch]$ ll nohup.out
-rw------- 1 devalone devalone 65054 7月 24 14:41 nohup.out
The nohup.out log file keeps growing.
Check the latest progress:
[devalone@nutch nutch]$ tail nohup.out
-activeThreads=50, spinWaiting=50, fetchQueues.totalSize=815, fetchQueues.getQueueCount=1
FetcherThread 48 fetching https://blog.csdn.net/X8i0Bev/article/details/81091490 (queue crawl delay=5000ms)
-activeThreads=50, spinWaiting=50, fetchQueues.totalSize=814, fetchQueues.getQueueCount=1
-activeThreads=50, spinWaiting=50, fetchQueues.totalSize=814, fetchQueues.getQueueCount=1
-activeThreads=50, spinWaiting=50, fetchQueues.totalSize=814, fetchQueues.getQueueCount=1
-activeThreads=50, spinWaiting=50, fetchQueues.totalSize=814, fetchQueues.getQueueCount=1
-activeThreads=50, spinWaiting=50, fetchQueues.totalSize=814, fetchQueues.getQueueCount=1
FetcherThread 48 fetching https://blog.csdn.net/h8y0bdjvukwe1lbozle/article/details/81117214 (queue crawl delay=5000ms)
-activeThreads=50, spinWaiting=50, fetchQueues.totalSize=813, fetchQueues.getQueueCount=1
-activeThreads=50, spinWaiting=50, fetchQueues.totalSize=813, fetchQueues.getQueueCount=1
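Progress can also be extracted from the fetcher status lines instead of eyeballing them. A sketch; the sample variable stands in for the live nohup.out above (tailing the real file would feed the real thing).

```shell
# Pull the remaining fetch-queue size out of the latest fetcher status line.
# In practice:  tail -n 50 nohup.out  instead of the sample below.
log='-activeThreads=50, spinWaiting=50, fetchQueues.totalSize=815, fetchQueues.getQueueCount=1
FetcherThread 48 fetching https://blog.csdn.net/X (queue crawl delay=5000ms)
-activeThreads=50, spinWaiting=50, fetchQueues.totalSize=814, fetchQueues.getQueueCount=1'

remaining=$(echo "$log" | grep -o 'totalSize=[0-9]*' | tail -1 | cut -d= -f2)
echo "URLs still queued: $remaining"
```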
Inspect the database directory:
[devalone@nutch nutch]$ ll testcrawl/
總用量 0
drwxrwxr-x 4 devalone devalone 32 7月 24 14:40 crawldb
drwxrwxr-x 3 devalone devalone 21 7月 24 14:40 linkdb
drwxrwxr-x 5 devalone devalone 72 7月 24 14:40 segments
The database directory testcrawl was created automatically, with nutch's three databases inside it: crawldb, linkdb and
segments.
Check on the background nohup crawl job:
[devalone@nutch nutch]$ jobs
[1]+ 運行中 nohup crawl -i -D solr.server.url=http://nutch.sansovo.org:8983/solr/crawl/ -s urls/ testcrawl 3 &
Still running...
Even while the script is running, opening the Solr link http://nutch.sansovo.org:8983/solr/#/crawl/query
and searching already returns results, which means the crawl script has completed at least one round.
...
When the run completes:
[devalone@nutch nutch]$ jobs -l
[1]+ 16512 完成 nohup crawl -i -D solr.server.url=http://nutch.sansovo.org:8983/solr/crawl/ -s urls/ testcrawl 3 &
[devalone@nutch nutch]$ head nohup.out
Injecting seed URLs
/opt/nutch/nutch/bin/nutch inject testcrawl//crawldb urls/
Injector: starting at 2018-07-24 14:35:11
Injector: crawlDb: testcrawl/crawldb
Injector: urlDir: urls
[devalone@nutch nutch]$ tail nohup.out
Indexer: number of documents indexed, deleted, or skipped:
Indexer: 1057 indexed (add/update)
Indexer: finished at 2018-07-24 16:15:54, elapsed: 00:00:08
Cleaning up index if possible
/opt/nutch/nutch/bin/nutch clean -Dsolr.server.url=http://nutch.sansovo.org:8983/solr/crawl/ testcrawl/crawldb
SolrIndexer: deleting 3/3 documents
2018年 07月 24日 星期二 16:15:57 CST : Finished loop with 3 iterations
Comparing the start and end times, the whole run took 1:40:43, indexed 1057 URLs into Solr, and finished with a Solr
cleaning step.
Open the Solr link http://nutch.sansovo.org:8983/solr/#/crawl/query and search again to get more results.
6. Setup Solr for search
-------------------------------------------------------------------------------------------------------------------------
Each Nutch release is developed and tested against a specific Solr version, although a nearby version can also be tried.
The Nutch/Solr version pairings are:
+-------+-----------+
| Nutch | Solr |
+-------+-----------+
| 1.14 | 6.6.0 |
+-------+-----------+
| 1.13 | 5.5.0 |
+-------+-----------+
| 1.12 | 5.4.1 |
+-------+-----------+
6.1 Installing Solr in Standalone Mode
-------------------------------------------------------------------------------------------------------------------------
① Download the Solr binary distribution
Download the desired version from http://archive.apache.org/dist/lucene/solr/ ; here, version 6.6.5 is taken from the
Tsinghua mirror:
[devalone@nutch solr]$ wget https://mirrors.tuna.tsinghua.edu.cn/apache/lucene/solr/6.6.5/solr-6.6.5.tgz
② Unpack it into a suitable installation directory, /opt/solr/ in this example:
[devalone@nutch solr]$ tar -zxvf solr-6.6.5.tgz
[devalone@nutch solr]$ pwd
/opt/solr
[devalone@nutch solr]$ ll
總用量 0
drwxrwxr-x 9 devalone devalone 201 7月 23 16:30 solr-6.6.5
[devalone@nutch solr]$ ll solr-6.6.5/
總用量 1412
drwxr-xr-x 3 devalone devalone 147 6月 30 13:28 bin
-rw-r--r-- 1 devalone devalone 693983 6月 30 13:28 CHANGES.txt
drwxr-xr-x 12 devalone devalone 202 6月 30 14:22 contrib
drwxrwxr-x 4 devalone devalone 4096 7月 23 16:30 dist
drwxrwxr-x 3 devalone devalone 38 7月 23 16:30 docs
drwxr-xr-x 7 devalone devalone 105 7月 23 16:30 example
drwxr-xr-x 2 devalone devalone 24576 7月 23 16:30 licenses
-rw-r--r-- 1 devalone devalone 12646 2月 22 05:53 LICENSE.txt
-rw-r--r-- 1 devalone devalone 648213 6月 30 13:28 LUCENE_CHANGES.txt
-rw-r--r-- 1 devalone devalone 25549 5月 18 17:59 NOTICE.txt
-rw-r--r-- 1 devalone devalone 7271 5月 18 17:59 README.txt
drwxr-xr-x 10 devalone devalone 157 7月 23 16:30 server
Create a symbolic link for convenience:
[devalone@nutch solr]$ sudo ln -s solr-6.6.5 solr
[devalone@nutch solr]$ ll
總用量 0
lrwxrwxrwx 1 root root 10 7月 23 16:34 solr -> solr-6.6.5
drwxrwxr-x 9 devalone devalone 201 7月 23 16:30 solr-6.6.5
[devalone@nutch solr]$ cd solr
[devalone@nutch solr]$ pwd
/opt/solr/solr
Edit /etc/profile to create a SOLR_INSTALL environment variable and add $SOLR_INSTALL/bin to PATH:
[devalone@nutch solr]$ sudo vi /etc/profile
SOLR_INSTALL=/opt/solr/solr
PATH=$SOLR_INSTALL/bin:$PATH
export PATH JAVA_HOME ANT_HOME NUTCH_INSTALL SOLR_INSTALL
● NOTE:
---------------------------------------------------------------------------------------------------------------------
Solr already uses the SOLR_HOME environment variable for its own purposes, so do not use SOLR_HOME to point at the
install directory; otherwise starting Solr fails with an error like:
[devalone@nutch solr]$ bin/solr start
Solr home directory /opt/solr-6.6.0 must contain a solr.xml file!
This example uses SOLR_INSTALL for the install directory instead, and runs without problems.
Save and exit, then run:
[devalone@nutch solr]$ source /etc/profile
to make it take effect immediately.
[devalone@nutch solr]$ solr -version
6.6.5
OK.
③ Create the resources for a new nutch Solr core:
[devalone@nutch solr]$ cd server/solr/configsets/
[devalone@nutch configsets]$ cp -r basic_configs nutch
④ Copy nutch's schema.xml into the ${SOLR_INSTALL}/server/solr/configsets/nutch/conf directory:
[devalone@nutch solr]$ cp $NUTCH_INSTALL/conf/schema.xml $SOLR_INSTALL/server/solr/configsets/nutch/conf
[devalone@nutch solr]$ ll $SOLR_INSTALL/server/solr/configsets/nutch/conf/
總用量 168
-rw-r--r-- 1 devalone devalone 3974 7月 23 16:49 currency.xml
-rw-r--r-- 1 devalone devalone 1412 7月 23 16:49 elevate.xml
drwxr-xr-x 2 devalone devalone 4096 7月 23 16:49 lang
-rw-r--r-- 1 devalone devalone 58074 7月 23 16:49 managed-schema
-rw-r--r-- 1 devalone devalone 308 7月 23 16:49 params.json
-rw-r--r-- 1 devalone devalone 873 7月 23 16:49 protwords.txt
-rw-rw-r-- 1 devalone devalone 22956 7月 23 16:55 schema.xml
-rw-r--r-- 1 devalone devalone 55039 7月 23 16:49 solrconfig.xml
-rw-r--r-- 1 devalone devalone 781 7月 23 16:49 stopwords.txt
-rw-r--r-- 1 devalone devalone 1124 7月 23 16:49 synonyms.txt
⑤ Make sure there is no managed-schema file left in $SOLR_INSTALL/server/solr/configsets/nutch/conf/:
[devalone@nutch solr]$ rm $SOLR_INSTALL/server/solr/configsets/nutch/conf/managed-schema
⑥ Start the Solr server:
[devalone@nutch solr]$ solr start
NOTE: Please install lsof as this script needs it to determine if Solr is listening on port 8983.
Started Solr server on port 8983 (pid=5425). Happy searching!
⑦ Create the nutch core:
[devalone@nutch solr]$ bin/solr create -c nutch -d server/solr/configsets/nutch/conf/
Copying configuration to new core instance directory:
/opt/solr/solr/server/solr/nutch
Creating new core 'nutch' using command:
http://localhost:8983/solr/admin/cores?action=CREATE&name=nutch&instanceDir=nutch
{
"responseHeader":{
"status":0,
"QTime":1399},
"core":"nutch"}
⑧ Append the core name to the Solr server URL:
-Dsolr.server.url=http://localhost:8983/solr/nutch
or:
-Dsolr.server.url=http://nutch.sansovo.org:8983/solr/nutch/
● NOTE:
---------------------------------------------------------------------------------------------------------------------
This string is passed to nutch commands (the nutch index command, for example) to tell them where the Solr server is, so
that the index is built for the Solr core named nutch. Only once the index exists can searches be served through
http://localhost:8983/solr/nutch, as described in section 4.6.
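For reference, the select endpoint behind the query UI can also be queried directly. A sketch assuming the host and core names used in this walkthrough; the curl command is only printed, since Solr may not be running here.

```shell
# Assemble a Solr select URL for the "nutch" core and show the query command.
SOLR_URL="http://localhost:8983/solr/nutch"
SELECT_URL="${SOLR_URL}/select?q=content:nutch&rows=10&wt=json"

# run this once the Solr server is up:
echo "curl -s \"$SELECT_URL\""
```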
6.2 Verifying the Solr Installation
-------------------------------------------------------------------------------------------------------------------------
With Solr installed and started, it should be reachable at:
http://localhost:8983/solr/#/
or:
http://nutch.sansovo.org:8983/solr/#/
From there you can navigate to the nutch core, inspect the managed-schema, and so on.
-------------------------------------------------------------------------------------------------------------------------
Reference:
https://wiki.apache.org/nutch/NutchTutorial