A First Look at Nutch on Windows XP

Original

liliflashfly

2020-06-29 09:14

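I went to bed early last night but just could not fall asleep.

I have wanted to try Nutch for a long time, but kept telling myself there was too much going on and I had no energy for it.

This semester has left me feeling helpless; I can't see a goal and I feel lost.

Half an hour later I gave up and got out of bed.

Lucene and Nutch had been sitting on my hard drive for ages. I went online, read some tutorials, followed along for quite a while, and finally got my first results.

Here is a summary pieced together from several tutorials.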

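First, the software. I am using Nutch 1.0, unpacked to D:/nutch/nutch-1.0.

Next is Cygwin, which lets you run Linux commands on Windows. The online install in the small hours took forever... worst of all, it stalled at 90%, which nearly drove me mad. I picked the second mirror on its list and started over, and it was done at around 3 a.m., I think.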

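First test:

In Cygwin, change to D:/nutch/nutch-1.0 and run bin/nutch

If it prints: usage: nutch COMMAND, everything is fine so far~

It seems to throw an exception as well; ignoring that for now~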

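Configuration

Under D:/nutch/nutch-1.0, create a folder named urls, and inside it a text file nutch.txt containing: http://lucene.apache.org/nutch/. This is the entry point for the crawl. Apparently you can list several addresses or use several files; I have not tried that yet~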
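The seed step can also be scripted from the Cygwin prompt. This is only a small sketch: make_seed_list is a helper name made up for this post, and it assumes the usual seed-file layout of one URL per line.

```shell
# Write one seed URL per line into DIR/FILE.
# Usage: make_seed_list DIR FILE URL [URL...]
# (make_seed_list is a hypothetical helper, not part of Nutch.)
make_seed_list() {
  dir=$1
  file=$2
  shift 2
  mkdir -p "$dir" || return 1
  # printf repeats the format for every remaining argument,
  # so each URL ends up on its own line.
  printf '%s\n' "$@" > "$dir/$file"
}

# Run from the Nutch directory: creates urls/nutch.txt with a single seed.
make_seed_list urls nutch.txt "http://lucene.apache.org/nutch/"
```

Passing more URLs after the file name would put each of them on its own line in the same file.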

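Open D:/nutch/nutch-1.0/conf/crawl-urlfilter.txt, find the line containing MY.DOMAIN.NAME, and change it to:

+^http://([a-z0-9]*\.)*apache.org/

This is the filter that decides which links get crawled. The leading + marks an include rule, and the regular expression after it accepts only URLs that start with http:// and whose host is apache.org or one of its subdomains.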
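To see which URLs such a rule lets through, the pattern can be tried out with grep. This is just an approximation: Nutch evaluates the rule with Java regular expressions, while the sketch below uses grep -E, which behaves the same for this simple pattern. The url_allowed helper and the sample URLs are made up for illustration.

```shell
# The include pattern without its leading "+"
# (the "+" only marks the rule as an include rule).
PATTERN='^http://([a-z0-9]*\.)*apache.org/'

# Succeeds (exit status 0) when the URL matches the include pattern.
# (url_allowed is a hypothetical helper, not part of Nutch.)
url_allowed() {
  printf '%s\n' "$1" | grep -Eq "$PATTERN"
}

for url in \
  "http://lucene.apache.org/nutch/" \
  "http://apache.org/" \
  "http://www.example.com/"
do
  if url_allowed "$url"; then
    echo "accept $url"
  else
    echo "reject $url"
  fi
done
```

The first two URLs are accepted and the third is rejected. Note that the pattern is anchored at the start of the URL, so it is the host part that has to be under apache.org, not the end of the address.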

Configuring nutch-default.xml:

Find the property elements whose <value> is empty and fill them in; the <description> of each one explains what it is for. Here is what I used:

<property>
  <name>http.agent.name</name>
  <value>NutchCVS</value>
  <description>HTTP 'User-Agent' request header. MUST NOT be empty -
  please set this to a single word uniquely related to your organization.
  NOTE: You should also check other related properties:
  http.robots.agents http.agent.description http.agent.url
  http.agent.email http.agent.version
  and set their values appropriately.
  </description>
</property>

<property>
  <name>http.robots.agents</name>
  <value>*</value>
  <description>The agent strings we'll look for in robots.txt files,
  comma-separated, in decreasing order of precedence. You should
  put the value of http.agent.name as the first agent name, and keep the
  default * at the end of the list. E.g.: BlurflDev,Blurfl,*
  </description>
</property>

<property>
  <name>http.robots.403.allow</name>
  <value>true</value>
  <description>Some servers return HTTP status 403 (Forbidden) if
  /robots.txt doesn't exist. This should probably mean that we are
  allowed to crawl the site nonetheless. If this is set to false,
  then such sites will be treated as forbidden.</description>
</property>

<property>
  <name>http.agent.description</name>
  <value>Nutch</value>
  <description>Further description of our bot- this text is used in
  the User-Agent header. It appears in parenthesis after the agent name.
  </description>
</property>

<property>
  <name>http.agent.url</name>
  <value>http://lucene.apache.org/nutch/</value>
  <description>A URL to advertise in the User-Agent header. This will
  appear in parenthesis after the agent name. Custom dictates that this
  should be a URL of a page explaining the purpose and behavior of this
  crawler.
  </description>
</property>

<property>
  <name>http.agent.email</name>
  <value>nutch-agent@lucene.apache.org</value>
  <description>An email address to advertise in the HTTP 'From' request
  header and User-Agent header. A good practice is to mangle this
  address (e.g. 'info at example dot com') to avoid spamming.
  </description>
</property>

<property>
  <name>http.agent.version</name>
  <value>Nutch-1.0</value>
  <description>A version string to advertise in the User-Agent
  header.</description>
</property>

<property>
  <name>http.agent.host</name>
  <value>LiFangXP</value>
  <description>Name or IP address of the host on which the Nutch crawler
  would be running. Currently this is used by 'protocol-httpclient'
  plugin.
  </description>
</property>
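Since the description says http.agent.name MUST NOT be empty, a quick check before crawling can save a failed run. The sketch below is rough text matching with tr and sed, not a real XML parser, and get_property and agent_name_is_set are helper names invented here; it assumes each <value> directly follows its <name>, as in the listing above.

```shell
# Print the <value> that follows <name>NAME</name> in a configuration file.
# Usage: get_property FILE NAME
# (Hypothetical helper; plain text matching, not XML parsing.)
get_property() {
  # Join the file into a single line so that <name> and <value>
  # can be matched even when they sit on separate lines.
  tr '\n' ' ' < "$1" |
    sed -n "s|.*<name>$2</name>[[:space:]]*<value>\([^<]*\)</value>.*|\1|p"
}

# Succeeds only when http.agent.name is present and non-empty in FILE.
agent_name_is_set() {
  [ -n "$(get_property "$1" http.agent.name)" ]
}
```

From the Nutch directory it could be used as: agent_name_is_set conf/nutch-default.xml || echo "http.agent.name is empty"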

Then it is time to start crawling..

Run in Cygwin:

bin/nutch crawl urls -dir crawled -depth 3 -topN 50 >&crawl.log

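This reads the seed addresses from the urls folder, stores the results in the crawled folder, and follows links to a depth of 3. -topN 50 limits each depth level to at most 50 pages (the top-scoring ones). The output of the run, error messages included, goes to crawl.log.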
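In bash, >&crawl.log sends both standard output and standard error to the file; the portable spelling is > crawl.log 2>&1. A small sketch of the same run with the parameters named, where run_logged and the variable names are made up for illustration:

```shell
# Run a command with stdout and stderr both captured in LOGFILE.
# Usage: run_logged LOGFILE COMMAND [ARG...]
# (run_logged is a hypothetical helper, not part of Nutch.)
run_logged() {
  logfile=$1
  shift
  "$@" > "$logfile" 2>&1
}

# Parameters of the crawl command (variable names are illustrative).
SEED_DIR=urls      # folder holding the seed URL files
CRAWL_DIR=crawled  # output folder, passed as -dir
DEPTH=3            # number of link levels to follow, passed as -depth
TOPN=50            # maximum pages fetched per level, passed as -topN

# Uncomment inside the Nutch directory to start the crawl:
# run_logged crawl.log bin/nutch crawl "$SEED_DIR" -dir "$CRAWL_DIR" -depth "$DEPTH" -topN "$TOPN"
```

Keeping the error stream in the same log is what makes failures such as a missing environment variable show up in crawl.log instead of scrolling past in the terminal.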

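First run: failed. It turned out there was no JAVA_HOME environment variable. I set it to the JDK install directory and that was fixed.

Second run: failed again, complaining that D:/nutch/nutch-1.0/crawled/crawldb/current/data could not be found. Looking inside, current held only a part-00000 folder with a data file in it. As a last resort I copied that file up into current, ran it again, and the crawl finally started~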
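A check like the following, run before the crawl, would have caught the missing variable straight away. It is only a sketch: check_java_home is a made-up helper, and the JDK path in the comment is hypothetical.

```shell
# Succeeds when JAVA_HOME is set and points at an existing directory.
# (check_java_home is a hypothetical helper, not part of Nutch.)
check_java_home() {
  if [ -z "${JAVA_HOME:-}" ]; then
    echo "JAVA_HOME is not set" >&2
    return 1
  fi
  if [ ! -d "$JAVA_HOME" ]; then
    echo "JAVA_HOME is not a directory: $JAVA_HOME" >&2
    return 1
  fi
  return 0
}

# Hypothetical JDK location; substitute your own install path.
# export JAVA_HOME="/cygdrive/c/Java/jdk1.6.0"
```

Setting the variable with export in the Cygwin shell only lasts for that session; adding it to the Windows environment variables, as done here, makes it permanent.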

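Once the crawl finishes, start Tomcat to view the results.

Copy nutch-1.0.war from the nutch-1.0 folder into Tomcat's webapps folder and rename it nutch.war. When Tomcat starts, it unpacks the file into a nutch folder.

Go to the unpacked webapps/nutch/WEB-INF/classes directory, open nutch-site.xml, and inside the <configuration> element set searcher.dir to D:/nutch/nutch-1.0/crawled, that is, the location where the crawled data is stored.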
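The copy-and-rename step can be done in one go from Cygwin. A minimal sketch, where deploy_war is a helper name invented here and the Tomcat location is whatever your install uses:

```shell
# Copy WAR into TOMCAT_HOME/webapps under the name nutch.war.
# Usage: deploy_war WAR TOMCAT_HOME
# (deploy_war is a hypothetical helper, not part of Nutch or Tomcat.)
deploy_war() {
  war=$1
  webapps=$2/webapps
  [ -f "$war" ] || { echo "no such file: $war" >&2; return 1; }
  [ -d "$webapps" ] || { echo "no webapps directory: $webapps" >&2; return 1; }
  # The name of the WAR decides the URL path, so nutch.war ends up at /nutch.
  cp "$war" "$webapps/nutch.war"
}
```

It would be called with the WAR file from the Nutch folder and the Tomcat install directory, for example deploy_war "$NUTCH_WAR" "$CATALINA_HOME", where NUTCH_WAR is an illustrative variable holding the path to the WAR file.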

My configuration:

<configuration>
  <property>
    <name>http.agent.name</name>
    <value>nutch</value>
    <description>none</description>
  </property>
  <property>
    <name>searcher.dir</name>
    <value>D:/nutch/nutch-1.0/crawled</value>
    <description>none</description>
  </property>
</configuration>

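With that in place, start (or restart) Tomcat and open localhost/nutch to search~ If Tomcat is not listening on port 80, add the port, for example localhost:8080/nutch.

All of this only scratches the surface; there is a lot more to look into. A quick test showed that Chinese text is tokenized by splitting it into single characters, so that will need replacing. The Chinese word segmenters I have used are Paoding (庖丁解牛) and IKAnalyzer; I have not tried them with Nutch yet, but it is worth a go~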

Reference tutorials:

Main reference:

http://www.360doc.com/content/07/0531/13/23378_530951.shtml

Others:

http://blog.csdn.net/fiwiner/archive/2009/10/19/4699947.aspx

http://nhy520.javaeye.com/blog/382726

http://www.supidea.com/post/nutch_1_0.aspx
