A First Look at Nutch on Windows XP

Original post

liliflashfly

2020-06-29 09:14

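I went to bed early last night but just could not fall asleep.

I have wanted to try Nutch for a long time, but kept telling myself I had too much going on and no energy for it.

This semester has left me feeling helpless far too often; I cannot see a goal and I feel lost.

Half an hour later, I got up after all.

Lucene and Nutch had been sitting on my hard drive for ages. I read tutorials online, followed along for quite a while, and finally got a first result.

Here is a summary that pulls together the tutorials I used.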

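First, the software. I used Nutch 1.0 and extracted the archive to D:/nutch/nutch-1.0.

Next is Cygwin, which lets you run Linux commands on Windows. I ran the online installer in the small hours of yesterday and it took forever... It stalled at 90%, which was maddening. I picked the second download mirror in its list and started over; it finished at around 3 a.m., I think.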

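Initial test:

In Cygwin, change to D:/nutch/nutch-1.0 and run bin/nutch

If it prints usage: nutch COMMAND, everything is fine so far.

An exception also seemed to be thrown; I am ignoring it for now.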

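Configuration

Under D:/nutch/nutch-1.0, create a folder named urls, and inside it a text file nutch.txt containing http://lucene.apache.org/nutch/. This is the entry point for the crawl. Reportedly you can list several addresses or use several files, but I have not tried that yet.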
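The seed list can also be created from the Cygwin shell. This is a minimal sketch that assumes the current directory is the Nutch install directory; the one-URL-per-line layout is how seed files are normally written, but the multi-URL case is untested in this post.

```shell
# Run from the Nutch install directory (D:/nutch/nutch-1.0 in this post).
mkdir -p urls

# One seed URL per line; further entry points would go on further lines.
printf '%s\n' 'http://lucene.apache.org/nutch/' > urls/nutch.txt

# Show what the crawler will read.
cat urls/nutch.txt
```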

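Open D:/nutch/nutch-1.0/conf/crawl-urlfilter.txt, find the line containing MY.DOMAIN.NAME, and change it to:

+^http://([a-z0-9]*\.)*apache.org/

This is the filter that decides which links get crawled. With this rule, only URLs that start with http:// followed by apache.org or one of its subdomains are accepted.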
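To see what such a rule accepts, the pattern can be tried against a few URLs with grep. This is only an approximation: Nutch evaluates the rule with Java regular expressions, not grep, and the test URLs and the check helper below are made up for illustration. The pattern is the rule as it would appear in crawl-urlfilter.txt, minus the leading + (which only means "accept").

```shell
# Filter pattern without the leading '+'.
pattern='^http://([a-z0-9]*\.)*apache.org/'

# Hypothetical helper: report whether a URL matches the pattern.
check() {
  if printf '%s\n' "$1" | grep -Eq "$pattern"; then
    echo "accept $1"
  else
    echo "reject $1"
  fi
}

check 'http://lucene.apache.org/nutch/'      # subdomain of apache.org
check 'http://apache.org/'                   # bare domain
check 'http://www.example.com/apache.org/'   # apache.org only appears in the path
```

The first two URLs are accepted and the third is rejected, because the pattern is anchored at the start of the URL and constrains the host part rather than the ending.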

Configuring nutch-default.xml:

Find the property elements whose <value> is empty and fill them in; the <description> of each one explains what it is for. Here is what I wrote:

<property>
  <name>http.agent.name</name>
  <value>NutchCVS</value>
  <description>HTTP 'User-Agent' request header. MUST NOT be empty -
  please set this to a single word uniquely related to your organization.
  NOTE: You should also check other related properties:
  http.robots.agents http.agent.description http.agent.url
  http.agent.email http.agent.version
  and set their values appropriately.
  </description>
</property>

<property>
  <name>http.robots.agents</name>
  <value>*</value>
  <description>The agent strings we'll look for in robots.txt files,
  comma-separated, in decreasing order of precedence. You should
  put the value of http.agent.name as the first agent name, and keep the
  default * at the end of the list. E.g.: BlurflDev,Blurfl,*
  </description>
</property>

<property>
  <name>http.robots.403.allow</name>
  <value>true</value>
  <description>Some servers return HTTP status 403 (Forbidden) if
  /robots.txt doesn't exist. This should probably mean that we are
  allowed to crawl the site nonetheless. If this is set to false,
  then such sites will be treated as forbidden.</description>
</property>

<property>
  <name>http.agent.description</name>
  <value>Nutch</value>
  <description>Further description of our bot- this text is used in
  the User-Agent header. It appears in parenthesis after the agent name.
  </description>
</property>

<property>
  <name>http.agent.url</name>
  <value>http://lucene.apache.org/nutch/</value>
  <description>A URL to advertise in the User-Agent header. This will
  appear in parenthesis after the agent name. Custom dictates that this
  should be a URL of a page explaining the purpose and behavior of this
  crawler.
  </description>
</property>

<property>
  <name>http.agent.email</name>
  <value>nutch-agent@lucene.apache.org</value>
  <description>An email address to advertise in the HTTP 'From' request
  header and User-Agent header. A good practice is to mangle this
  address (e.g. 'info at example dot com') to avoid spamming.
  </description>
</property>

<property>
  <name>http.agent.version</name>
  <value>Nutch-1.0</value>
  <description>A version string to advertise in the User-Agent
  header.</description>
</property>

<property>
  <name>http.agent.host</name>
  <value>LiFangXP</value>
  <description>Name or IP address of the host on which the Nutch crawler
  would be running. Currently this is used by 'protocol-httpclient'
  plugin.
  </description>
</property>

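Now the crawl can begin.

In Cygwin, run:

bin/nutch crawl urls -dir crawled -depth 3 -topN 50 >&crawl.log

This reads the seed addresses from the urls folder, stores the results in the crawled folder, and follows links to a depth of 3. -topN 50 limits each level to the 50 top-scoring URLs. Both the normal output and the error output of the run are written to crawl.log.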
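As a rough illustration of how the two limits combine, the sketch below computes an upper bound on the number of pages this command can fetch, assuming -topN applies to each of the -depth rounds. The variable names are made up for the example, and the crawl command is only printed, not executed.

```shell
depth=3   # -depth: number of generate/fetch rounds
topn=50   # -topN: at most this many top-scoring URLs fetched per round

# Upper bound on pages fetched over the whole crawl.
echo "at most $((depth * topn)) pages"

# '>& crawl.log' is bash shorthand for '> crawl.log 2>&1'.
# 'echo' only prints the command here; remove it to actually run the crawl.
echo bin/nutch crawl urls -dir crawled -depth "$depth" -topN "$topn"
```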

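The first run failed because the JAVA_HOME environment variable was not set. After I set it to the JDK install directory, that problem went away.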
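In a Cygwin shell, the variable can be set for the current session as sketched below. The JDK path is hypothetical; substitute the directory where your JDK is actually installed.

```shell
# Hypothetical JDK location, written as a Cygwin path; adjust to your system.
export JAVA_HOME='/cygdrive/c/Java/jdk1.6.0'
echo "JAVA_HOME=$JAVA_HOME"

# Sanity check: the java launcher is expected under $JAVA_HOME/bin.
if [ -x "$JAVA_HOME/bin/java" ]; then
  echo "java found"
else
  echo "java not found under JAVA_HOME, check the path"
fi
```

Putting the export line in ~/.bashrc would make it apply to every new Cygwin session.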

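The next run failed too, complaining that D:/nutch/nutch-1.0/crawled/crawldb/current/data could not be found. Looking inside current, I found only a folder named part-00000, which contained a file called data. As a last-ditch attempt I copied that file up into the current folder and ran the command again, and the crawl finally started.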
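The manual copy described above could be scripted roughly as follows. This only mirrors the workaround from this post and is not an official fix; promote_data_file is a made-up helper name, and it does nothing if the files are not where the post found them.

```shell
# Copy part-00000/data up one level so that current/data exists.
# Argument: the crawl output directory (crawled in this post).
promote_data_file() {
  current="$1/crawldb/current"
  if [ -f "$current/part-00000/data" ] && [ ! -e "$current/data" ]; then
    cp "$current/part-00000/data" "$current/data"
  fi
}

promote_data_file crawled
```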

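Once the crawl has finished, start Tomcat to view the results.

Copy the nutch-1.0.war file from the nutch-1.0 folder into Tomcat's webapps directory and rename it to nutch.war. When Tomcat starts, it unpacks the archive into a folder named nutch.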
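The copy-and-rename step can be done in one command. The sketch below assumes a CATALINA_HOME variable pointing at the Tomcat install directory and a WAR file named nutch-1.0.war in the current directory; deploy_war is a made-up helper, and nothing is copied if either assumption does not hold.

```shell
# Copy a WAR file into Tomcat's webapps directory under the name nutch.war.
# Arguments: path to the WAR file, Tomcat install directory.
deploy_war() {
  cp "$1" "$2/webapps/nutch.war"
}

# Only deploy when the assumed variable and file are actually present.
if [ -n "${CATALINA_HOME:-}" ] && [ -f nutch-1.0.war ]; then
  deploy_war nutch-1.0.war "$CATALINA_HOME"
fi
```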

Go to the unpacked webapps/nutch/WEB-INF/classes directory and open nutch-site.xml. Inside the <configuration> element, set searcher.dir to D:/nutch/nutch-1.0/crawled, the location where the crawled data is stored.

My configuration:

<configuration>
  <property>
    <name>http.agent.name</name>
    <value>nutch</value>
    <description>none</description>
  </property>
  <property>
    <name>searcher.dir</name>
    <value>D:/nutch/nutch-1.0/crawled</value>
    <description>none</description>
  </property>
</configuration>

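With that done, start Tomcat and open localhost/nutch to run a search.

All of this only scratches the surface, and there is more to dig into next. A quick test showed that Chinese text is segmented into single characters, so the analyzer will have to be replaced. The Chinese word segmentation packages I have used are Paoding (庖丁解牛) and IKAnalyzer. I have not tried either with Nutch yet, but it is worth a try.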

Reference tutorials:

Main reference:

http://www.360doc.com/content/07/0531/13/23378_530951.shtml

Others:

http://blog.csdn.net/fiwiner/archive/2009/10/19/4699947.aspx

http://nhy520.javaeye.com/blog/382726

http://www.supidea.com/post/nutch_1_0.aspx
