A First Look at Nutch

Original

StevenCoder

2020-06-10 09:51

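Nutch ships with Hadoop, so configuring Hadoop inside Nutch is almost identical to configuring a standalone Hadoop installation.

The only differences are the two files below.

1. The nutch-site.xml file on every node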

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>nutch-1.0</value>
    <description>This parameter must be set for both crawling and searching</description>
  </property>
</configuration>

2. The conf/crawl-urlfilter.txt file on every node

# skip file:, ftp:, & mailto: urls

-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse

-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$

# skip URLs containing certain characters as probable queries, etc.

-[?*!@=]

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops

-.*(/[^/]+)/[^/]+\1/[^/]+\1/

# accept hosts in MY.DOMAIN.NAME

# allow all URLs to be fetched

# skip everything else
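To make the rule syntax easier to read, here is a minimal sketch of how such a rule file can be evaluated. It assumes that rules are tried in file order, that the first matching rule decides (`+` accepts, `-` rejects), and that a URL matching no rule is rejected. The function name `filter_url`, the shortened rule set and the host `example.com` are made up for illustration, and `grep -E` only approximates the Java regular expressions that Nutch itself uses.

```shell
# Hypothetical helper: prints "accept" or "reject" for one URL.
# Usage: filter_url RULES_FILE URL
filter_url() (
  rules_file=$1
  url=$2
  while IFS= read -r line || [ -n "$line" ]; do
    case $line in
      ''|'#'*) continue ;;       # skip blank lines and comments
    esac
    sign=${line%"${line#?}"}     # first character: + or -
    pattern=${line#?}            # the rest is the regular expression
    if printf '%s\n' "$url" | grep -Eq -- "$pattern"; then
      # The first matching rule decides (assumed semantics).
      if [ "$sign" = "+" ]; then echo accept; else echo reject; fi
      exit 0
    fi
  done < "$rules_file"
  echo reject                    # no rule matched
)

# A shortened, hypothetical rule file for the demonstration.
rules=$(mktemp)
cat > "$rules" <<'EOF'
# skip file:, ftp:, & mailto: urls
-^(file|ftp|mailto):
-\.(gif|jpg|png)$
-[?*!@=]
+^http://([a-z0-9]*\.)*example\.com/
-.
EOF

filter_url "$rules" "http://www.example.com/docs/index.html"   # accept
filter_url "$rules" "http://www.example.com/logo.png"          # reject (suffix rule)
filter_url "$rules" "http://www.example.com/search?q=nutch"    # reject (query characters)
filter_url "$rules" "http://other.org/"                        # reject (falls through to -.)
```

Because the first match decides, the order of the lines matters: an accept rule placed above the suffix rule would let image URLs through.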

Then:

bin/hadoop dfs -put crawltest/urls urls (crawltest/urls is the seed list you define yourself)
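The seed directory uploaded by this command could be prepared as in the sketch below. It assumes the usual layout of plain-text files with one URL per line; the file name `seed.txt` and the URL `http://www.example.com/` are placeholders, not values from the original setup.

```shell
# Create the local seed directory that is later uploaded to HDFS.
mkdir -p crawltest/urls

# One start URL per line; add more lines for more seeds.
printf '%s\n' 'http://www.example.com/' > crawltest/urls/seed.txt

# Show what will be uploaded.
cat crawltest/urls/seed.txt
```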

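bin/nutch crawl urls -dir data -depth 3 -topN 10 (the crawler writes everything it fetches to data)

After the crawl finishes:

bin/hadoop fs -get data data (surprisingly, the data still has to be downloaded to the local file system!)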
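Before pointing the web front end at the downloaded copy, it can help to check that the download is complete. The sketch below assumes that the Nutch 1.0 crawl command lays out its output as the subdirectories crawldb, linkdb, segments, indexes and index; that layout and the helper name `check_crawl_dir` are assumptions, so adjust the list if your data directory looks different.

```shell
# Hypothetical helper: returns 0 if every expected subdirectory exists,
# otherwise prints the missing ones and returns 1.
check_crawl_dir() {
  dir=$1
  missing=0
  # Assumed layout of a Nutch 1.0 crawl directory.
  for sub in crawldb linkdb segments indexes index; do
    if [ ! -d "$dir/$sub" ]; then
      echo "missing: $dir/$sub"
      missing=1
    fi
  done
  return "$missing"
}

if check_crawl_dir data; then
  echo "data looks complete"
else
  echo "data is incomplete; re-run the download"
fi
```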

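Install Tomcat

Copy the web front end nutch-1.0.war from the Nutch home directory to ***/tomcat/webapps/.

Open http://localhost:8080/nutch-1.0 in a browser. Tomcat unpacks nutch-1.0.war automatically and creates a nutch-1.0 directory under webapps.

Configure the web front end's nutch-site.xml file, located in ***/tomcat/webapps/nutch-1.0/WEB-INF/classes/, as follows: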

<property>
  <name>http.agent.name</name> <!-- required; without it the search returns no results -->
  <value>nutch-1.0</value>
  <description>HTTP 'User-Agent' request header.</description>
</property>
<property>
  <name>searcher.dir</name>
  <value>D:/data</value> <!-- data is the index directory generated by the crawler; use an absolute path -->
  <description>Path to root of crawl.</description>
</property>

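(6) Restart Tomcat. Changes to the configuration files only take effect after a restart.

(7) Search for keywords at http://localhost:8080/nutch-1.0.

There appears to be a way to run the search in distributed mode; I will look into that next.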
