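Building a Web Search Engine on Windows: Nutch + Lucene + MySQL + Tomcat (Part 1)

Environment:
(All of these tools are available from their official sites; download and install them yourself.)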
nutch 2.2.0
lucene 7.1.0
apache-ant-1.10.1
apache-ivy-2.4.0
apache-tomcat-9.0.1
mysql
jdk-9.0.1_windows-x64_bin

I. Nutch + MySQL development environment in Eclipse

1. Use Nutch as the crawler and store the content of the crawled pages in MySQL.

2. Edit ivy/ivysettings.xml in the Nutch 2.2.1 source

Add a repository:

   <property name="org.restlet"
    value="http://maven.restlet.org"
    override="false"/>

Find the following section and add any resolver that is missing:

   <chain name="default" dual="true">
      <resolver ref="local"/>
      <resolver ref="maven2"/>
      <resolver ref="apache-snapshot"/>
      <resolver ref="sonatype"/>
      <resolver ref="restlet"/>
    </chain>

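In my tests, some packages could not be downloaded without this resolver; this may depend on the network.

3. Edit ivy/ivy.xml

Enable the following two dependencies: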

<dependency org="org.apache.gora" name="gora-sql" rev="0.1.1-incubating" conf="*->default" />

<dependency org="mysql" name="mysql-connector-java" rev="5.1.18" conf="*->default"/>

4. Open a command prompt and change to the Nutch directory

Run:

ant eclipse -verbose

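Because of limited network bandwidth, the whole run took about half an hour.

The output after the run completes was shown in a screenshot (image not available).
During the build I ran into the following problems; the fix is listed with each one.

(1) ant: Unable to find a javac compiler

The Java environment is not configured correctly.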

Unable to find a javac compiler;
com.sun.tools.javac.Main is not on the classpath.
Perhaps JAVA_HOME does not point to the JDK
 org.apache.tools.ant.taskdefs.compilers.CompilerAdapterFactory.getCompiler(CompilerAdapterFactory.java:106)
 org.apache.tools.ant.taskdefs.Javac.compile(Javac.java:935)
 org.apache.tools.ant.taskdefs.Javac.execute(Javac.java:764)
 org.apache.jasper.compiler.Compiler.generateClass(Compiler.java:382)
 org.apache.jasper.compiler.Compiler.compile(Compiler.java:472)
 org.apache.jasper.compiler.Compiler.compile(Compiler.java:451)
 org.apache.jasper.compiler.Compiler.compile(Compiler.java:439)
 org.apache.jasper.JspCompilationContext.compile(JspCompilationContext.java:511)
 org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:295)
 org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java:292)
 org.apache.jasper.servlet.JspServlet.service(JspServlet.java:236)
 javax.servlet.http.HttpServlet.service(HttpServlet.java:802)

note The full stack trace of the root cause is available in the Apache Tomcat/5.0.28 logs.

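Solution:
First, check that your environment variables are correct, JAVA_HOME in particular; it is easy to forget to set one of them.
Second, the JDK's lib directory contains a tools.jar file (only in JDK 8 and earlier; it was removed in JDK 9). Copy it into common/lib under the Tomcat installation directory, which should be enough.
Finally, if that still does not work, open Configure Tomcat, go to the Java tab, and add the following to Java Options: -Djava.home=C:/Program Files/Java/jdk1.5.0_04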
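
Before changing any settings, it can help to confirm whether the Java runtime you are actually launching includes a compiler at all. The following is a minimal sketch, assuming you run it with the same java executable that Ant picks up. The class name JdkCheck is hypothetical and is not part of Nutch, Ant, or Tomcat; it only uses the standard javax.tools API, which returns null for the system compiler when the runtime is a plain JRE.

```java
import javax.tools.JavaCompiler;
import javax.tools.ToolProvider;

/** Hypothetical helper (not part of Nutch, Ant, or Tomcat). */
class JdkCheck {

    /** Returns true when the running Java runtime provides the system Java compiler. */
    static boolean hasCompiler() {
        JavaCompiler compiler = ToolProvider.getSystemJavaCompiler();
        return compiler != null; // null means this runtime has no compiler (a JRE)
    }

    /** Builds a one-line report from standard system properties and JAVA_HOME. */
    static String report() {
        return "java.home=" + System.getProperty("java.home")
                + ", JAVA_HOME=" + System.getenv("JAVA_HOME")
                + ", compiler available=" + hasCompiler();
    }

    public static void main(String[] args) {
        System.out.println(report());
    }
}
```

If the report shows that no compiler is available, or java.home and JAVA_HOME point to different installations, fixing JAVA_HOME and PATH is the first thing to try.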

(2) ivy:resolve doesn't support the "log" attribute
Add apache-nutch-2.2.1\ivy\ivy-2.2.0.jar to the CLASSPATH.

5. The build folder now contains much more than before.

6. Open Eclipse and use Import to load the Nutch project.

7. Configure conf/nutch-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
<property>
  <name>http.agent.name</name>
  <value>YourNutchSpider</value>
 </property>

 <property>
  <name>http.accept.language</name>
  <value>ja-jp, en-us,en-gb,en,zh-cn,zh-tw;q=0.7,*;q=0.3</value>
  <description>Value of the “Accept-Language” request header field.
  This allows selecting non-English language as default one to retrieve.
  It is a useful setting for search engines build for certain national group.</description>
 </property>

 <property>
  <name>parser.character.encoding.default</name>
  <value>utf-8</value>
  <description>The character encoding to fall back to when no other information
  is available</description>
 </property>

 <property>
  <name>plugin.folders</name>
  <value>D:\APP\apache-nutch-2.2\src\plugin</value>
  <description>Directories where nutch plugins are located. Each
  element may be a relative or absolute path. If absolute, it is used
  as is. If relative, it is searched for on the classpath.</description>
 </property>

 <property>
  <name>storage.data.store.class</name>
  <value>org.apache.gora.sql.store.SqlStore</value>
  <description>The Gora DataStore class for storing and retrieving data.
  Currently the following stores are available: ….</description>
 </property>

 <property> 
    <name>generate.batch.id</name> 
    <value>*</value> 
</property>

</configuration>

8. Configure gora.properties

gora.datastore.default=org.apache.gora.sql.store.SqlStore
gora.datastore.autocreateschema=true
gora.sqlstore.jdbc.driver=com.mysql.jdbc.Driver
gora.sqlstore.jdbc.url=jdbc:mysql://localhost:3306/nutch?createDatabaseIfNotExist=true&useUnicode=true&characterEncoding=utf8&autoReconnect=true&zeroDateTimeBehavior=convertToNull
gora.sqlstore.jdbc.user=root
gora.sqlstore.jdbc.password=root
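
A missing or misspelled key in this file tends to show up only as a failure deep inside the crawl, so a small pre-flight check can save time. The sketch below is an assumption-laden illustration, not something Gora or Nutch provides: the class GoraPropertiesCheck and its methods are hypothetical, and the list of required keys is simply the set of keys shown above.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

/** Hypothetical pre-flight check; the key names are the ones listed in gora.properties above. */
class GoraPropertiesCheck {

    static final String[] REQUIRED_KEYS = {
        "gora.datastore.default",
        "gora.sqlstore.jdbc.driver",
        "gora.sqlstore.jdbc.url",
        "gora.sqlstore.jdbc.user",
        "gora.sqlstore.jdbc.password"
    };

    /** Parses text in java.util.Properties format. */
    static Properties parse(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    /** Returns the required keys that are absent or blank. */
    static List<String> missingKeys(Properties props) {
        List<String> missing = new ArrayList<>();
        for (String key : REQUIRED_KEYS) {
            String value = props.getProperty(key);
            if (value == null || value.trim().isEmpty()) {
                missing.add(key);
            }
        }
        return missing;
    }
}
```

To use it, read conf/gora.properties into a string, pass it to parse, and print the result of missingKeys before starting the crawl. An empty list means every key shown above is present.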

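9. Create the MySQL database and table

(1) When storing crawled pages in the MySQL table, I found that varchar(256) was too short for some values, so I widened those columns to varchar(768), the largest length I could use here.
(2) Crawled pages can contain emoji and other characters that take four bytes in UTF-8, and MySQL then reports:
java.sql.SQLException: Incorrect string value: '\xF0\x9F\x90\xBB' for column
Solution:

1) Create the table with these options: ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci;
2) Edit C:\ProgramData\MySQL\MySQL Server 5.7\my.ini as follows (ProgramData is a hidden folder):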
------------------my.ini------------------------------------------------------
# For advice on how to change settings please see
# http://dev.mysql.com/doc/refman/5.6/en/server-configuration-defaults.html
[client]
default-character-set=utf8mb4
[mysql]
default-character-set = utf8mb4
# Remove leading # and set to the amount of RAM for the most important data
# cache in MySQL. Start at 70% of total RAM for dedicated server, else 10%.
# innodb_buffer_pool_size = 128M

# Remove leading # to turn on a very important data integrity option: logging
# changes to the binary log between backups.
# log_bin
# These are commonly set, remove the # and set as required.
# basedir = .....
# datadir = .....
# port = .....
# server_id = .....
# socket = .....

# Remove leading # to set options mainly useful for reporting servers.
# The server defaults are faster for transactions and fast SELECTs.
# Adjust sizes as needed, experiment to find the optimal values.
# join_buffer_size = 128M
# sort_buffer_size = 2M
# read_rnd_buffer_size = 2M

[mysqld]
sql_mode=NO_ENGINE_SUBSTITUTION,STRICT_TRANS_TABLES
log-error=/var/log/mysqld.log
long_query_time=3

character-set-client-handshake = FALSE
character-set-server = utf8mb4
collation-server = utf8mb4_unicode_ci
init_connect='SET NAMES utf8mb4'

#log-slow-queries= /usr/local/mysql/log/slowquery.log
 ------------------------------------------------------------------------
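3) Restart the MySQL service (service mysql stop; service mysql start) and the problem is resolved. On Windows, restart the MySQL service from the Services console instead.
Cause (found online):
MySQL's utf8 character set stores at most 3 bytes per character, but some Unicode characters take 4 bytes when encoded as UTF-8, which triggers the error.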
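
The byte sequence in the error message can be reproduced in plain Java, which makes the 3-byte versus 4-byte distinction concrete. This is a small sketch with a hypothetical class name, Utf8WidthDemo; it uses only the standard library. The code point U+1F43B is the character whose UTF-8 encoding is the F0 9F 90 BB sequence from the exception above.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical demo class: shows which strings need MySQL's utf8mb4 rather than utf8. */
class Utf8WidthDemo {

    /** Number of bytes the string occupies when encoded as UTF-8. */
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    /** True if any code point lies outside the Basic Multilingual Plane (4 bytes in UTF-8). */
    static boolean needsUtf8mb4(String s) {
        return s.codePoints().anyMatch(Character::isSupplementaryCodePoint);
    }

    /** UTF-8 bytes as space-separated upper-case hex, for comparison with the MySQL error. */
    static String hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String bear = new String(Character.toChars(0x1F43B)); // the character from the error
        String cjk = "\u4E2D";                                 // a typical 3-byte character
        System.out.println(hex(bear) + " needsUtf8mb4=" + needsUtf8mb4(bear));
        System.out.println(hex(cjk) + " needsUtf8mb4=" + needsUtf8mb4(cjk));
    }
}
```

A check like needsUtf8mb4 could also be used to log which crawled pages contain such characters, if you want to see how common they are in your data.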

(3) While crawling, you will see that Nutch limits downloaded content to 65536 bytes by default:

Content of size 79091 was truncated to 65536

Solution:
Edit conf/nutch-default.xml (-1 means no limit on content size):

<property>
  <name>http.content.limit</name>
  <value>-1</value>
  <description>The length limit for downloaded content using the http
  protocol, in bytes. If this value is nonnegative (>=0), content longer
  than it will be truncated; otherwise, no truncation at all. Do not
  confuse this setting with the file.content.limit setting.
  </description>
</property>
<property>
  <name>parser.skip.truncated</name>
  <value>false</value>
  <description>Boolean value for whether we should skip parsing for truncated documents. By default this 
  property is activated due to extremely high levels of CPU which parsing can sometimes take.  
  </description>
</property>
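
The rule stated in the http.content.limit description (a nonnegative value truncates, a negative value does not) can be illustrated in a few lines. This is only a sketch of that rule with a hypothetical class, ContentLimitDemo; it is not Nutch's own implementation.

```java
import java.util.Arrays;

/** Illustration only: mirrors the rule in the http.content.limit description, not Nutch's code. */
class ContentLimitDemo {

    /**
     * Returns the content cut to at most limit bytes when limit is nonnegative,
     * or the content unchanged when limit is negative.
     */
    static byte[] applyLimit(byte[] content, int limit) {
        if (limit >= 0 && content.length > limit) {
            return Arrays.copyOf(content, limit);
        }
        return content;
    }

    public static void main(String[] args) {
        byte[] page = new byte[79091]; // same size as in the log line above
        System.out.println(applyLimit(page, 65536).length);
        System.out.println(applyLimit(page, -1).length);
    }
}
```

With the default limit the 79091-byte page from the log message is cut to 65536 bytes; with -1 it is kept whole, which is why the setting removes the truncation message.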
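
The table definition I ended up with, using the column sizes and character set described in (1) and (2):

CREATE TABLE nutch.webpage(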
id varchar(768) NOT NULL,
headers blob,
text longtext DEFAULT NULL,
status int(11) DEFAULT NULL,
markers blob,
parseStatus blob,
modifiedTime bigint(20) DEFAULT NULL,
prevModifiedTime bigint(20) DEFAULT NULL,
score float DEFAULT NULL,
typ varchar(768) CHARACTER SET latin1 DEFAULT NULL,
batchId varchar(768) CHARACTER SET latin1 DEFAULT NULL,
baseUrl varchar(768) DEFAULT NULL,
content longblob,
title text DEFAULT NULL,
reprUrl varchar(768) DEFAULT NULL,
fetchInterval int(11) DEFAULT NULL,
prevFetchTime bigint(20) DEFAULT NULL,
inlinks mediumblob,
prevSignature blob,
outlinks mediumblob,
fetchTime bigint(20) DEFAULT NULL,
retriesSinceFetch int(11) DEFAULT NULL,
protocolStatus blob,
signature blob,
metadata blob,
PRIMARY KEY (id)
) ENGINE=InnoDB CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci;

10. Configure the run arguments for Crawler.java

Right-click Crawler.java, then Run -> Run Configurations -> Java Application -> Arguments.
If you do not need to crawl this deep, reduce the values of depth and topN.

Program arguments:
urls -depth 10 -topN 500
VM arguments:
-Dhadoop.log.dir=logs -Dhadoop.log.file=hadoop.log

Running Crawler.java usually fails with the following error, however:
Exception in thread "main" java.io.IOException: Failed to set permissions of path: \tmp\hadoop-Administrator\mapred\staging\Administrator606301699.staging to 0700
To fix this Hadoop-on-Windows error you have to modify FileUtil.java. Comment out the two statements below; otherwise the crawl fails with the Hadoop path-permission error.

private static void checkReturnValue(boolean rv, File p, FsPermission permission)
    throws IOException
  {
    //if (!rv)
    //  throw new IOException(new StringBuilder().append("Failed to set permissions of path: ").append(p).append(" to ").append(String.format("%04o", new Object[] { Short.valueOf(permission.toShort()) })).toString());
  }

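Changing a class file yourself means decompiling and recompiling it, so I provide an already modified and compiled hadoop-core-1.2.0: a hadoop-core-1.2.0.jar with the body of checkReturnValue commented out, plus the compiled FileUtil class, which can replace the corresponding class inside hadoop-core-1.2.0.jar.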

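11. Create the list of URLs to crawl

Create a urls folder in the project directory, and create a seed.txt file inside it.

Add the URLs of the sites to crawl, for example: http://www.cnblogs.com/

Note: this urls folder is the one named by the urls argument in the Crawler run arguments.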
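
If you prefer to generate the seed file from code rather than by hand, the step can be sketched as follows. The class SeedFileWriter and its methods are hypothetical helpers, not part of Nutch; the sketch assumes one URL per line in urls/seed.txt, as described above, and uses only java.nio.file.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper: creates urls/seed.txt under a project directory, one URL per line. */
class SeedFileWriter {

    /** Creates the urls folder if needed and writes the seed URLs; returns the seed file path. */
    static Path writeSeeds(Path projectDir, List<String> urls) {
        try {
            Path dir = Files.createDirectories(projectDir.resolve("urls"));
            return Files.write(dir.resolve("seed.txt"), urls, StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Reads the seed file back, mainly to verify what was written. */
    static List<String> readSeeds(Path seedFile) {
        try {
            return Files.readAllLines(seedFile, StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Assumes the program is started from the Nutch project directory.
        Path seed = writeSeeds(Paths.get("."), Arrays.asList("http://www.cnblogs.com/"));
        System.out.println("Wrote " + readSeeds(seed).size() + " seed URL(s) to " + seed);
    }
}
```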

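12. Run Crawler.java and check the data in MySQL

13. Many sites publish a robots.txt to restrict crawlers

Nutch honors that protocol, but the check can be bypassed by modifying the Nutch source.

Just comment out the following code in the FetcherReducer class: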

/*
if (!rules.isAllowed(fit.u.toString())) {
  // unblock
  fetchQueues.finishFetchItem(fit, true);
  if (LOG.isDebugEnabled()) {
    LOG.debug("Denied by robots.txt: " + fit.url);
  }
  output(fit, null, ProtocolStatusUtils.STATUS_ROBOTS_DENIED,
      CrawlStatus.STATUS_GONE);
  continue;
}
*/

References
[1]. http://blog.csdn.net/zoucui/article/details/1419019
[2]. http://www.cnblogs.com/lilies/p/5607388.html
