Distributed deployment of Nutch 2.1 on HBase

Official documentation: http://wiki.apache.org/nutch/Nutch2Tutorial?action=show&redirect=GORA_HBase

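Deployment guides for Nutch 2.0 and later are still fragmentary online. After two days of struggle I finally got Nutch 2.1 running on HBase, so I am sharing the steps here.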

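Prepare two machines:
cr5 (master): 192.168.8.185, cr8 (slave): 192.168.8.188
The two machines must be able to reach each other over ssh (search online for the details).
Edit /etc/hostname on each machine so that it contains that machine's own name: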


cr5
or
cr8

Edit /etc/hosts on both machines:


192.168.8.185   cr5
192.168.8.188   cr8
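
Before moving on, it can help to confirm that both names really are present in the hosts file. The following is a minimal shell sketch, assuming the entries use the plain "IP, whitespace, name" layout shown above; check_hosts is a hypothetical helper written for this post, not part of Hadoop or HBase.

```shell
# Hypothetical helper: verify that cr5 and cr8 both appear as host names
# in a hosts-style file (defaults to /etc/hosts).
check_hosts() {
  hosts_file="${1:-/etc/hosts}"
  for h in cr5 cr8; do
    # Match "<ip><whitespace><name>" at the start of a line.
    if ! grep -Eq "^[0-9.]+[[:space:]]+${h}([[:space:]]|\$)" "$hosts_file"; then
      echo "missing: $h"
      return 1
    fi
  done
  echo "ok"
}
```

Running check_hosts on each machine, followed by "ssh cr8 hostname" from cr5 and "ssh cr5 hostname" from cr8, should cover both prerequisites.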

I plan to run these processes on cr5:

Hadoop: NameNode, SecondaryNameNode, JobTracker
Hbase: HMaster

And these on cr8:

Hadoop: DataNode, TaskTracker
Hbase: HQuorumPeer, HRegionServer

Next we deploy Hadoop and HBase.
The official sites offer many Hadoop and HBase versions, but Nutch 2.1 does not support all of them.

The official documentation says:
•Install and configure HBase. You can get it here (N.B. Gora 0.2 uses HBase 0.90.4, however the setup is known to work with more recent versions of the HBase 0.90.x branch)

To be safe, stick with the recommended HBase 0.90.x branch.

I chose hadoop-1.0.4 and hbase-0.90.6.

With other HBase versions, running Nutch can fail with the following exception:


Exception in thread "main" java.lang.NoSuchMethodError:
org.apache.hadoop.hbase.HColumnDescriptor.setMaxVersions(I)V

I believe the cause is Gora: it has not been updated for a long time and was built against the older HBase API.

[size=large][b]I. Configuring Hadoop[/b][/size]
1. Download the .tar.gz for the chosen Hadoop version with wget.
2. Unpack it with tar zxvf hadoop-<version>.tar.gz.
3. cd into conf and edit the configuration files.
a. hadoop-env.sh

export JAVA_HOME=/opt/jdk1.6.0_21

b. core-site.xml


       <?xml version="1.0"?>
       <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
         <!-- Put site-specific property overrides in this file. -->
         <configuration>
            <property>
                <name>fs.default.name</name>
                <value>hdfs://cr5:9000/</value>
            </property>
            <property>
                <name>hadoop.tmp.dir</name>
                <value>/home/kfs/ww/data/hadoop_tmp</value>
                <description>Base directory for Hadoop's data</description>
            </property>
         </configuration>

c. hdfs-site.xml


       <?xml version="1.0"?>
       <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
       <!-- Put site-specific property overrides in this file. -->
          <configuration>
             <property>
                <name>dfs.replication</name>
                <value>1</value>
                <description>Number of replicas</description>
             </property>
          </configuration>

d. mapred-site.xml


      <?xml version="1.0"?>
      <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
      <!-- Put site-specific property overrides in this file. -->
        <configuration>
           <property>
                <name>mapred.job.tracker</name>
                <value>cr5:9001</value>
                <description>JobTracker host:port</description>
           </property>
        </configuration>

e. masters

cr5

f. slaves

cr8

Once the configuration is done, copy the hadoop directory from cr5 to cr8.
On cr5, in hadoop/bin, format the NameNode (DataNodes do not need to be formatted):
./hadoop namenode -format

Then start Hadoop:
./start-all.sh

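Check that startup succeeded:
Look through the *.log files under hadoop/logs and make sure there are no exceptions.
Then open
http://localhost:50030
http://localhost:50070
on cr5 to see the JobTracker and NameNode status pages.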
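
Besides the logs and the web pages, the jps tool that ships with the JDK lists the running Java processes, which makes it easy to see whether a daemon is missing. Here is a small sketch, assuming jps prints one "PID Name" pair per line; missing_daemons is a hypothetical helper, and the process names are the ones listed for each machine earlier in this post.

```shell
# Hypothetical helper: read `jps` output on stdin and print the names of
# the expected daemons (given as arguments) that are NOT running.
missing_daemons() {
  jps_output=$(cat)
  missing=""
  for name in "$@"; do
    # The second column of each jps line is the process name.
    if ! printf '%s\n' "$jps_output" | awk '{print $2}' | grep -qx "$name"; then
      missing="$missing $name"
    fi
  done
  # Strip the leading space; an empty line means nothing is missing.
  echo "${missing# }"
}
```

On cr5 this would be used as "jps | missing_daemons NameNode SecondaryNameNode JobTracker", and on cr8 as "jps | missing_daemons DataNode TaskTracker".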

[size=large][b]II. Configuring HBase[/b][/size]
1. Download the .tar.gz for the chosen HBase version with wget.
2. Unpack it with tar zxvf hbase-<version>.tar.gz.
3. cd into conf and edit the configuration files.
a. hbase-site.xml


<configuration>
        <property>
                <name>hbase.rootdir</name>
                <value>hdfs://cr5:9000/hbase</value>
        </property>

        <property>
                <name>hbase.cluster.distributed</name>
                <value>true</value>
        </property>

        <property>
                <name>hbase.zookeeper.quorum</name>
                <value>cr8</value>
        </property>

        <property>
                <name>hbase.zookeeper.property.dataDir</name>
                <value>/home/kfs/ww/data/zookeeper_data</value>
        </property>

        <property>
                <name>hbase.zookeeper.property.clientPort</name>
                <value>2181</value>
        </property>

        <property>
                <name>hbase.tmp.dir</name>
                <value>/home/kfs/ww/data/hbase_tmp</value>
        </property>
</configuration>

[color=red]Note: hdfs://cr5:9000/hbase here must match the Hadoop configuration (fs.default.name in core-site.xml).[/color]
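
This match can be checked mechanically. The sketch below assumes each <name> and <value> element sits on its own line, as in the files shown in this post; it is not a general XML parser, and get_prop and rootdir_matches are hypothetical helper names.

```shell
# Hypothetical helper: print the <value> that follows <name>$2</name> in file $1.
get_prop() {
  awk -v n="$2" '
    index($0, "<name>" n "</name>") { found = 1; next }
    found && /<value>/ {
      sub(/.*<value>/, "")
      sub(/<\/value>.*/, "")
      print
      exit
    }' "$1"
}

# Succeeds when hbase.rootdir ($2 = hbase-site.xml) starts with the
# fs.default.name URI ($1 = core-site.xml).
rootdir_matches() {
  fs=$(get_prop "$1" fs.default.name)
  root=$(get_prop "$2" hbase.rootdir)
  case "$root" in
    "$fs"*) [ -n "$fs" ] ;;
    *) return 1 ;;
  esac
}
```

For example, "rootdir_matches hadoop/conf/core-site.xml hbase/conf/hbase-site.xml && echo consistent", with the two paths adjusted to wherever the archives were unpacked.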

b. hbase-env.sh


export JAVA_HOME=/opt/jdk1.6.0_21
export HBASE_CLASSPATH=~/ww/hbase-0.90.6/conf
export HBASE_MANAGES_ZK=true

c. regionservers

cr8

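HBase is now configured.
There is still some follow-up work:
1. Delete hadoop-core-<version>.jar from HBase's lib directory, then copy hadoop-core-<version>.jar and commons-collections-3.2.1.jar from Hadoop into it.
Otherwise HBase's HMaster will not start!
2. Turn off the firewall.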
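
The jar swap in step 1 can be sketched as a small shell function. This assumes the usual layout of the Hadoop 1.0.x archive, with hadoop-core-<version>.jar at the top of the Hadoop directory and commons-collections-3.2.1.jar under its lib directory; check your own installation before running it. swap_hadoop_core is a hypothetical helper name.

```shell
# Hypothetical helper: replace HBase's bundled hadoop-core jar with the one
# from the Hadoop installation. $1 = Hadoop directory, $2 = HBase directory.
# Assumes hadoop-core-*.jar sits at the top of the Hadoop directory and
# commons-collections-3.2.1.jar under its lib/ directory.
swap_hadoop_core() {
  hadoop_home="$1"
  hbase_home="$2"
  # Remove the jar that shipped with HBase.
  rm -f "$hbase_home"/lib/hadoop-core-*.jar
  # Copy the jars that match the running Hadoop cluster.
  cp "$hadoop_home"/hadoop-core-*.jar "$hbase_home/lib/" &&
    cp "$hadoop_home/lib/commons-collections-3.2.1.jar" "$hbase_home/lib/"
}
```

A call would look like "swap_hadoop_core ~/ww/hadoop-1.0.4 ~/ww/hbase-0.90.6"; the Hadoop path here is an assumption modelled on the HBase path used in this post.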

In hbase/bin, start HBase with ./start-hbase.sh
To verify that it started, check the logs for exceptions,
or open http://localhost:60010 for details.

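[size=large][b]III. Configuring Nutch[/b][/size]
I will not go over importing the project into Eclipse; the configuration is what matters.
1. Download the .tar.gz for the chosen Nutch version with wget.
2. Unpack the downloaded Nutch archive with tar zxvf.
3. cd into conf and edit the configuration files.
a. gora.properties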

gora.datastore.default=org.apache.gora.hbase.store.HBaseStore

b. nutch-site.xml


<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
	<property>
		<name>http.agent.name</name>
		<value>test-nutch</value>
	</property>

	<property>
		<name>http.robots.agents</name>
		<value>test-nutch,*</value>
	</property>

	<property>
		<name>http.agent.name.check</name>
		<value>true</value>
	</property>

	<!-- property> <name>plugin.includes</name> <value>.*</value> <description>Enable 
		all plugins during unit testing.</description> </property -->

	<property>
		<name>distributed.search.test.port</name>
		<value>60000</value>
		<description>TCP port used during junit testing.</description>
	</property>

	<property>
		<name>http.accept.language</name>
		<value>ja-jp, en-us,en-gb,en;q=0.7,*;q=0.3</value>
		<description>Value of the "Accept-Language" request header field.
			This allows selecting a non-English language as the default one to
			retrieve. It is a useful setting for search engines built for a
			certain national group.
		</description>
	</property>

	<property>
		<name>parser.character.encoding.default</name>
		<value>utf-8</value>
		<description>The character encoding to fall back to when no other
			information is available
		</description>
	</property>

	<property>
		<name>storage.data.store.class</name>
		<value>org.apache.gora.hbase.store.HBaseStore</value>
		<description>The Gora DataStore class for storing and retrieving data.
			Currently the following stores are available: ….
		</description>
	</property>

	<property>
		<name>hadoop.tmp.dir</name>
		<value>C:/data/hadoop_tmp</value>
		<description>Base directory for Hadoop's data</description>
	</property>

</configuration>

c. nutch-default.xml


<property>
  <name>plugin.folders</name>
  <value>./src/plugin</value>
  <description>Directories where nutch plugins are located.  Each
  element may be a relative or absolute path.  If absolute, it is used
  as is.  If relative, it is searched for on the classpath.</description>
</property>

d. hbase-site.xml


<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>

	<property>
		<name>hbase.master</name>
		<value>cr5:60000</value>
	</property>

	<property>
		<name>hbase.zookeeper.quorum</name>
		<value>cr8</value>
	</property>

	<property>
		<name>hbase.zookeeper.property.clientPort</name>
		<value>2181</value>
	</property>

</configuration>

e. ivy.xml


    <dependency org="org.apache.gora" name="gora-hbase" rev="0.2.1" conf="*->default" />

f. Create a urls folder and, inside it, a seed.txt file listing the links to crawl.
g. In regex-urlfilter.txt, add the regular expressions that select which URLs to crawl.
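
Step f can be done with a couple of shell commands. This is a minimal sketch; write_seeds is a hypothetical helper, and the URL in the usage line is only a placeholder for the site you actually want to crawl.

```shell
# Hypothetical helper: create the seed directory $1 and write every
# remaining argument as one URL per line into $1/seed.txt.
write_seeds() {
  dir="$1"
  shift
  mkdir -p "$dir"
  printf '%s\n' "$@" > "$dir/seed.txt"
}
```

For example, "write_seeds urls http://example.com/" run from the Nutch project directory creates urls/seed.txt with a single entry.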

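With the configuration done, one follow-up task remains:
the hbase-<version>.jar bundled with Nutch must be the same version as the deployed HBase.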
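
A quick way to compare the two versions is to read them off the jar file names. The sketch below assumes the jars follow the hbase-<version>.jar naming pattern and skips any hbase-<version>-tests.jar; hbase_jar_version is a hypothetical helper, and the directory holding Nutch's HBase jar depends on how the project was built, so it is passed in as an argument.

```shell
# Hypothetical helper: print the version of the first hbase-<version>.jar
# found in directory $1 (ignoring hbase-<version>-tests.jar).
hbase_jar_version() {
  for f in "$1"/hbase-[0-9]*.jar; do
    [ -e "$f" ] || continue
    b=${f##*/}                      # file name without the directory
    case "$b" in *-tests.jar) continue ;; esac
    b=${b#hbase-}                   # drop the "hbase-" prefix
    echo "${b%.jar}"                # drop the ".jar" suffix
    return 0
  done
  return 1
}
```

If the two calls, one on the Nutch library directory and one on the HBase installation directory, print different versions, replace the jar on the Nutch side with the one from the deployed HBase.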

Running Nutch
Set the Arguments in the run configuration:
1. Program arguments

urls -depth 3 -topN 5
Here urls is the seed folder created during the Nutch configuration above.

2. VM arguments

-Xms256m -Xmx512m -Dhadoop.log.dir=logs -Dhadoop.log.file=hadoop.log

All done!

[size=large][b]IV. Handling exceptions at run time[/b][/size]
1. point org.apache.nutch.net.URLNormalizer not found. See [url]http://youkimra.iteye.com/blog/1039903[/url]

2. ERROR org.apache.hadoop.mapred.TaskTracker: Can not start task tracker because java.io.IOException: Failed to set permissions of path: \tmp\hadoop-admin\mapred\local\ttprivate to 0700
See: [url]http://download.csdn.net/detail/java2000_wl/4326323[/url]

3. Some Nutch plugin classes are missing dependency jars; add the missing jars when you run into this.

When reposting, please credit the source: [url]http://wangwei3.iteye.com/blog/1818599[/url]
