Installing and configuring a Hadoop cluster falls into two parts. The first is host environment configuration: the operating system and supporting software the cluster depends on, covering OS installation, JDK installation and configuration, host planning and IP address mapping, and passwordless SSH authentication. The second is basic Hadoop configuration: the core components of the cluster itself, covering HDFS and MapReduce configuration.
The software versions used in this article:
- OS: Ubuntu-11.10
- Sun JDK: jdk-6u30-linux-i586.bin
- Hadoop: hadoop-0.22.0.tar.gz
Host Environment Configuration
- JDK Installation and Configuration
First, run the JDK installer:
cd ~/installation
chmod +x jdk-6u30-linux-i586.bin
./jdk-6u30-linux-i586.bin
To configure the JDK, edit the environment file ~/.bashrc (vi ~/.bashrc) and append the following lines at the end:
export JAVA_HOME=/home/hadoop/installation/jdk1.6.0_30
export CLASSPATH=$JAVA_HOME/lib/*.jar:$JAVA_HOME/jre/lib/*.jar
export PATH=$PATH:$JAVA_HOME/bin
Finally, apply the configuration and verify it:
. ~/.bashrc
echo $JAVA_HOME
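Since the same ~/.bashrc edit has to be repeated on every machine, it can be scripted so that re-running it is safe. A minimal sketch, where bashrc.demo stands in for ~/.bashrc and JDK_DIR for your actual JDK path:

```shell
#!/bin/sh
# Sketch: make the ~/.bashrc additions idempotent, so re-running the setup
# on each machine never duplicates them. bashrc.demo stands in for
# ~/.bashrc, and JDK_DIR for your actual JDK path.
BASHRC=./bashrc.demo
JDK_DIR=/home/hadoop/installation/jdk1.6.0_30
touch "$BASHRC"

add_jdk_env() {
  # only append when the JAVA_HOME line is not already present
  if ! grep -q "export JAVA_HOME=$JDK_DIR" "$BASHRC"; then
    {
      echo "export JAVA_HOME=$JDK_DIR"
      echo 'export CLASSPATH=$JAVA_HOME/lib/*.jar:$JAVA_HOME/jre/lib/*.jar'
      echo 'export PATH=$PATH:$JAVA_HOME/bin'
    } >> "$BASHRC"
  fi
}

add_jdk_env
add_jdk_env   # second run is a no-op
wc -l "$BASHRC"
```

Point the script at the real ~/.bashrc when using it on the cluster machines.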
Repeat the steps above on every machine in the cluster.
- Host Planning and IP Address Mapping
1. Master node (master) configuration
The cluster contains two kinds of nodes: master nodes and slave nodes. Here we configure one master and two slaves. On master, /etc/hosts contains:
127.0.0.1 localhost
192.168.0.190 master
192.168.0.186 slave-01
192.168.0.183 slave-02
# The following lines are desirable for IPv6 capable hosts
::1 ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
In addition, /etc/hostname on this node is set to:
master
2. Slave node slave-01 configuration
On slave-01, /etc/hosts contains:
127.0.0.1 localhost
192.168.0.190 master
192.168.0.186 slave-01
# The following lines are desirable for IPv6 capable hosts
::1 ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
/etc/hostname on slave-01 is set to:
slave-01
3. Slave node slave-02 configuration
On slave-02, /etc/hosts contains:
127.0.0.1 localhost
192.168.0.190 master
192.168.0.183 slave-02
# The following lines are desirable for IPv6 capable hosts
::1 ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
/etc/hostname on slave-02 is set to:
slave-02
4. Summary
Second, this approach scales well. Because /etc/hosts maps hostnames to IP addresses, the hostnames can stay the same even when the IPs change. When installing Hadoop, you must configure the masters and slaves files; if those files used IP addresses, imagine a 200-node cluster whose IPs are all reassigned because of a corporate network re-planning: every one of those configurations would have to be updated, a huge amount of work. With hostnames, the Hadoop configuration needs no change at all, and keeping the hostname-to-IP mapping current can be left to the system administrator.
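The mapping scheme above is easy to sanity-check with a small script. A sketch, using hosts.demo as a stand-in for /etc/hosts and the IPs from the plan above:

```shell
#!/bin/sh
# Sketch: verify that every cluster hostname maps to the expected IP in a
# hosts file. hosts.demo stands in for /etc/hosts; the IPs follow the
# host plan above.
HOSTS_FILE=./hosts.demo
cat > "$HOSTS_FILE" <<'EOF'
127.0.0.1 localhost
192.168.0.190 master
192.168.0.186 slave-01
192.168.0.183 slave-02
EOF

check() {  # usage: check <ip> <hostname>
  if grep -Eq "^$1[[:space:]]+$2([[:space:]]|\$)" "$HOSTS_FILE"; then
    echo "$2 -> $1 OK"
  else
    echo "$2 -> $1 MISSING"
  fi
}

check 192.168.0.190 master
check 192.168.0.186 slave-01
check 192.168.0.183 slave-02
```

Running the same check against the real /etc/hosts on each node catches copy-paste mistakes before Hadoop ever starts.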
- Passwordless SSH Authentication Configuration
1. Basic configuration
First check whether ssh is installed and running; if not, install it:
sudo apt-get install openssh-server
On the master node, generate a key pair:
ssh-keygen -t rsa
Press Enter through all the prompts; this generates the public key (~/.ssh/id_rsa.pub) and private key (~/.ssh/id_rsa) files. Then authorize the key for local logins:
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 644 ~/.ssh/authorized_keys
Verify the configuration with the following command:
ssh master
If you can log in to master (the local machine) without a password, the configuration is correct. Next, copy the master's public key to slave-01:
scp ~/.ssh/id_rsa.pub hadoop@slave-01:/home/hadoop/.ssh/id_rsa.pub.master
Because this step exchanges data between nodes (master to slave-01), you must enter the login password of the hadoop user on slave-01 for the remote copy to run. Entering a password here is normal; do not confuse it with the passwordless public-key authentication between nodes. Note: when distributing the public key, rename the copy of master's key file on the target node, so it does not overwrite a key file that already exists and is in use on the slave.
Then, on slave-01, append master's public key to the authorized keys:
cat ~/.ssh/id_rsa.pub.master >> ~/.ssh/authorized_keys
chmod 644 ~/.ssh/authorized_keys
Do the same for slave-02. On master:
scp ~/.ssh/id_rsa.pub hadoop@slave-02:/home/hadoop/.ssh/id_rsa.pub.master
And on slave-02:
cat ~/.ssh/id_rsa.pub.master >> ~/.ssh/authorized_keys
chmod 644 ~/.ssh/authorized_keys
Finally, verify from master:
ssh slave-01
ssh slave-02
If no password is required, the configuration succeeded.
2. Summary
(1) Why distribute master's public key to each slave node? There are two ways to authenticate. One is password authentication: you must know the login user name and password of the remote host to log in and perform authorized operations on it.
The other is key-based authentication, which requires no password. When host A accesses host B, if B's authorized keys include A's public key, then A can authenticate to B with that key instead of typing a password.
After the Hadoop master distributes its public key to the slaves and each slave is configured for public-key authentication, the master can log in to any slave over ssh and perform authorized operations there. For example, when the master stops the whole HDFS cluster (namenode and datanodes), it can log in to each slave directly and shut down its datanode, bringing down the whole cluster, so you do not have to log in to every datanode yourself to run the shutdown.
Both DSA and RSA can be used for authentication; they simply rest on different encryption and decryption algorithms. See the relevant references for details on DSA and RSA.
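The per-slave distribution steps above can be collected into a loop. A sketch that only generates the command list into a reviewable script rather than executing anything; the hostnames and hadoop user follow the cluster plan above:

```shell
#!/bin/sh
# Sketch: generate the key-distribution commands for every slave into a
# script, so they can be reviewed before being run. SLAVES and the hadoop
# user follow the cluster plan above.
SLAVES="slave-01 slave-02"
OUT=./distribute-keys.sh

: > "$OUT"   # truncate/create the output script
for host in $SLAVES; do
  echo "scp ~/.ssh/id_rsa.pub hadoop@$host:/home/hadoop/.ssh/id_rsa.pub.master" >> "$OUT"
  echo "ssh hadoop@$host 'cat ~/.ssh/id_rsa.pub.master >> ~/.ssh/authorized_keys && chmod 644 ~/.ssh/authorized_keys'" >> "$OUT"
done

cat "$OUT"   # inspect, then execute with: sh ./distribute-keys.sh
```

Each scp/ssh pair will still prompt for the slave's password once, exactly as in the manual steps; after that, logins are passwordless.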
Hadoop Cluster Configuration
- Hadoop Cluster Basic Configuration
1. Unpack the Hadoop package
cd ~/installation
tar -xvzf hadoop-0.22.0.tar.gz
2. Configure the Hadoop environment variables
Append the following lines to ~/.bashrc:
export HADOOP_HOME=/home/hadoop/installation/hadoop-0.22.0
export PATH=$PATH:$HADOOP_HOME/bin
3. Apply the configuration:
. ~/.bashrc
4. Configure the masters and slaves files
Edit the conf/masters file so that its content is:
master
Edit the conf/slaves file so that its content is:
slave-01
slave-02
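Since both files are plain host lists, they can be generated from one place so the cluster topology is defined only once. A sketch, with conf.demo standing in for $HADOOP_HOME/conf and hostnames from the plan above:

```shell
#!/bin/sh
# Sketch: generate conf/masters and conf/slaves from a single host list.
# conf.demo stands in for $HADOOP_HOME/conf; hostnames follow the plan
# above.
CONF=./conf.demo
MASTER=master
SLAVES="slave-01 slave-02"
mkdir -p "$CONF"

printf '%s\n' "$MASTER" > "$CONF/masters"
printf '%s\n' $SLAVES   > "$CONF/slaves"   # one host per line

cat "$CONF/masters" "$CONF/slaves"
```

(In Hadoop 0.2x, the masters file is read by the start scripts to decide where the SecondaryNameNode runs, while slaves lists the worker nodes.)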
5. Configure the hadoop-env.sh file
In conf/hadoop-env.sh, set JAVA_HOME:
export JAVA_HOME=/home/hadoop/installation/jdk1.6.0_30
Other options there can be configured as needed.
6. Configure the conf/core-site.xml file
The content of conf/core-site.xml is as follows:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://master:9000/</value>
<description></description>
</property>
<property>
<name>dfs.replication</name>
<value>3</value>
</property>
<property>
<name>fs.inmemory.size.mb</name>
<value>10</value>
<description>Larger amount of memory allocated for the in-memory file-system used to merge map-outputs at the reduces.</description>
</property>
<property>
<name>io.sort.factor</name>
<value>10</value>
<description>More streams merged at once while sorting files.</description>
</property>
<property>
<name>io.sort.mb</name>
<value>10</value>
<description>Higher memory-limit while sorting data.</description>
</property>
<property>
<name>io.file.buffer.size</name>
<value>131072</value>
<description>Size of read/write buffer used in SequenceFiles.</description>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/home/hadoop/storage/tmp/hadoop-${user.name}</value>
<description></description>
</property>
</configuration>
The settings above are basic HDFS-related properties that usually stay fixed while the system runs. Settings that need to vary with the actual application can go into hdfs-site.xml, explained below.
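To sanity-check any of these *-site.xml files without starting a daemon, a property value can be pulled out with awk. A sketch against a cut-down demo copy of core-site.xml (core-site.demo.xml is an illustrative stand-in, not the full config above):

```shell
#!/bin/sh
# Sketch: extract one property value from a Hadoop *-site.xml with awk.
# core-site.demo.xml is a cut-down stand-in for conf/core-site.xml.
cat > core-site.demo.xml <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://master:9000/</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/home/hadoop/storage/tmp/hadoop-${user.name}</value>
  </property>
</configuration>
EOF

get_prop() {  # usage: get_prop <property-name> <file>
  awk -v n="$1" '
    /<name>/  { gsub(/.*<name>|<\/name>.*/, "");   name = $0 }
    /<value>/ { gsub(/.*<value>|<\/value>.*/, ""); if (name == n) print }
  ' "$2"
}

get_prop fs.default.name core-site.demo.xml > fs_name.out
cat fs_name.out
```

This assumes the usual one-tag-per-line layout of Hadoop config files; for anything more complex, a real XML parser is safer.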
7. Configure the conf/hdfs-site.xml file
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>dfs.name.dir</name>
<value>/home/hadoop/storage/name/a,/home/hadoop/storage/name/b</value>
<description>Path on the local filesystem where the NameNode stores the namespace and transactions logs persistently.</description>
</property>
<property>
<name>dfs.data.dir</name>
<value>/home/hadoop/storage/data/a,/home/hadoop/storage/data/b,/home/hadoop/storage/data/c</value>
<description>Comma separated list of paths on the local filesystem of a DataNode where it should store its blocks.</description>
</property>
<property>
<name>dfs.block.size</name>
<value>67108864</value>
<description>HDFS blocksize of 64MB for large file-systems.</description>
</property>
<property>
<name>dfs.namenode.handler.count</name>
<value>10</value>
<description>More NameNode server threads to handle RPCs from large number of DataNodes.</description>
</property>
</configuration>
8. Configure the conf/mapred-site.xml file
The conf/mapred-site.xml file covers MapReduce settings; in practice, configure parameters such as the JVM heap sizes as needed. Its content is as follows:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>mapred.job.tracker</name>
<value>hdfs://master:19830/</value>
<description>Host or IP and port of JobTracker.</description>
</property>
<property>
<name>mapred.system.dir</name>
<value>/home/hadoop/storage/mapred/system</value>
<description>Path on the HDFS where the MapReduce framework stores system files. Note: This is in the default filesystem (HDFS) and must be accessible from both the server and client machines.</description>
</property>
<property>
<name>mapred.local.dir</name>
<value>/home/hadoop/storage/mapred/local</value>
<description>Comma-separated list of paths on the local filesystem where temporary MapReduce data is written. Note: Multiple paths help spread disk i/o.</description>
</property>
<property>
<name>mapred.tasktracker.map.tasks.maximum</name>
<value>10</value>
<description>The maximum number of Map tasks, which are run simultaneously on a given TaskTracker, individually. Note: Defaults to 2 maps, but vary it depending on your hardware.</description>
</property>
<property>
<name>mapred.tasktracker.reduce.tasks.maximum</name>
<value>2</value>
<description>The maximum number of Reduce tasks, which are run simultaneously on a given TaskTracker, individually. Note: Defaults to 2 reduces, but vary it depending on your hardware.</description>
</property>
<property>
<name>mapred.reduce.parallel.copies</name>
<value>5</value>
<description>Higher number of parallel copies run by reduces to fetch outputs from very large number of maps.</description>
</property>
<property>
<name>mapred.map.child.java.opts</name>
<value>-Xmx128M</value>
<description>Larger heap-size for child jvms of maps.</description>
</property>
<property>
<name>mapred.reduce.child.java.opts</name>
<value>-Xmx64M</value>
<description>Larger heap-size for child jvms of reduces.</description>
</property>
<property>
<name>tasktracker.http.threads</name>
<value>5</value>
<description>More worker threads for the TaskTracker's http server. The http server is used by reduces to fetch intermediate map-outputs.</description>
</property>
<property>
<name>mapred.queue.names</name>
<value>default</value>
<description>Comma separated list of queues to which jobs can be submitted. Note: The MapReduce system always supports atleast one queue with the name as default. Hence, this parameter's value should always contain the string default. Some job schedulers supported in Hadoop, like the Capacity Scheduler(http://hadoop.apache.org/common/docs/stable/capacity_scheduler.html), support multiple queues. If such a scheduler is being used, the list of configured queue names must be specified here. Once queues are defined, users can submit jobs to a queue using the property name mapred.job.queue.name in the job configuration. There could be a separate configuration file for configuring properties of these queues that is managed by the scheduler. Refer to the documentation of the scheduler for information on the same.</description>
</property>
<property>
<name>mapred.acls.enabled</name>
<value>false</value>
<description>Boolean, specifying whether checks for queue ACLs and job ACLs are to be done for authorizing users for doing queue operations and job operations. Note: If true, queue ACLs are checked while submitting and administering jobs and job ACLs are checked for authorizing view and modification of jobs. Queue ACLs are specified using the configuration parameters of the form mapred.queue.queue-name.acl-name, defined below under mapred-queue-acls.xml. Job ACLs are described at Job Authorization(http://hadoop.apache.org/common/docs/stable/mapred_tutorial.html#Job+Authorization).</description>
</property>
</configuration>
9. Distribute the installation files remotely
We have now configured the base environment on every machine and Hadoop on master, so the Hadoop installation must be copied into the corresponding directory on each slave. The Hadoop distribution contains a lot of source code and documentation that takes significant storage space; you can delete it before the remote copy (back it up first). I removed the following for now:
rm -rf common/ hdfs/ contrib/ c++/ mapreduce/
Next, run the remote copy commands:
cd ~/installation
scp -r ~/installation/hadoop-0.22.0/ hadoop@slave-01:/home/hadoop/installation
scp -r ~/installation/hadoop-0.22.0/ hadoop@slave-02:/home/hadoop/installation
10. Other configuration
Because the configuration above uses the storage directory, it must be created beforehand on master and on the slaves:
mkdir ~/storage
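The configs above actually reference several nested directories under storage; creating them all up front avoids surprises at startup. A sketch, where ROOT stands in for /home/hadoop/storage (mapred.system.dir lives on HDFS, so it is not created here):

```shell
#!/bin/sh
# Sketch: create all local directories referenced by dfs.name.dir,
# dfs.data.dir, mapred.local.dir and hadoop.tmp.dir, on master and on
# every slave. ROOT stands in for /home/hadoop/storage.
ROOT=./storage.demo
DIRS="$ROOT/name/a,$ROOT/name/b,$ROOT/data/a,$ROOT/data/b,$ROOT/data/c,$ROOT/mapred/local,$ROOT/tmp"

old_ifs=$IFS
IFS=,                      # split the comma-separated list, as Hadoop does
for d in $DIRS; do
  mkdir -p "$d"
done
IFS=$old_ifs

find "$ROOT" -type d | sort
```

Run the equivalent on each node with ROOT set to the real storage path.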
- Hadoop Cluster Configuration Verification
1. Start the HDFS cluster
First, format the NameNode on the master node:
hdfs namenode -format
If there are no errors, continue and start HDFS:
start-dfs.sh
At this point, running jps on master shows that NameNode and SecondaryNameNode have started; running jps on the slaves shows that DataNode has started.
You can inspect the daemon logs, for example:
tail -500f $HADOOP_HOME/logs/hadoop-hadoop-namenode-master.log
tail -500f $HADOOP_HOME/logs/hadoop-hadoop-secondarynamenode-master.log
tail -500f $HADOOP_HOME/logs/hadoop-hadoop-datanode-slave-01.log
tail -500f $HADOOP_HOME/logs/hadoop-hadoop-datanode-slave-02.log
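Instead of tailing each log in turn, it can be quicker to scan all of them for ERROR/FATAL lines at once. A sketch against a demo log directory standing in for $HADOOP_HOME/logs (the sample log lines are fabricated for illustration):

```shell
#!/bin/sh
# Sketch: scan all daemon logs for ERROR/FATAL lines instead of tailing
# each file by hand. logs.demo stands in for $HADOOP_HOME/logs; the log
# lines below are fabricated samples.
LOGDIR=./logs.demo
mkdir -p "$LOGDIR"
printf '%s\n' \
  '2011-12-31 11:40:00 INFO namenode.NameNode: STARTUP_MSG' \
  > "$LOGDIR/hadoop-hadoop-namenode-master.log"
printf '%s\n' \
  '2011-12-31 11:40:05 ERROR datanode.DataNode: demo error line' \
  > "$LOGDIR/hadoop-hadoop-datanode-slave-01.log"

# print file:line for every ERROR/FATAL hit, or a reassuring message
grep -En 'ERROR|FATAL' "$LOGDIR"/*.log || echo "no ERROR/FATAL lines"
```

On the real cluster, point LOGDIR at $HADOOP_HOME/logs on each node (or pull the logs over ssh).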
You can also monitor HDFS in a browser through Hadoop's built-in web server (Jetty):
master node: http://master:50070 or http://192.168.0.190:50070
slave-01 node: http://slave-01:50075 or http://192.168.0.186:50075
slave-02 node: http://slave-02:50075 or http://192.168.0.183:50075
2. Start the MapReduce cluster
start-mapred.sh
At this point you can check the logs:
tail -500f $HADOOP_HOME/logs/hadoop-hadoop-jobtracker-master.log
tail -500f $HADOOP_HOME/logs/hadoop-hadoop-tasktracker-slave-01.log
tail -500f $HADOOP_HOME/logs/hadoop-hadoop-tasktracker-slave-02.log
You can also monitor it in a browser through Hadoop's built-in web server (Jetty):
master node: http://master:50030 or http://192.168.0.190:50030
slave-01 node: http://slave-01:50060 or http://192.168.0.186:50060
slave-02 node: http://slave-02:50060 or http://192.168.0.183:50060
3. Upload files to HDFS
For example, upload a file to HDFS using the copyFromLocal command:
hadoop@master:~/installation/hadoop-0.22.0$ hadoop fs -lsr
drwxr-xr-x - hadoop supergroup 0 2011-12-31 11:40 /user/hadoop/storage
hadoop@master:~/installation/hadoop-0.22.0$ hadoop fs -mkdir storage/input
hadoop@master:~/installation/hadoop-0.22.0$ hadoop fs -copyFromLocal ~/storage/files/extfile.txt ./storage/input/attractions.txt
hadoop@master:~/installation/hadoop-0.22.0$ hadoop fs -lsr
drwxr-xr-x - hadoop supergroup 0 2011-12-31 11:41 /user/hadoop/storage
drwxr-xr-x - hadoop supergroup 0 2011-12-31 11:41 /user/hadoop/storage/input
-rw-r--r-- 3 hadoop supergroup 66609 2011-12-31 11:41 /user/hadoop/storage/input/attractions.txt
This uploads the file extfile.txt to HDFS, renaming it to attractions.txt.
4. Run a MapReduce job
For example, run the wordcount example against the uploaded input:
hadoop jar $HADOOP_HOME/hadoop-mapred-examples-0.22.0.jar wordcount ./storage/input/ $HADOOP_HOME/output
5. Summary
To access the cluster's web interfaces by hostname from another machine, add the following mappings to that machine's hosts file:
192.168.0.190 master
192.168.0.186 slave-01
192.168.0.183 slave-02
After saving, you can access the cluster nodes by hostname in a browser and see each node's basic information.