Docker + Ubuntu 14.04 + Hadoop 2.6.0

Starting a container

docker run -it ubuntu:14.04

With the container running, the next steps are installing Java and Hadoop and doing the related configuration.

Switching the apt sources to the Aliyun Ubuntu 14.04 mirror

deb http://mirrors.aliyun.com/ubuntu/ trusty main restricted universe multiverse
deb http://mirrors.aliyun.com/ubuntu/ trusty-security main restricted universe multiverse
deb http://mirrors.aliyun.com/ubuntu/ trusty-updates main restricted universe multiverse
deb http://mirrors.aliyun.com/ubuntu/ trusty-proposed main restricted universe multiverse
deb http://mirrors.aliyun.com/ubuntu/ trusty-backports main restricted universe multiverse
deb-src http://mirrors.aliyun.com/ubuntu/ trusty main restricted universe multiverse
deb-src http://mirrors.aliyun.com/ubuntu/ trusty-security main restricted universe multiverse
deb-src http://mirrors.aliyun.com/ubuntu/ trusty-updates main restricted universe multiverse
deb-src http://mirrors.aliyun.com/ubuntu/ trusty-proposed main restricted universe multiverse
deb-src http://mirrors.aliyun.com/ubuntu/ trusty-backports main restricted universe multiverse
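The bare ubuntu:14.04 image typically ships without a text editor, so one simple way to apply the mirror switch inside the container is to overwrite the list from the shell. A minimal sketch (back up the stock list first; add the deb-src lines the same way if you need source packages):

# back up the stock list, then replace it with the Aliyun entries
cp /etc/apt/sources.list /etc/apt/sources.list.bak
cat > /etc/apt/sources.list <<'EOF'
deb http://mirrors.aliyun.com/ubuntu/ trusty main restricted universe multiverse
deb http://mirrors.aliyun.com/ubuntu/ trusty-security main restricted universe multiverse
deb http://mirrors.aliyun.com/ubuntu/ trusty-updates main restricted universe multiverse
deb http://mirrors.aliyun.com/ubuntu/ trusty-proposed main restricted universe multiverse
deb http://mirrors.aliyun.com/ubuntu/ trusty-backports main restricted universe multiverse
EOF
apt-get update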

Installing Java

Run the following commands in order:

sudo apt-get -y install software-properties-common python-software-properties
sudo add-apt-repository ppa:webupd8team/java
sudo apt-get update
apt-get -y install oracle-java8-installer

Notes:

  • Java 8 is what gets installed here.
  • By default the official Ubuntu mirrors are used; if downloads are slow, switch to a different mirror yourself. If you are not sure how to do that from the command line, see here.
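Once the installer finishes, a quick sanity check that Java 8 is on the PATH:

java -version
# expect output along the lines of: java version "1.8.0_xx" (the exact build number will vary)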

Also, it is worth committing the container with Java installed as an image snapshot, so that other images can be built on top of it later. The commands:

root@122a2cecdd14:~# exit
docker commit -m "java install" 122a2cecdd14 ubuntu:java

In the command above, 122a2cecdd14 is the ID of the current container, and ubuntu:java names the new image: ubuntu is the repository name and java is the tag.

How to find the container ID:

  • A quick way to find the ID is the string after the @ in the shell prompt. This only works if no hostname was specified when the container was started.
  • Use docker ps to list all running containers and read the ID from the output (see the example below).
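For example, run this on the host (not inside the container):

docker ps          # lists running containers; the ID is in the first column
docker ps -l -q    # prints only the ID of the most recently created container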

Installing Hadoop

Now we are gradually getting to the main topic O(∩_∩)O~

Start a container from the image we just saved with Java installed:

docker run -ti ubuntu:java

Once it is up, we can install Hadoop. Here we download the release archive directly with wget.

1. Install wget first:

sudo apt-get -y install wget

2. Download and extract the release:

root@8ef06706f88d:~# cd /usr/local
root@8ef06706f88d:/usr/local# wget http://mirrors.sonic.net/apache/hadoop/common/hadoop-2.6.0/hadoop-2.6.0.tar.gz
root@8ef06706f88d:/usr/local# tar xvzf hadoop-2.6.0.tar.gz
root@8ef06706f88d:/usr/local# mv hadoop-2.6.0 hadoop
root@8ef06706f88d:/usr/local# rm hadoop-2.6.0.tar.gz

Note: the Hadoop version installed here is 2.6.0. If you need a different version, find its download link here and adjust the command accordingly.

3. Configure environment variables

Edit the ~/.bashrc file and append the following at the end:

export JAVA_HOME=/usr/lib/jvm/java-8-oracle
export HADOOP_HOME=/usr/local/hadoop
export HADOOP_CONFIG_HOME=$HADOOP_HOME/etc/hadoop
export PATH=$PATH:$HADOOP_HOME/bin
export PATH=$PATH:$HADOOP_HOME/sbin

Note: since Java was installed with apt-get, you can find out where it lives with the following command if you are not sure:

root@8ef06706f88d:~# update-alternatives --config java
There is only one alternative in link group java (providing /usr/bin/java): /usr/lib/jvm/java-8-oracle/jre/bin/java
Nothing to configure.
root@8ef06706f88d:~#   
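After saving ~/.bashrc, reload it in the current shell and confirm that the variables and the hadoop binary are picked up:

source ~/.bashrc
echo $HADOOP_HOME    # should print /usr/local/hadoop
hadoop version       # should report Hadoop 2.6.0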

4. Configure Hadoop

Next we edit Hadoop's configuration files, mainly these three: core-site.xml, hdfs-site.xml, and mapred-site.xml.

Before editing them, run the following commands:

root@8ef06706f88d:~# cd $HADOOP_HOME/
root@8ef06706f88d:/usr/local/hadoop# mkdir tmp
root@8ef06706f88d:/usr/local/hadoop# cd tmp/
root@8ef06706f88d:/usr/local/hadoop/tmp# pwd
/usr/local/hadoop/tmp
root@8ef06706f88d:/usr/local/hadoop/tmp# cd ../
root@8ef06706f88d:/usr/local/hadoop# mkdir namenode
root@8ef06706f88d:/usr/local/hadoop# cd namenode/
root@8ef06706f88d:/usr/local/hadoop/namenode# pwd
/usr/local/hadoop/namenode
root@8ef06706f88d:/usr/local/hadoop/namenode# cd ../
root@8ef06706f88d:/usr/local/hadoop# mkdir datanode
root@8ef06706f88d:/usr/local/hadoop# cd datanode/
root@8ef06706f88d:/usr/local/hadoop/datanode# pwd
/usr/local/hadoop/datanode
root@8ef06706f88d:/usr/local/hadoop/datanode# cd $HADOOP_CONFIG_HOME/
root@8ef06706f88d:/usr/local/hadoop/etc/hadoop# cp mapred-site.xml.template mapred-site.xml
root@8ef06706f88d:/usr/local/hadoop/etc/hadoop# nano hdfs-site.xml

These commands create three directories that the configuration below refers to:

  1. tmp: Hadoop's temporary directory
  2. namenode: storage directory for the NameNode
  3. datanode: storage directory for the DataNode

1) core-site.xml configuration

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->

<!-- Put site-specific property overrides in this file. -->

<configuration>
    <property>
            <name>hadoop.tmp.dir</name>
            <value>/usr/local/hadoop/tmp</value>
            <description>A base for other temporary directories.</description>
    </property>

    <property>
            <name>fs.default.name</name>
            <value>hdfs://master:9000</value>
            <final>true</final>
            <description>The name of the default file system.  A URI whose
            scheme and authority determine the FileSystem implementation.  The
            uri's scheme determines the config property (fs.SCHEME.impl) naming
            the FileSystem implementation class.  The uri's authority is used to
            determine the host, port, etc. for a filesystem.</description>
    </property>
</configuration>

Notes:

  • hadoop.tmp.dir points at the temporary directory created with the commands above.
  • fs.default.name is set to hdfs://master:9000, i.e. the host of the master node (we will set that node up later when building the cluster; it is written here in advance).

2) hdfs-site.xml configuration

Edit the hdfs-site.xml file with nano hdfs-site.xml:

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->

<!-- Put site-specific property overrides in this file. -->

<configuration>
    <property>
        <name>dfs.replication</name>
        <value>2</value>
        <final>true</final>
        <description>Default block replication.
        The actual number of replications can be specified when the file is created.
        The default is used if replication is not specified in create time.
        </description>
    </property>

    <property>
        <name>dfs.namenode.name.dir</name>
        <value>/usr/local/hadoop/namenode</value>
        <final>true</final>
    </property>

    <property>
        <name>dfs.datanode.data.dir</name>
        <value>/usr/local/hadoop/datanode</value>
        <final>true</final>
    </property>
</configuration>

Notes:

  • When we build the cluster later there will be one master node and two slave nodes, so dfs.replication is set to 2.
  • dfs.namenode.name.dir and dfs.datanode.data.dir point at the NameNode and DataNode directories created earlier.

3) mapred-site.xml configuration

The Hadoop distribution ships a mapred-site.xml.template, which is why we ran cp mapred-site.xml.template mapred-site.xml earlier to create the mapred-site.xml file. Edit that file with nano mapred-site.xml:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->

<!-- Put site-specific property overrides in this file. -->

<configuration>
    <property>
        <name>mapred.job.tracker</name>
        <value>master:9001</value>
        <description>The host and port that the MapReduce job tracker runs
        at.  If "local", then jobs are run in-process as a single map
        and reduce task.
        </description>
    </property>
</configuration>

There is only one property here, mapred.job.tracker, which we point at the master node.
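Note that mapred.job.tracker is a legacy MRv1 property. Since start-yarn.sh is used later to bring up the YARN daemons, an optional variant (my addition, not part of the original walkthrough) is to also tell MapReduce to submit jobs to YARN:

    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
        <!-- run MapReduce jobs on YARN instead of the legacy MRv1 runtime -->
    </property>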

4) Set JAVA_HOME in hadoop-env.sh

Edit hadoop-env.sh (in $HADOOP_CONFIG_HOME) with nano hadoop-env.sh and change the following line:

# The java implementation to use.
export JAVA_HOME=/usr/lib/jvm/java-8-oracle

5. Format the NameNode

This is an important step. Run the command hadoop namenode -format.
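For reference, the deprecated and current forms of the command (run only one of them):

hadoop namenode -format    # still works in 2.6.0, but prints a deprecation warning
hdfs namenode -format      # the equivalent, non-deprecated form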

6. Install SSH

A cluster setup naturally needs SSH: it enables passwordless access, which is convenient when moving between the cluster machines.

root@8ef06706f88d:~# sudo apt-get -y install ssh
root@8ef06706f88d:~# service ssh start

Because we are running inside a Docker container, the SSH service does not start automatically after installation; we would have to start it by hand with /usr/sbin/sshd every time the container starts. To save ourselves that hassle, we add the command to ~/.bashrc. Edit it with nano ~/.bashrc (install nano if it is missing, or use vi) and append the following:

#autorun
/usr/sbin/sshd
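One caveat: when sshd is launched directly like this (rather than via service ssh start), some setups refuse to start because the privilege separation directory is missing. If you run into that, creating the directory in the same snippet is enough; a hedged variant of the lines above:

#autorun
mkdir -p /var/run/sshd    # sshd may refuse to start if this directory does not exist
/usr/sbin/sshd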

7. Generate SSH keys

root@8ef06706f88d:/# cd ~/
root@8ef06706f88d:~# ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
root@8ef06706f88d:~# cd .ssh
root@8ef06706f88d:~/.ssh# cat id_dsa.pub >> authorized_keys

Note: my idea here is to generate the key pair once and bake it into the image, to avoid generating keys separately in every container and copying public keys around, which is tedious. Of course this is only for learning; in a real deployment you would not do this, because every container would end up with the same key pair!!
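Another convenience worth considering (my addition, not one of the original steps): the first SSH connection to each node normally stops to ask about the host key, which would also interrupt start-all.sh later. For this throwaway learning setup, strict host-key checking can simply be disabled:

cat > ~/.ssh/config <<'EOF'
Host *
    StrictHostKeyChecking no
    UserKnownHostsFile /dev/null
EOF
chmod 600 ~/.ssh/config    # keep permissions tight; ssh rejects group/world-writable config files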

8. Save an image snapshot

Here we commit the image with Hadoop installed as another snapshot.

root@8ef06706f88d:~# exit
king@king:~$ docker commit -m "hadoop install" 8ef06706f88d ubuntu:hadoop

Setting up the Hadoop distributed cluster

Now for the important part!

A basic Hadoop cluster needs one master node, which mainly runs the namenode, secondarynamenode and jobtracker tasks (the names have changed in newer versions). The other two nodes are both slave nodes, one of which exists for redundancy; without redundancy it could hardly be called Hadoop, so a simulated Hadoop cluster needs at least three nodes.

The Hadoop image has already been built; next we use it to set up the master node and the slave nodes:

Node    hostname  ip        Roles                                    Docker launch command
Master  master    10.0.0.5  namenode, secondaryNamenode, jobTracker  docker run -ti -h master ubuntu:hadoop
Slave   slave1    10.0.0.6  datanode, taskTracker                    docker run -ti -h slave1 ubuntu:hadoop
Slave   slave2    10.0.0.7  datanode, taskTracker                    docker run -ti -h slave2 ubuntu:hadoop

Starting the Docker containers

Recall that Docker starts containers with the run command:

docker run -ti ubuntu:hadoop

There are a few problems here:

  1. The IP address of a Docker container is assigned automatically at startup and cannot be changed by hand.
  2. Changes to the hostname and the hosts file made inside a container only hold for that container's lifetime; if the container exits and is restarted, both are reset, and neither can be written into an image with the commit command.

When building the cluster we need to set each node's hostname and configure hosts resolution. The hostname can be set directly with the -h option of docker run. Hosts resolution is trickier: the --link option of docker run can inject host entries, but our cluster requires every pair of machines to be able to ping each other, and while one of the containers has not started yet its IP is unknown, so --link does not fit this scenario. A proper fix would probably require a dedicated name-resolution service, i.e. the --dns option (see here).

Since this is only for learning, we will not go that far and will simply edit the hosts files by hand. It does have to be redone every time; my Docker knowledge is shallow and I have not solved this yet. There is surely a better way, and I would be grateful if someone more experienced could point it out!!

Start the master container:

docker run -ti -h master ubuntu:hadoop

Start the slave1 container:

docker run -ti -h slave1 ubuntu:hadoop

Start the slave2 container:

docker run -ti -h slave2 ubuntu:hadoop

Configure hosts

  1. Get each node's IP with the ifconfig command. The addresses depend on your environment; on my machine, for example: 
    • master: 10.0.0.5
    • slave1: 10.0.0.6
    • slave2: 10.0.0.7
  2. Use sudo nano /etc/hosts to write the following entries into each node's hosts file, adjusting the IP addresses to your own (a one-command alternative is sketched right after these entries):

    10.0.0.5        master
    10.0.0.6        slave1
    10.0.0.7        slave2
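If you would rather not open an editor in every container, the same entries can be appended with a single command (adjust the IPs to whatever ifconfig reported in your environment):

cat >> /etc/hosts <<'EOF'
10.0.0.5        master
10.0.0.6        slave1
10.0.0.7        slave2
EOF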
    

Configure slaves

Next we specify which nodes are slaves. Older Hadoop versions had both a masters file and a slaves file; newer versions only have the slaves file.

Run the following in the master node container:

root@master:~# cd $HADOOP_CONFIG_HOME/
root@master:~/soft/apache/hadoop/hadoop-2.6.0/etc/hadoop# nano slaves 

Write the slave nodes' hostnames into that file:

slave1
slave2
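Before starting Hadoop, it is worth confirming from the master container that the slave hostnames resolve and that passwordless SSH works; each command should print the slave's hostname without asking for a password:

ssh slave1 hostname
ssh slave2 hostname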

Starting Hadoop

On the master node, run the start-all.sh command to start Hadoop.

The exciting moment...

 

If you see output like the following, the startup succeeded:

root@master:~/soft/apache/hadoop/hadoop-2.6.0/etc/hadoop# start-all.sh 
This script is Deprecated. Instead use start-dfs.sh and start-yarn.sh
Starting namenodes on [master]
master: starting namenode, logging to /root/soft/apache/hadoop/hadoop-2.6.0/logs/hadoop-root-namenode-master.out
slave1: starting datanode, logging to /root/soft/apache/hadoop/hadoop-2.6.0/logs/hadoop-root-datanode-slave1.out
slave2: starting datanode, logging to /root/soft/apache/hadoop/hadoop-2.6.0/logs/hadoop-root-datanode-slave2.out
Starting secondary namenodes [0.0.0.0]
0.0.0.0: starting secondarynamenode, logging to /root/soft/apache/hadoop/hadoop-2.6.0/logs/hadoop-root-secondarynamenode-master.out
starting yarn daemons
starting resourcemanager, logging to /root/soft/apache/hadoop/hadoop-2.6.0/logs/yarn--resourcemanager-master.out
slave1: starting nodemanager, logging to /root/soft/apache/hadoop/hadoop-2.6.0/logs/yarn-root-nodemanager-slave1.out
slave2: starting nodemanager, logging to /root/soft/apache/hadoop/hadoop-2.6.0/logs/yarn-root-nodemanager-slave2.out

Run the jps command on each node; the results look like this:

On the master node:

root@master:~/soft/apache/hadoop/hadoop-2.6.0/etc/hadoop# jps
1223 Jps
992 SecondaryNameNode
813 NameNode
1140 ResourceManager

On the slave1 node:

root@slave1:~/soft/apache/hadoop/hadoop-2.6.0/etc/hadoop# jps
258 NodeManager
352 Jps
159 DataNode

On the slave2 node:

root@slave2:~/soft/apache/hadoop/hadoop-2.6.0/etc/hadoop# jps
371 Jps
277 NodeManager
178 DataNode

Next, on the master node, run hdfs dfsadmin -report to check whether the DataNodes came up correctly:

root@master:~/soft/apache/hadoop/hadoop-2.6.0/etc/hadoop# hdfs dfsadmin -report
Configured Capacity: 167782006784 (156.26 GB)
Present Capacity: 58979344384 (54.93 GB)
DFS Remaining: 58979295232 (54.93 GB)
DFS Used: 49152 (48 KB)
DFS Used%: 0.00%
Under replicated blocks: 0
Blocks with corrupt replicas: 0
Missing blocks: 0

-------------------------------------------------
Live datanodes (2):

Name: 10.0.0.7:50010 (slave2)
Hostname: slave2
Decommission Status : Normal
Configured Capacity: 83891003392 (78.13 GB)
DFS Used: 24576 (24 KB)
Non DFS Used: 54401331200 (50.67 GB)
DFS Remaining: 29489647616 (27.46 GB)
DFS Used%: 0.00%
DFS Remaining%: 35.15%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 1
Last contact: Sat Feb 28 07:27:05 UTC 2015


Name: 10.0.0.6:50010 (slave1)
Hostname: slave1
Decommission Status : Normal
Configured Capacity: 83891003392 (78.13 GB)
DFS Used: 24576 (24 KB)
Non DFS Used: 54401331200 (50.67 GB)
DFS Remaining: 29489647616 (27.46 GB)
DFS Used%: 0.00%
DFS Remaining%: 35.15%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 1
Last contact: Sat Feb 28 07:27:05 UTC 2015

You can also check the state of the NameNode and DataNodes through the web UI at http://10.0.0.5:50070/ (my host machine has no hosts entry for master, so I can only reach it by IP; replace the IP with your own master container's address):
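If your browser is not running on the Docker host itself, the container's 10.0.0.x address will not be reachable directly. One option (my addition, not used in the walkthrough above) is to publish the NameNode web port when starting the master container, and then browse to the Docker host's address instead:

docker run -ti -h master -p 50070:50070 ubuntu:hadoop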

 
