Hadoop has a natural advantage when it comes to analyzing massive data sets. Today I spent some time setting up pseudo-distributed mode on my own Linux machine. The process had plenty of twists and turns, so I am summarizing the experience here.
First, understand Hadoop's three installation modes:
1. Standalone mode. Standalone mode is Hadoop's default. When the configuration files are empty, Hadoop runs entirely locally. Because it does not need to interact with other nodes, standalone mode uses neither HDFS nor any of Hadoop's daemons. This mode is mainly used for developing and debugging the application logic of MapReduce programs.
2. Pseudo-distributed mode. All the Hadoop daemons run on the local machine, simulating a small cluster. On top of standalone mode, this mode adds debugging capability: it lets you inspect memory usage, HDFS input/output, and the interactions among the daemons.
3. Fully distributed mode. The Hadoop daemons run on a cluster of machines.
References:
1. Installing Hadoop 1.0.0 on Ubuntu 11.10 (standalone and pseudo-distributed)
5. Setting up a Hadoop environment on Ubuntu (standalone mode + pseudo-distributed mode)
6. Hadoop quick start: setting up a Hadoop environment on Ubuntu (standalone mode + pseudo-distributed mode)
I strongly recommend references 5 and 6: they progress from simple to advanced, give detailed steps, and include worked examples. Below I retrace my own installation roughly; to save time, much of the text is pasted from references 5 and 6, and I thank both authors again for sharing their installation experiences.
My setup: Ubuntu 12.04, username derek, hostname derekUbun, Hadoop release hadoop-1.1.2.tar.gz. Without further ado, here are the steps, with the output of each:
I. Create a hadoop group and user on Ubuntu
1. Add a hadoop user to the system
- derek@derekUbun:~$ sudo addgroup hadoop
- derek@derekUbun:~$ sudo adduser --ingroup hadoop hadoop
2. So far we have only added a user named hadoop; it has no administrator privileges. To grant them, open the /etc/sudoers file:
- derek@derekUbun:~$ sudo gedit /etc/sudoers
Below the line root ALL=(ALL:ALL) ALL, add hadoop ALL=(ALL:ALL) ALL
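After the edit, the relevant fragment of /etc/sudoers should read as follows (the root line already exists; the hadoop line is the one being added):

```
root    ALL=(ALL:ALL) ALL
hadoop  ALL=(ALL:ALL) ALL
```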
II. Configure SSH
SSH is configured so that commands can be executed between machines without typing a login password. Avoiding the password prompt is essential; otherwise the master node would need the password entered by hand every time it tries to reach another node.
How passwordless SSH works: the master (namenode/jobtracker) acts as the client. To authenticate with a public key, without a password, when connecting to a slave server (datanode/tasktracker), a key pair (one public key and one private key) is generated on the master, and the public key is then copied to every slave. When the master connects to a slave over SSH, the slave generates a random number, encrypts it with the master's public key, and sends it to the master. The master decrypts it with its private key and sends the decrypted number back; once the slave confirms the number is correct, it allows the master to connect. That is the entire public-key authentication process, and no password needs to be typed at any point. The key step is copying the master's public key to the slaves.
1. Install ssh
1) Since Hadoop communicates over ssh, install ssh first. Note that I switched from the derek user to hadoop.
- derek@derekUbun:~$ su - hadoop
- Password:
- hadoop@derekUbun:~$ sudo apt-get install openssh-server
- [sudo] password for hadoop:
- Reading package lists... Done
- Building dependency tree
- Reading state information... Done
- openssh-server is already the newest version.
- The following packages were automatically installed and are no longer required:
-   kde-l10n-de language-pack-kde-de language-pack-kde-en ssh-krb5
-   language-pack-de-base language-pack-kde-zh-hans language-pack-kde-en-base
-   kde-l10n-engb language-pack-kde-de-base kde-l10n-zhcn firefox-locale-de
-   language-pack-de language-pack-kde-zh-hans-base
- Use 'apt-get autoremove' to remove them.
- 0 upgraded, 0 newly installed, 0 to remove and 505 not upgraded.
Since my machine already had the latest ssh installed, this step actually did nothing.
2) With ssh installed, start the service. Once started, you can check with a command that the service is running properly:
- hadoop@derekUbun:~$ sudo /etc/init.d/ssh start
- Rather than invoking init scripts through /etc/init.d, use the service(8)
- utility, e.g. service ssh start
- Since the script you are attempting to invoke has been converted to an
- Upstart job, you may also use the start(8) utility, e.g. start ssh
- hadoop@derekUbun:~$ ps -e |grep ssh
- 759 ? 00:00:00 sshd
- 1691 ? 00:00:00 ssh-agent
- 12447 ? 00:00:00 ssh
- 12448 ? 00:00:00 sshd
- 12587 ? 00:00:00 sshd
- hadoop@derekUbun:~$
3) ssh is a secure communication protocol (keys can be generated with either rsa or dsa; rsa is the default), and logging in normally requires a password. To enable passwordless login, generate a private/public key pair:
- hadoop@derekUbun:~$ ssh-keygen -t rsa -P ""
- Generating public/private rsa key pair.
- Enter file in which to save the key (/home/hadoop/.ssh/id_rsa):
- /home/hadoop/.ssh/id_rsa already exists.
- Overwrite (y/n)? y
- Your identification has been saved in /home/hadoop/.ssh/id_rsa.
- Your public key has been saved in /home/hadoop/.ssh/id_rsa.pub.
- The key fingerprint is:
- c7:36:c7:77:91:a2:32:28:35:a6:9f:36:dd:bd:dc:4f hadoop@derekUbun
- The key's randomart image is:
- +--[ RSA 2048]----+
- | |
- | .|
- | + . o |
- | + o. .. . .|
- | o .So=.o . .|
- | o oo+o.. . |
- | = . . . E|
- | . . . o. |
- | o .o|
- +-----------------+
- hadoop@derekUbun:~$
(Note: after pressing Enter, two files are created under ~/.ssh/: id_rsa and id_rsa.pub. They come as a pair; the former is the private key, the latter the public key.)
Go to the ~/.ssh/ directory and append the public key id_rsa.pub to the authorized_keys file, which does not exist at first (authorized_keys holds the public keys of all users allowed to log in via ssh as the current user):
- hadoop@derekUbun:~$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
Now ssh in to confirm that future logins need no password:
- hadoop@derekUbun:~$ ssh localhost
- Welcome to Ubuntu 12.04 LTS (GNU/Linux 3.2.0-27-generic-pae i686)
- * Documentation: https://help.ubuntu.com/
- 512 packages can be updated.
- 151 updates are security updates.
- Last login: Mon Mar 11 15:56:15 2013 from localhost
- hadoop@derekUbun:~$
(Note: once ssh'd into a remote machine, you are controlling that remote machine; you must run the exit command to regain control of the local host.)
Log out: ~$ exit
From now on, logins no longer require a password.
- hadoop@derekUbun:~$ exit
- Connection to localhost closed.
- hadoop@derekUbun:~$
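If passwordless login still prompts for a password, the usual culprit is permissions on ~/.ssh: sshd refuses an authorized_keys file that is group- or world-writable. The whole key setup from this section can be sketched as follows; the /tmp/ssh-demo directory is a stand-in chosen so the sketch can be replayed safely, while on the real machine you would operate on ~/.ssh itself:

```sh
# Throwaway directory standing in for ~/.ssh (assumption for this demo)
DEMO=/tmp/ssh-demo/.ssh
mkdir -p "$DEMO"
# Generate an RSA key pair with an empty passphrase, as in step 3) above
ssh-keygen -t rsa -P "" -f "$DEMO/id_rsa" -q
# Authorize the public key and tighten permissions, which sshd requires
cat "$DEMO/id_rsa.pub" >> "$DEMO/authorized_keys"
chmod 700 "$DEMO" && chmod 600 "$DEMO/authorized_keys"
ls -l "$DEMO"
```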
III. Install Java
Install Java as the derek user. Java is already installed on my machine, under /usr/java/jdk1.7.0_17, so I can simply display the installed version:
- hadoop@derekUbun:~$ su - derek
- Password:
- derek@derekUbun:~$ java -version
- java version "1.7.0_17"
- Java(TM) SE Runtime Environment (build 1.7.0_17-b02)
- Java HotSpot(TM) Server VM (build 23.7-b01, mixed mode)
IV. Install hadoop-1.1.2
Download the Hadoop release from the official site; I downloaded the then-latest hadoop-1.1.2.tar.gz. Unpack it into the directory of your choice. I put hadoop-1.1.2.tar.gz under /usr/local and renamed the unpacked folder to hadoop.
- hadoop@derekUbun:/usr/local$ sudo tar xzf hadoop-1.1.2.tar.gz (note: I had already copied hadoop-1.1.2.tar.gz to /usr/local and switched to the hadoop user)
- hadoop@derekUbun:/usr/local$ sudo mv hadoop-1.1.2 /usr/local/hadoop
To make sure all subsequent operations are performed as the hadoop user, change the owner of the hadoop folder to hadoop:
- hadoop@derekUbun:/usr/local$ sudo chown -R hadoop:hadoop hadoop
V. Configure hadoop-env.sh (the Java installation path)
Log in as the hadoop user, go to /usr/local/hadoop, and open conf/hadoop-env.sh. Find the line #export JAVA_HOME=..., remove the #, set it to your machine's JDK path, and add the following:
export JAVA_HOME=/usr/java/jdk1.7.0_17 (adjust to your machine's Java installation path; mine is /usr/java/jdk1.7.0_17)
export HADOOP_INSTALL=/usr/local/hadoop (note: I use HADOOP_INSTALL rather than HADOOP_HOME, because the latter is deprecated in newer versions and triggers a warning if used)
export PATH=$PATH:/usr/local/hadoop/bin
- hadoop@derekUbun:/usr/local/hadoop$ sudo vi conf/hadoop-env.sh
- # Set Hadoop-specific environment variables here.
- # The only required environment variable is JAVA_HOME. All others are
- # optional. When running a distributed configuration it is best to
- # set JAVA_HOME in this file, so that it is correctly defined on
- # remote nodes.
- # The java implementation to use. Required.
- # export JAVA_HOME=/usr/lib/j2sdk1.5-sun
- export JAVA_HOME=/usr/java/jdk1.7.0_17
- export HADOOP_INSTALL=/usr/local/hadoop
- export PATH=$PATH:/usr/local/hadoop/bin
- # Extra Java CLASSPATH elements. Optional.
- # export HADOOP_CLASSPATH=
- # The maximum amount of heap to use, in MB. Default is 1000.
- # export HADOOP_HEAPSIZE=2000
- # Extra Java runtime options. Empty by default.
- # export HADOOP_OPTS=-server
- "conf/hadoop-env.sh" 57L, 2356C
Then make the environment variables take effect with source:
- hadoop@derekUbun:/usr/local/hadoop$ source /usr/local/hadoop/conf/hadoop-env.sh
At this point, Hadoop's standalone mode is installed successfully. You can display the Hadoop version:
- hadoop@derekUbun:/usr/local/hadoop$ hadoop version
- Hadoop 1.1.2
- Subversion https://svn.apache.org/repos/asf/hadoop/common/branches/branch-1.1 -r 1440782
- Compiled by hortonfo on Thu Jan 31 02:03:24 UTC 2013
- From source with checksum c720ddcf4b926991de7467d253a79b8b
- hadoop@derekUbun:/usr/local/hadoop$
Now run Hadoop's bundled WordCount example to get a feel for the MapReduce process.
Create an input folder under the hadoop directory:
- hadoop@derekUbun:/usr/local/hadoop$ mkdir input
Copy all the files from conf into the input folder:
- hadoop@derekUbun:/usr/local/hadoop$ cp conf/* input
Run the WordCount program and save the results to output:
- hadoop@derekUbun:/usr/local/hadoop$ bin/hadoop jar hadoop-examples-1.1.2.jar wordcount input output
View the results:
- hadoop@derekUbun:/usr/local/hadoop$ cat output/*
You will see every word from the files in conf listed with its frequency.
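As a side note, what WordCount computes can be mimicked with ordinary shell tools. Here is a small local sketch on made-up sample files (the file names and contents are invented for the demo):

```sh
# Create hypothetical sample input (stands in for the conf/* files)
mkdir -p /tmp/wc-demo/input
printf 'hello world\nhello hadoop\n' > /tmp/wc-demo/input/a.txt
printf 'hadoop world\n'              > /tmp/wc-demo/input/b.txt
# Split into words, sort, and count duplicates -- the map/shuffle/reduce idea
cat /tmp/wc-demo/input/* | tr -s ' ' '\n' | sort | uniq -c | sort -rn \
  > /tmp/wc-demo/output.txt
cat /tmp/wc-demo/output.txt
```

The output lines here have the shape `count word`, whereas Hadoop's WordCount emits `word<TAB>count`, but the counting is the same idea.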
VI. Configuration for pseudo-distributed mode
Three files need to be set up, all under /usr/local/hadoop/conf: core-site.xml, hdfs-site.xml, and mapred-site.xml.
core-site.xml: configuration for Hadoop Core, such as I/O settings shared by HDFS and MapReduce.
hdfs-site.xml: configuration for the HDFS daemons: the namenode, the secondary namenode, and the datanodes.
mapred-site.xml: configuration for the MapReduce daemons: the jobtracker and the tasktrackers.
1. Edit the three files:
1). core-site.xml:
- <configuration>
- <property>
- <name>fs.default.name</name>
- <value>hdfs://localhost:9000</value>
- </property>
- <property>
- <name>hadoop.tmp.dir</name>
- <value>/usr/local/hadoop/tmp</value>
- </property>
- </configuration>
2). hdfs-site.xml:
- <configuration>
- <property>
- <name>dfs.replication</name>
- <value>2</value>
- </property>
- <property>
- <name>dfs.name.dir</name>
- <value>/usr/local/hadoop/datalog1,/usr/local/hadoop/datalog2</value>
- </property>
- <property>
- <name>dfs.data.dir</name>
- <value>/usr/local/hadoop/data1,/usr/local/hadoop/data2</value>
- </property>
- </configuration>
3). mapred-site.xml:
- <configuration>
- <property>
- <name>mapred.job.tracker</name>
- <value>localhost:9001</value>
- </property>
- </configuration>
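The directories named in these files (hadoop.tmp.dir, dfs.name.dir, dfs.data.dir) end up owned by whichever user first creates them, and ownership or permission mismatches there are a common cause of startup failures; creating them up front as the hadoop user avoids that. A sketch, shown under a /tmp stand-in prefix so it can be replayed anywhere, while on the real machine the prefix would be /usr/local/hadoop:

```sh
# Stand-in for /usr/local/hadoop (assumption for this demo)
PREFIX=/tmp/hadoop-demo
# hadoop.tmp.dir, dfs.name.dir (two copies), dfs.data.dir (two copies)
mkdir -p "$PREFIX/tmp" \
         "$PREFIX/datalog1" "$PREFIX/datalog2" \
         "$PREFIX/data1"    "$PREFIX/data2"
ls "$PREFIX"
```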
2. Before starting the Hadoop services, format the HDFS namenode:
- hadoop@derekUbun:/usr/local/hadoop$ source /usr/local/hadoop/conf/hadoop-env.sh
- hadoop@derekUbun:/usr/local/hadoop$ hadoop namenode -format
Seeing messages like the following means the HDFS filesystem was formatted successfully:
- 13/03/11 23:08:01 INFO common.Storage: Storage directory /usr/local/hadoop/datalog2 has been successfully formatted.
- 13/03/11 23:08:01 INFO namenode.NameNode: SHUTDOWN_MSG:
- /************************************************************
- SHUTDOWN_MSG: Shutting down NameNode at derekUbun/127.0.1.1
- ************************************************************/
3. Start Hadoop
Next, run start-all.sh to launch all the services, including the namenode and datanode; the start-all.sh script loads all the daemons.
- hadoop@derekUbun:/usr/local/hadoop$ cd bin
- hadoop@derekUbun:/usr/local/hadoop/bin$ start-all.sh
- starting namenode, logging to /usr/local/hadoop/libexec/../logs/hadoop-hadoop-namenode-derekUbun.out
- localhost: starting datanode, logging to /usr/local/hadoop/libexec/../logs/hadoop-hadoop-datanode-derekUbun.out
- localhost: starting secondarynamenode, logging to /usr/local/hadoop/libexec/../logs/hadoop-hadoop-secondarynamenode-derekUbun.out
- starting jobtracker, logging to /usr/local/hadoop/libexec/../logs/hadoop-hadoop-jobtracker-derekUbun.out
- localhost: starting tasktracker, logging to /usr/local/hadoop/libexec/../logs/hadoop-hadoop-tasktracker-derekUbun.out
- hadoop@derekUbun:/usr/local/hadoop/bin$
Use Java's jps command to list all daemons and verify the installation; a listing like the following means success:
- hadoop@derekUbun:/usr/local/hadoop$ jps
- 8431 JobTracker
- 8684 TaskTracker
- 7821 NameNode
- 8915 Jps
- 8341 SecondaryNameNode
- hadoop@derekUbun:/usr/local/hadoop$
4. Check the running status
All settings are done and Hadoop is started. You can now verify that the services are running normally through Hadoop's built-in web interfaces for monitoring cluster health:
http://localhost:50030/ - Hadoop JobTracker (administration) interface
http://localhost:50060/ - Hadoop TaskTracker status
http://localhost:50070/ - Hadoop DFS (NameNode) status
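A quick way to check the same thing from the terminal is to probe the three ports; this assumes the daemons from start-all.sh are up, and on a machine without Hadoop running it simply reports every port as down:

```sh
# Probe each Hadoop web UI port; --max-time keeps the check fast
for port in 50030 50060 50070; do
  if curl -s -o /dev/null --max-time 2 "http://localhost:$port/"; then
    echo "port $port: up"
  else
    echo "port $port: down"
  fi
done
```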
At this point, pseudo-distributed Hadoop is installed successfully. So let's run the bundled WordCount example again, this time in pseudo-distributed mode, to experience MapReduce once more.
Note that the program now runs on the distributed filesystem (HDFS), and the files it creates live there too.
First create the input directory in HDFS:
- hadoop@derekUbun:/usr/local/hadoop$ hadoop dfs -mkdir input
Copy the files from conf into input on HDFS:
- hadoop@derekUbun:/usr/local/hadoop$ hadoop dfs -copyFromLocal conf/* input
(Note: you can view and delete files on HDFS with hadoop dfs -ls and hadoop dfs -rmr.)
Run WordCount in pseudo-distributed mode:
- hadoop@derekUbun:/usr/local/hadoop$ hadoop jar hadoop-examples-1.1.2.jar wordcount input output
- 13/03/12 09:26:05 INFO input.FileInputFormat: Total input paths to process : 16
- 13/03/12 09:26:05 INFO util.NativeCodeLoader: Loaded the native-hadoop library
- 13/03/12 09:26:05 WARN snappy.LoadSnappy: Snappy native library not loaded
- 13/03/12 09:26:05 INFO mapred.JobClient: Running job: job_201303120920_0001
- 13/03/12 09:26:06 INFO mapred.JobClient: map 0% reduce 0%
- 13/03/12 09:26:10 INFO mapred.JobClient: map 12% reduce 0%
- 13/03/12 09:26:13 INFO mapred.JobClient: map 25% reduce 0%
- 13/03/12 09:26:15 INFO mapred.JobClient: map 37% reduce 0%
- 13/03/12 09:26:17 INFO mapred.JobClient: map 50% reduce 0%
- 13/03/12 09:26:18 INFO mapred.JobClient: map 62% reduce 0%
- 13/03/12 09:26:19 INFO mapred.JobClient: map 62% reduce 16%
- 13/03/12 09:26:20 INFO mapred.JobClient: map 75% reduce 16%
- 13/03/12 09:26:22 INFO mapred.JobClient: map 87% reduce 16%
- 13/03/12 09:26:24 INFO mapred.JobClient: map 100% reduce 16%
- 13/03/12 09:26:28 INFO mapred.JobClient: map 100% reduce 29%
- 13/03/12 09:26:30 INFO mapred.JobClient: map 100% reduce 100%
- 13/03/12 09:26:30 INFO mapred.JobClient: Job complete: job_201303120920_0001
- 13/03/12 09:26:30 INFO mapred.JobClient: Counters: 29
- 13/03/12 09:26:30 INFO mapred.JobClient: Job Counters
- 13/03/12 09:26:30 INFO mapred.JobClient: Launched reduce tasks=1
- 13/03/12 09:26:30 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=29912
- 13/03/12 09:26:30 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
- 13/03/12 09:26:30 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
- 13/03/12 09:26:30 INFO mapred.JobClient: Launched map tasks=16
- 13/03/12 09:26:30 INFO mapred.JobClient: Data-local map tasks=16
- 13/03/12 09:26:30 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=19608
- 13/03/12 09:26:30 INFO mapred.JobClient: File Output Format Counters
- 13/03/12 09:26:30 INFO mapred.JobClient: Bytes Written=15836
- 13/03/12 09:26:30 INFO mapred.JobClient: FileSystemCounters
- 13/03/12 09:26:30 INFO mapred.JobClient: FILE_BYTES_READ=23161
- 13/03/12 09:26:30 INFO mapred.JobClient: HDFS_BYTES_READ=29346
- 13/03/12 09:26:30 INFO mapred.JobClient: FILE_BYTES_WRITTEN=944157
- 13/03/12 09:26:30 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=15836
- 13/03/12 09:26:30 INFO mapred.JobClient: File Input Format Counters
- 13/03/12 09:26:30 INFO mapred.JobClient: Bytes Read=27400
- 13/03/12 09:26:30 INFO mapred.JobClient: Map-Reduce Framework
- 13/03/12 09:26:30 INFO mapred.JobClient: Map output materialized bytes=23251
- 13/03/12 09:26:30 INFO mapred.JobClient: Map input records=778
- 13/03/12 09:26:30 INFO mapred.JobClient: Reduce shuffle bytes=23251
- 13/03/12 09:26:30 INFO mapred.JobClient: Spilled Records=2220
- 13/03/12 09:26:30 INFO mapred.JobClient: Map output bytes=36314
- 13/03/12 09:26:30 INFO mapred.JobClient: Total committed heap usage (bytes)=2736914432
- 13/03/12 09:26:30 INFO mapred.JobClient: CPU time spent (ms)=6550
- 13/03/12 09:26:30 INFO mapred.JobClient: Combine input records=2615
- 13/03/12 09:26:30 INFO mapred.JobClient: SPLIT_RAW_BYTES=1946
- 13/03/12 09:26:30 INFO mapred.JobClient: Reduce input records=1110
- 13/03/12 09:26:30 INFO mapred.JobClient: Reduce input groups=804
- 13/03/12 09:26:30 INFO mapred.JobClient: Combine output records=1110
- 13/03/12 09:26:30 INFO mapred.JobClient: Physical memory (bytes) snapshot=2738036736
- 13/03/12 09:26:30 INFO mapred.JobClient: Reduce output records=804
- 13/03/12 09:26:30 INFO mapred.JobClient: Virtual memory (bytes) snapshot=6773346304
- 13/03/12 09:26:30 INFO mapred.JobClient: Map output records=2615
- hadoop@derekUbun:/usr/local/hadoop$
Display the output:
- hadoop@derekUbun:/usr/local/hadoop$ hadoop dfs -cat output/*
When you are done with Hadoop, shut down its daemons with the stop-all.sh script:
- hadoop@derekUbun:/usr/local/hadoop$ bin/stop-all.sh
Now start your Hadoop journey and implement some algorithms!
Notes:
1. In pseudo-distributed mode, you can view the contents of input with hadoop dfs -ls
2. In pseudo-distributed mode, you can delete input with hadoop dfs -rmr
3. In pseudo-distributed mode, both input and output live in the Hadoop distributed filesystem
Source: http://blog.csdn.net/zhaoyl03/article/details/8657104