Hadoop 2.8.5 HA Cluster Installation Notes

1 Purpose

This document records how to install and configure a highly available (HA) Hadoop cluster.

2 Preparation

wget -O /mnt/softs/hadoop-2.8.5.tar.gz http://mirror.bit.edu.cn/apache/hadoop/common/hadoop-2.8.5/hadoop-2.8.5.tar.gz

How It Works

Yarn monitors and manages server resources for distributed computation on Hadoop. If only Hadoop's storage services are used, the Yarn-related services do not need to be installed or started.

ZooKeeper is a distributed coordination framework; a ZooKeeper ensemble is used to keep the Hadoop cluster highly available. As a coordination service, ZooKeeper helps the ZKFC elect the active NameNode.

An HA Hadoop cluster has two NameNodes: one in the Active state, which serves clients, and one in the Standby state, which does not serve clients but continuously synchronizes the Active NameNode's data. The ZKFC next to each NameNode keeps a session with the ZooKeeper ensemble and regularly reports its NameNode's health. Once ZooKeeper can no longer see the Active NameNode's session, the Active NameNode is judged unavailable and the Standby NameNode is switched to Active, so that when one NameNode fails the other can still serve the cluster. This is how NameNode high availability is achieved.

DataNodes store the data blocks of each file and periodically report to the NameNode which blocks they hold.

The Standby NameNode synchronizes the Active NameNode's data through Hadoop's JournalNode service.

Yarn HA: before Hadoop 2.4, the ResourceManager was a single point of failure. The ResourceManager tracks the cluster's resource allocation and the state of running jobs; Yarn HA achieves high availability by keeping this information in shared storage, and it uses ZooKeeper for automatic failover.

ZKFC is a service deployed one-to-one with the NameNodes, i.e. every NameNode needs its own ZKFC. It monitors the state of its NameNode and promptly writes that state into ZooKeeper. The ZKFCs also take part in electing which NameNode becomes Active.


2.1 Yarn Components

Yarn is the resource management system of a Hadoop cluster. Hadoop 2.0 completely redesigned the MapReduce framework. The basic idea is to split the JobTracker of MRv1 into two independent services: a global resource manager, the ResourceManager, and a per-application ApplicationMaster.

The ResourceManager is responsible for resource management and allocation across the whole system, while an ApplicationMaster manages a single application.

Basic structure of Yarn

  • Yarn follows a Master/Slave structure overall: the ResourceManager is the master and the NodeManagers are the slaves.
  • The ResourceManager manages and schedules the resources of all NodeManagers. When a user submits an application, an ApplicationMaster is provided to track and manage that application; it requests resources from the ResourceManager and asks NodeManagers to start tasks that occupy a certain amount of resources. Because different ApplicationMasters are distributed across different nodes, they do not interfere with each other.

Yarn consists of the following main components:

1. ResourceManager: the global resource manager, responsible for resource management and allocation across the whole system. It consists of two components: the Scheduler and the Applications Manager.

2. NodeManager: the resource and task manager on each node. On the one hand, it periodically reports the node's resource usage and the state of each Container to the ResourceManager; on the other hand, it receives and handles Container start/stop and other requests from ApplicationMasters.

3. ApplicationMaster: every submitted application has one ApplicationMaster. Its main responsibilities are:
  1) negotiate with the ResourceManager Scheduler to obtain resources (Containers);
  2) further assign the obtained resources to its internal tasks;
  3) communicate with NodeManagers to start or stop tasks;
  4) monitor the state of all tasks and, when a task fails, request resources again to restart it.

4. Container: the resource abstraction unit in Yarn. It encapsulates a node's memory, CPU, disk, network and other resources; the ResourceManager returns the resources requested by an ApplicationMaster in units of Containers. Yarn allocates one Container per task, and the task may only use the resources described by that Container. A Container is a dynamic resource unit created according to an application's needs; at this point Yarn only supports CPU and memory, and uses the lightweight Cgroups mechanism for resource isolation.

The two components of the ResourceManager:

1) Scheduler

The Scheduler allocates the system's resources to the running applications according to capacity, queue and other constraints (for example, each queue is assigned a certain amount of resources and may run at most a certain number of jobs).

Note that the Scheduler is a "pure scheduler": it does no application-specific work. It does not monitor or track application execution, and it does not restart tasks that fail because of application errors or hardware faults; all of that is left to the application's ApplicationMaster. The Scheduler allocates resources purely according to the applications' resource requests, using an abstract unit called a Resource Container (Container for short). A Container is a dynamic resource allocation unit that bundles memory, CPU, disk, network and other resources, thereby limiting how much each task can use. The Scheduler is also a pluggable component: users can implement their own, and YARN ships several ready-to-use schedulers such as the Fair Scheduler and the Capacity Scheduler.
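
As a hedged illustration (not part of this cluster's configuration), the scheduler implementation is selected in yarn-site.xml through yarn.resourcemanager.scheduler.class; the Capacity Scheduler is the default in Hadoop 2.x:

<property>
  <name>yarn.resourcemanager.scheduler.class</name>
  <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler</value>
  <description>Scheduler implementation; switch to org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler for the Fair Scheduler</description>
</property>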

2) Applications Manager

The Applications Manager manages all applications in the system: it handles application submission, negotiates with the Scheduler for the resources needed to start each ApplicationMaster, monitors the ApplicationMaster's state and restarts it on failure, and so on.
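
As a small aside (a sketch, usable only once the cluster built later in these notes is running), the applications tracked by the Applications Manager can be listed from the command line:

$HADOOP_PREFIX/bin/yarn application -list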

3 Installation

In a typical deployment, ZooKeeper is configured with 3 to 5 nodes. Because ZooKeeper has modest hardware requirements, the ZooKeeper nodes can be co-located with the HDFS NameNode and Standby NameNode, and many users also run the third ZooKeeper process on the same node as the Yarn ResourceManager. For the best performance and data isolation, it is recommended to keep the ZooKeeper data on a different disk from the HDFS metadata.

Since the JournalNode is a lightweight daemon, it can share machines with other Hadoop services. It is recommended to run the JournalNodes on the control (master) nodes, so that heavy data transfers on the data nodes do not cause JournalNode write failures.

Because the NameNode keeps a reference to every block of every file in memory, it can easily exhaust the memory allocated to it. 1000 MB of heap (the default) is usually enough to manage a few million files, but as a conservative rule of thumb, plan roughly 1000 MB of memory for every one million blocks.
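
If the NameNode heap ever needs to be raised beyond the default, the usual place is HADOOP_NAMENODE_OPTS in hadoop-env.sh; a minimal sketch, where the 4 GB value is only an assumed example, not a sizing recommendation for this cluster:

export HADOOP_NAMENODE_OPTS="-Xmx4096m ${HADOOP_NAMENODE_OPTS}"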

3.1 Cluster Planning

IP                  192.168.3.198        192.168.3.199        192.168.3.200
Hostname            im-test-hadoop01     im-test-hadoop02     im-test-hadoop03
Installed software  Hadoop, ZooKeeper    Hadoop, ZooKeeper    Hadoop, ZooKeeper
Running processes   NameNode (active)    NameNode (standby)   --
                    DataNode             DataNode             DataNode
                    JournalNode          JournalNode          JournalNode
                    --                   ResourceManager      ResourceManager
                    NodeManager          NodeManager          NodeManager
                    ZKFC                 ZKFC                 --
                    QuorumPeerMain       QuorumPeerMain       QuorumPeerMain
                    (ZooKeeper)          (ZooKeeper)          (ZooKeeper)

JournalNode is the storage daemon of QJM (Quorum Journal Manager); it provides edit-log read/write, storage and recovery services.
An odd number of JournalNodes is deployed to synchronize metadata (edit logs) between the active and standby NameNodes.
The basic idea of QJM is to store the edit log on 2N+1 JournalNodes; a write is considered successful once a majority (>= N+1) of them acknowledge it, so the data is not lost. The algorithm therefore tolerates at most N failed machines; if more than N fail, it no longer works. For example, with 3 JournalNodes (N = 1) a write succeeds once at least 2 of them acknowledge it, and the cluster tolerates the loss of 1 JournalNode.

3.2 Installation Steps

Extract the Hadoop archive

tar zxf /mnt/softs/hadoop-2.8.5.tar.gz -C /mnt/softwares/

Configure core-site.xml

<property>
  <name>fs.defaultFS</name>
  <value>hdfs://cluster1</value>
  <description>Default path prefix used by Hadoop FS clients: hdfs://nameservice-ID</description>
</property>

<property>
  <name>hadoop.tmp.dir</name>
  <value>/mnt/hadoop/tmp</value>
  <description>Hadoop temporary directory. By default the NameNode and DataNode data files live in subdirectories of this directory, and Hadoop MapReduce working data is also placed here. The default is the system temp directory, which loses data on reboot, so a custom directory is recommended. The concrete data directories can be overridden in hdfs-site.xml (as done below).</description>
</property>

<property>
  <name>io.file.buffer.size</name>
  <value>131072</value>
  <description>File I/O buffer size in bytes. Default 4096; usually set to a multiple of the system page size.</description>
</property>

<property>
  <name>fs.trash.interval</name>
  <value>1440</value>
  <description>Default 0, which disables the trash feature. A value greater than 0 is the number of minutes a deleted file is kept in the trash.</description>
</property>

<property>
  <name>fs.trash.checkpoint.interval</name>
  <value>1440</value>
  <description>Interval in minutes between trash checkpoints; should be less than or equal to fs.trash.interval. Default 0, in which case the value of fs.trash.interval is used.</description>
</property>

<property>
  <name>ha.zookeeper.quorum</name>
  <value>im-test-hadoop01:2181,im-test-hadoop02:2181,im-test-hadoop03:2181</value>
  <description>List of ZooKeeper server addresses, used by the ZKFC for automatic failover.</description>
</property>

Configure hdfs-site.xml

<property>
  <name>dfs.nameservices</name>
  <value>cluster1</value>
  <description>The HDFS nameservice ID; must match the one used in core-site.xml. Separate multiple nameservices with commas.</description>
</property>

<property>
  <name>dfs.ha.namenodes.cluster1</name>
  <value>nn1,nn2</value>
  <description>The NameNode IDs under this nameservice.</description>
</property>

<property>
  <name>dfs.namenode.rpc-address.cluster1.nn1</name>
  <value>im-test-hadoop01:9000</value>
  <description>RPC address of nn1, used to communicate with DataNodes and clients.</description>
</property>

<property>
  <name>dfs.namenode.http-address.cluster1.nn1</name>
  <value>im-test-hadoop01:50070</value>
  <description>HTTP address of nn1, i.e. the HDFS NameNode web UI address.</description>
</property>

<property>
  <name>dfs.namenode.rpc-address.cluster1.nn2</name>
  <value>im-test-hadoop02:9000</value>
  <description>RPC address of nn2, used to communicate with DataNodes and clients.</description>
</property>

<property>
  <name>dfs.namenode.http-address.cluster1.nn2</name>
  <value>im-test-hadoop02:50070</value>
  <description>HTTP address of nn2, i.e. the HDFS NameNode web UI address.</description>
</property>

<property>
  <name>dfs.namenode.shared.edits.dir</name>
  <value>qjournal://im-test-hadoop01:8485;im-test-hadoop02:8485;im-test-hadoop03:8485/cluster1</value>
  <description>URI of the JournalNodes through which the NameNodes share edit logs: the Active NameNode writes and the Standby NameNode reads. cluster1 is the journal ID; keeping it identical to the nameservice ID is recommended.</description>
</property>

<property>
  <name>dfs.namenode.name.dir</name>
  <value>/mnt/hadoop/dfs/name</value>
  <description>Local directory where the NameNode stores its metadata (fsimage). Default: file://${hadoop.tmp.dir}/dfs/name</description>
</property>

<property>
  <name>dfs.datanode.data.dir</name>
  <value>/mnt/hadoop/dfs/data</value>
  <description>Local directory where the DataNode stores its data (HDFS blocks). Default: file://${hadoop.tmp.dir}/dfs/data</description>
</property>

<property>
  <name>dfs.journalnode.edits.dir</name>
  <value>/mnt/hadoop/journal/data</value>
  <description>Directory where the JournalNode stores edits and other state.</description>
</property>

<property>
  <name>dfs.namenode.handler.count</name>
  <value>100</value>
  <description>Number of NameNode RPC server threads handling requests from DataNodes and clients. Default 10.</description>
</property>

<property>
  <name>dfs.blocksize</name>
  <value>134217728</value>
  <description>Default block size for new files. Default 134217728 (128 MB).</description>
</property>

<property>
  <name>dfs.ha.automatic-failover.enabled</name>
  <value>true</value>
  <description>Whether to enable automatic NameNode failover. Default false.</description>
</property>

<property>
  <name>dfs.client.failover.proxy.provider.cluster1</name>
  <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
  <description>Java class that HDFS clients use to determine which NameNode is currently Active. Two implementations ship with Hadoop (the other is RequestHedgingProxyProvider), and a custom implementation can also be used.</description>
</property>

<property>
  <name>dfs.ha.fencing.methods</name>
  <value>sshfence</value>
  <description>List of scripts or Java classes used to fence the Active NameNode during a failover.</description>
</property>

<property>
  <name>dfs.ha.fencing.ssh.private-key-files</name>
  <value>/home/hadoop/.ssh/id_rsa</value>
  <description>Path to the SSH private key for passwordless login; only needed when the sshfence fencing method is used.</description>
</property>

<property>
  <name>dfs.replication</name>
  <value>3</value>
  <description>HDFS replication factor.</description>
</property>

Although the JournalNodes ensure that only one Active NameNode can write edits, which is important for keeping the edit log consistent, the previously Active NameNode may still be alive during a failover, and clients still connected to it could be served stale data. With this setting we can specify a shell script or Java program that SSHes to the old Active NameNode and kills the NameNode process. There are two built-in options (a hedged configuration sketch follows the list):

  • sshfence: SSH to the Active NameNode and kill the process. The local machine must be able to SSH to the remote host, which requires the key (rsa) to be authorized in advance.
  • shell: run a shell command to fence the Active NameNode.
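
A minimal sketch of the shell method (not the configuration used in this cluster): relying on QJM's single-writer guarantee, some deployments configure fencing to always report success:

<property>
  <name>dfs.ha.fencing.methods</name>
  <value>shell(/bin/true)</value>
  <description>Fencing that always succeeds; assumes QJM alone is trusted to prevent a split-brain write</description>
</property>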

Configure mapred-site.xml
Copy mapred-site.xml.template to mapred-site.xml and edit the configuration:

<property>
  <name>mapreduce.framework.name</name>
  <value>yarn</value>
  <description>Runtime framework for executing MapReduce jobs. Default local; can be local, classic or yarn.</description>
</property>

<property>
  <name>mapreduce.jobhistory.address</name>
  <value>im-test-hadoop:10020</value>
  <description>MapReduce JobHistory server address.</description>
</property>

<property>
  <name>mapreduce.jobhistory.webapp.address</name>
  <value>im-test-hadoop:19888</value>
  <description>MapReduce JobHistory web UI address.</description>
</property>

About the history service (JobHistory)
With the history service enabled, the details of jobs executed on Yarn can be viewed in a web page. The history server shows the records of completed MapReduce jobs, such as how many Maps and Reduces were used, the job submission time, the job start time, the job completion time, and so on.
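
The JobHistory server is not started by start-dfs.sh or start-yarn.sh; if it is wanted, it can be started on the host configured above with the standard Hadoop 2.x script (a sketch):

$HADOOP_PREFIX/sbin/mr-jobhistory-daemon.sh start historyserver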

Configure yarn-site.xml

Minimal configuration example for ResourceManager HA

<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle</value>
  <description>Auxiliary service that runs on the NodeManager. Must be set to mapreduce_shuffle for MapReduce programs to run.</description>
</property>

<property>
  <name>yarn.resourcemanager.ha.enabled</name>
  <value>true</value>
  <description>Whether to enable ResourceManager HA. Default false.</description>
</property>

<property>
  <name>yarn.resourcemanager.cluster-id</name>
  <value>cluster1</value>
  <description>Name of the ResourceManager cluster.</description>
</property>
<property>
  <name>yarn.resourcemanager.ha.rm-ids</name>
  <value>rm1,rm2</value>
</property>
<property>
  <name>yarn.resourcemanager.hostname.rm1</name>
  <value>im-test-hadoop02</value>
</property>
<property>
  <name>yarn.resourcemanager.hostname.rm2</name>
  <value>im-test-hadoop03</value>
</property>
<property>
  <name>yarn.resourcemanager.webapp.address.rm1</name>
  <value>im-test-hadoop02:8088</value>
</property>
<property>
  <name>yarn.resourcemanager.webapp.address.rm2</name>
  <value>im-test-hadoop03:8088</value>
</property>
<property>
  <name>yarn.resourcemanager.zk-address</name>
  <value>im-test-hadoop01:2181,im-test-hadoop02:2181,im-test-hadoop03:2181</value>
</property>

<property>
  <name>yarn.log-aggregation-enable</name>
  <value>true</value>
  <description>Whether to enable log aggregation. Default false. After an application finishes, log aggregation collects the logs of every Container onto HDFS.</description>
</property>

<property>
  <name>yarn.log-aggregation.retain-seconds</name>
  <value>25200</value>
  <description>Maximum time in seconds to keep aggregated logs (25200 seconds = 7 hours).</description>
</property>
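
With log aggregation enabled, the aggregated container logs of a finished application can be pulled from HDFS with the yarn logs command; a sketch, where the application ID is a hypothetical placeholder:

/mnt/softwares/hadoop-2.8.5/bin/yarn logs -applicationId application_1541050000000_0001   # hypothetical application ID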

Configure slaves
Create/edit $HADOOP_HOME/etc/hadoop/slaves; the slaves file lists the DataNode nodes of the cluster:

im-test-hadoop01
im-test-hadoop02
im-test-hadoop03

If the NameNode needs to be reformatted, all files under the old NameNode and DataNode directories must be deleted first, otherwise errors will occur. These directories are configured by hadoop.tmp.dir in core-site.xml and by dfs.namenode.name.dir and dfs.datanode.data.dir in hdfs-site.xml. Each format operation by default creates a new cluster ID and writes it into the VERSION files of the NameNode and DataNodes (under dfs/name/current and dfs/data/current). If the old directories are not removed, the NameNode's VERSION file will hold the new cluster ID while the DataNodes still carry the old one, and the mismatch causes errors.

Another approach is to pass the old cluster ID as a parameter when formatting.
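
A hedged sketch of that approach (the cluster ID below is only a placeholder; use the ID recorded in the existing VERSION files):

/mnt/softwares/hadoop-2.8.5/bin/hdfs namenode -format -clusterId CID-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx   # placeholder cluster ID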

Distribute Hadoop to the other servers in the cluster

scp -rp /mnt/softwares/hadoop-2.8.5 im-test-hadoop02:/mnt/softwares/
scp -rp /mnt/softwares/hadoop-2.8.5 im-test-hadoop03:/mnt/softwares/

3.3 Startup

Step 1: start the ZooKeeper service
See the ZooKeeper installation notes.

Check the processes:

[hadoop@im-test-hadoop01 ~]$ jps
9024 Jps
21699 QuorumPeerMain
[hadoop@im-test-hadoop02 ~]$ jps
4866 Jps
16348 QuorumPeerMain
[hadoop@im-test-hadoop03 ~]$ jps
12835 Jps
24584 QuorumPeerMain

Step 2: create a znode in ZooKeeper to store the NameNode HA data

First check the existing znodes in ZooKeeper:

$ZOOKEEPER_HOME/bin/zkCli.sh
[zk: localhost:2181(CONNECTED) 0] ls /
[zookeeper]

There is only the zookeeper node. Run the following command:

$HADOOP_PREFIX/bin/hdfs zkfc -formatZK

Partial execution log:

18/11/01 11:22:57 INFO zookeeper.ZooKeeper: Initiating client connection, connectString=im-test-hadoop01:2181,im-test-hadoop02:2181,im-test-hadoop03:2181 sessionTimeout=10000 watcher=org.apache.hadoop.ha.ActiveStandbyElector$WatcherWithClientRef@6a78afa0
18/11/01 11:22:57 INFO zookeeper.ClientCnxn: Opening socket connection to server im-test-hadoop02/192.168.3.199:2181. Will not attempt to authenticate using SASL (unknown error)
18/11/01 11:22:57 INFO zookeeper.ClientCnxn: Socket connection established to im-test-hadoop02/192.168.3.199:2181, initiating session
18/11/01 11:22:57 INFO zookeeper.ClientCnxn: Session establishment complete on server im-test-hadoop02/192.168.3.199:2181, sessionid = 0x200ba6364c30002, negotiated timeout = 10000
18/11/01 11:22:57 INFO ha.ActiveStandbyElector: Session connected.
18/11/01 11:22:57 INFO ha.ActiveStandbyElector: Successfully created /hadoop-ha/cluster1 in ZK.
18/11/01 11:22:57 INFO zookeeper.ZooKeeper: Session: 0x200ba6364c30002 closed
18/11/01 11:22:57 INFO zookeeper.ClientCnxn: EventThread shut down
18/11/01 11:22:57 INFO tools.DFSZKFailoverController: SHUTDOWN_MSG: 
/************************************************************
SHUTDOWN_MSG: Shutting down DFSZKFailoverController at im-test-hadoop01/192.168.3.198
************************************************************/

The command above creates a znode in ZooKeeper that stores the data of the automatic failover system.
Log in to ZooKeeper again on one of the ZooKeeper servers and check:

[zk: localhost:2181(CONNECTED) 1] ls /
[zookeeper, hadoop-ha]

There is now an additional hadoop-ha node, which means the command succeeded.
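
The children of that node can be listed as well; assuming the cluster1 nameservice configured above, the expected output is:

[zk: localhost:2181(CONNECTED) 2] ls /hadoop-ha
[cluster1]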

If the Hadoop configuration files on the other nodes are out of sync, copy them over:
scp -pqr /mnt/softwares/hadoop-2.8.5/etc/hadoop im-test-hadoop02:/mnt/softwares/hadoop-2.8.5/etc/
scp -pqr /mnt/softwares/hadoop-2.8.5/etc/hadoop im-test-hadoop03:/mnt/softwares/hadoop-2.8.5/etc/

Step 3: start the HDFS cluster services

Start all JournalNode processes:

/mnt/softwares/hadoop-2.8.5/sbin/hadoop-daemon.sh start journalnode

Result after starting:

[hadoop@im-test-hadoop01 ~]$ jps
21699 QuorumPeerMain
12633 Jps
12559 JournalNode
[hadoop@im-test-hadoop02 ~]$ jps
7879 Jps
16348 QuorumPeerMain
7805 JournalNode
[hadoop@im-test-hadoop03 ~]$ jps
15973 Jps
24584 QuorumPeerMain
15899 JournalNode

Format the Active NameNode:

[hadoop@im-test-hadoop01 ~]$ /mnt/softwares/hadoop-2.8.5/bin/hdfs namenode -format
18/11/01 14:30:27 INFO namenode.NameNode: STARTUP_MSG: 
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG:   user = hadoop
STARTUP_MSG:   host = im-test-hadoop01/192.168.3.198
STARTUP_MSG:   args = [-format]
STARTUP_MSG:   version = 2.8.5
STARTUP_MSG:   classpath = /mnt/softwares/hadoop-2.8.5/etc/hadoop ....   // content of this line omitted
STARTUP_MSG:   build = https://git-wip-us.apache.org/repos/asf/hadoop.git -r 0b8464d75227fcee2c6e7f2410377b3d53d3d5f8; compiled by 'jdu' on 2018-09-10T03:32Z
STARTUP_MSG:   java = 1.8.0_181
************************************************************/
18/11/01 14:30:27 INFO namenode.NameNode: registered UNIX signal handlers for [TERM, HUP, INT]
18/11/01 14:30:27 INFO namenode.NameNode: createNameNode [-format]
18/11/01 14:30:28 WARN common.Util: Path /mnt/hadoop/dfs/name should be specified as a URI in configuration files. Please update hdfs configuration.
18/11/01 14:30:28 WARN common.Util: Path /mnt/hadoop/dfs/name should be specified as a URI in configuration files. Please update hdfs configuration.
Formatting using clusterid: CID-fd4c6a4f-ce09-4cb3-b064-c70a2212a2bb
18/11/01 14:30:28 INFO namenode.FSEditLog: Edit logging is async:true
18/11/01 14:30:28 INFO namenode.FSNamesystem: KeyProvider: null
18/11/01 14:30:28 INFO namenode.FSNamesystem: fsLock is fair: true
18/11/01 14:30:28 INFO namenode.FSNamesystem: Detailed lock hold time metrics enabled: false
18/11/01 14:30:28 INFO blockmanagement.DatanodeManager: dfs.block.invalidate.limit=1000
18/11/01 14:30:28 INFO blockmanagement.DatanodeManager: dfs.namenode.datanode.registration.ip-hostname-check=true
18/11/01 14:30:28 INFO blockmanagement.BlockManager: dfs.namenode.startup.delay.block.deletion.sec is set to 000:00:00:00.000
18/11/01 14:30:28 INFO blockmanagement.BlockManager: The block deletion will start around 2018 十一月 01 14:30:28
18/11/01 14:30:28 INFO util.GSet: Computing capacity for map BlocksMap
18/11/01 14:30:28 INFO util.GSet: VM type       = 64-bit
18/11/01 14:30:28 INFO util.GSet: 2.0% max memory 889 MB = 17.8 MB
18/11/01 14:30:28 INFO util.GSet: capacity      = 2^21 = 2097152 entries
18/11/01 14:30:28 INFO blockmanagement.BlockManager: dfs.block.access.token.enable=false
18/11/01 14:30:28 INFO blockmanagement.BlockManager: defaultReplication         = 3
18/11/01 14:30:28 INFO blockmanagement.BlockManager: maxReplication             = 512
18/11/01 14:30:28 INFO blockmanagement.BlockManager: minReplication             = 1
18/11/01 14:30:28 INFO blockmanagement.BlockManager: maxReplicationStreams      = 2
18/11/01 14:30:28 INFO blockmanagement.BlockManager: replicationRecheckInterval = 3000
18/11/01 14:30:28 INFO blockmanagement.BlockManager: encryptDataTransfer        = false
18/11/01 14:30:28 INFO blockmanagement.BlockManager: maxNumBlocksToLog          = 1000
18/11/01 14:30:28 INFO namenode.FSNamesystem: fsOwner             = hadoop (auth:SIMPLE)
18/11/01 14:30:28 INFO namenode.FSNamesystem: supergroup          = supergroup
18/11/01 14:30:28 INFO namenode.FSNamesystem: isPermissionEnabled = true
18/11/01 14:30:28 INFO namenode.FSNamesystem: Determined nameservice ID: cluster1
18/11/01 14:30:28 INFO namenode.FSNamesystem: HA Enabled: true
18/11/01 14:30:28 INFO namenode.FSNamesystem: Append Enabled: true
18/11/01 14:30:28 INFO util.GSet: Computing capacity for map INodeMap
18/11/01 14:30:28 INFO util.GSet: VM type       = 64-bit
18/11/01 14:30:28 INFO util.GSet: 1.0% max memory 889 MB = 8.9 MB
18/11/01 14:30:28 INFO util.GSet: capacity      = 2^20 = 1048576 entries
18/11/01 14:30:28 INFO namenode.FSDirectory: ACLs enabled? false
18/11/01 14:30:28 INFO namenode.FSDirectory: XAttrs enabled? true
18/11/01 14:30:28 INFO namenode.NameNode: Caching file names occurring more than 10 times
18/11/01 14:30:28 INFO util.GSet: Computing capacity for map cachedBlocks
18/11/01 14:30:28 INFO util.GSet: VM type       = 64-bit
18/11/01 14:30:28 INFO util.GSet: 0.25% max memory 889 MB = 2.2 MB
18/11/01 14:30:28 INFO util.GSet: capacity      = 2^18 = 262144 entries
18/11/01 14:30:28 INFO namenode.FSNamesystem: dfs.namenode.safemode.threshold-pct = 0.9990000128746033
18/11/01 14:30:28 INFO namenode.FSNamesystem: dfs.namenode.safemode.min.datanodes = 0
18/11/01 14:30:28 INFO namenode.FSNamesystem: dfs.namenode.safemode.extension     = 30000
18/11/01 14:30:28 INFO metrics.TopMetrics: NNTop conf: dfs.namenode.top.window.num.buckets = 10
18/11/01 14:30:28 INFO metrics.TopMetrics: NNTop conf: dfs.namenode.top.num.users = 10
18/11/01 14:30:28 INFO metrics.TopMetrics: NNTop conf: dfs.namenode.top.windows.minutes = 1,5,25
18/11/01 14:30:28 INFO namenode.FSNamesystem: Retry cache on namenode is enabled
18/11/01 14:30:28 INFO namenode.FSNamesystem: Retry cache will use 0.03 of total heap and retry cache entry expiry time is 600000 millis
18/11/01 14:30:28 INFO util.GSet: Computing capacity for map NameNodeRetryCache
18/11/01 14:30:28 INFO util.GSet: VM type       = 64-bit
18/11/01 14:30:28 INFO util.GSet: 0.029999999329447746% max memory 889 MB = 273.1 KB
18/11/01 14:30:28 INFO util.GSet: capacity      = 2^15 = 32768 entries
18/11/01 14:30:29 INFO namenode.FSImage: Allocated new BlockPoolId: BP-206245647-192.168.3.198-1541053829034
18/11/01 14:30:29 INFO common.Storage: Storage directory /mnt/hadoop/dfs/name has been successfully formatted.
18/11/01 14:30:29 INFO namenode.FSImageFormatProtobuf: Saving image file /mnt/hadoop/dfs/name/current/fsimage.ckpt_0000000000000000000 using no compression
18/11/01 14:30:29 INFO namenode.FSImageFormatProtobuf: Image file /mnt/hadoop/dfs/name/current/fsimage.ckpt_0000000000000000000 of size 323 bytes saved in 0 seconds.
18/11/01 14:30:29 INFO namenode.NNStorageRetentionManager: Going to retain 1 images with txid >= 0
18/11/01 14:30:29 INFO util.ExitUtil: Exiting with status 0
18/11/01 14:30:29 INFO namenode.NameNode: SHUTDOWN_MSG: 
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at im-test-hadoop01/192.168.3.198
************************************************************/

Start the Active NameNode:

[hadoop@im-test-hadoop01 ~]$ /mnt/softwares/hadoop-2.8.5/sbin/hadoop-daemon.sh start namenode
starting namenode, logging to /mnt/softwares/hadoop-2.8.5/logs/hadoop-hadoop-namenode-im-test-hadoop01.out
[hadoop@im-test-hadoop01 ~]$ jps
21699 QuorumPeerMain
13060 Jps
12870 NameNode
12559 JournalNode

Synchronize (bootstrap) the Standby NameNode:

[hadoop@im-test-hadoop02 ~]$ /mnt/softwares/hadoop-2.8.5/bin/hdfs namenode -bootstrapStandby
18/11/01 14:38:44 INFO namenode.NameNode: STARTUP_MSG: 
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG:   user = hadoop
STARTUP_MSG:   host = im-test-hadoop02/192.168.3.199
STARTUP_MSG:   args = [-bootstrapStandby]
STARTUP_MSG:   version = 2.8.5
STARTUP_MSG:   classpath = /mnt/softwares/hadoop-2.8.5/etc/hadoop ..............   // content of this line omitted
STARTUP_MSG:   build = https://git-wip-us.apache.org/repos/asf/hadoop.git -r 0b8464d75227fcee2c6e7f2410377b3d53d3d5f8; compiled by 'jdu' on 2018-09-10T03:32Z
STARTUP_MSG:   java = 1.8.0_181
************************************************************/
18/11/01 14:38:44 INFO namenode.NameNode: registered UNIX signal handlers for [TERM, HUP, INT]
18/11/01 14:38:44 INFO namenode.NameNode: createNameNode [-bootstrapStandby]
18/11/01 14:38:45 WARN common.Util: Path /mnt/hadoop/dfs/name should be specified as a URI in configuration files. Please update hdfs configuration.
18/11/01 14:38:45 WARN common.Util: Path /mnt/hadoop/dfs/name should be specified as a URI in configuration files. Please update hdfs configuration.
=====================================================
About to bootstrap Standby ID nn2 from:
           Nameservice ID: cluster1
        Other Namenode ID: nn1
  Other NN's HTTP address: http://im-test-hadoop01:50070
  Other NN's IPC  address: im-test-hadoop01/192.168.3.198:9000
             Namespace ID: 448464711
            Block pool ID: BP-206245647-192.168.3.198-1541053829034
               Cluster ID: CID-fd4c6a4f-ce09-4cb3-b064-c70a2212a2bb
           Layout version: -63
       isUpgradeFinalized: true
=====================================================
18/11/01 14:38:45 INFO common.Storage: Storage directory /mnt/hadoop/dfs/name has been successfully formatted.
18/11/01 14:38:45 WARN common.Util: Path /mnt/hadoop/dfs/name should be specified as a URI in configuration files. Please update hdfs configuration.
18/11/01 14:38:45 WARN common.Util: Path /mnt/hadoop/dfs/name should be specified as a URI in configuration files. Please update hdfs configuration.
18/11/01 14:38:45 INFO namenode.FSEditLog: Edit logging is async:true
18/11/01 14:38:45 INFO namenode.TransferFsImage: Opening connection to http://im-test-hadoop01:50070/imagetransfer?getimage=1&txid=0&storageInfo=-63:448464711:1541053829034:CID-fd4c6a4f-ce09-4cb3-b064-c70a2212a2bb&bootstrapstandby=true
18/11/01 14:38:45 INFO namenode.TransferFsImage: Image Transfer timeout configured to 60000 milliseconds
18/11/01 14:38:45 INFO namenode.TransferFsImage: Transfer took 0.01s at 0.00 KB/s
18/11/01 14:38:45 INFO namenode.TransferFsImage: Downloaded file fsimage.ckpt_0000000000000000000 size 323 bytes.
18/11/01 14:38:46 INFO util.ExitUtil: Exiting with status 0
18/11/01 14:38:46 INFO namenode.NameNode: SHUTDOWN_MSG: 
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at im-test-hadoop02/192.168.3.199
************************************************************/

Start the Standby NameNode:

[hadoop@im-test-hadoop02 ~]$ /mnt/softwares/hadoop-2.8.5/sbin/hadoop-daemon.sh start namenode
starting namenode, logging to /mnt/softwares/hadoop-2.8.5/logs/hadoop-hadoop-namenode-im-test-hadoop02.out

At this point, the web pages of both NameNodes show the standby state:

http://192.168.3.198:50070
http://192.168.3.199:50070

Switch the first NameNode to the Active state:

[hadoop@im-test-hadoop01 ~]$ /mnt/softwares/hadoop-2.8.5/bin/hdfs haadmin -transitionToActive nn1
Automatic failover is enabled for NameNode at im-test-hadoop02/192.168.3.199:9000
Refusing to manually manage HA state, since it may cause
a split-brain scenario or other incorrect state.
If you are very sure you know what you are doing, please 
specify the --forcemanual flag.
[hadoop@im-test-hadoop01 ~]$ /mnt/softwares/hadoop-2.8.5/bin/hdfs haadmin -transitionToActive -forcemanual nn1
You have specified the --forcemanual flag. This flag is dangerous, as it can induce a split-brain scenario that WILL CORRUPT your HDFS namespace, possibly irrecoverably.

It is recommended not to use this flag, but instead to shut down the cluster and disable automatic failover if you prefer to manually manage your HA state.

You may abort safely by answering 'n' or hitting ^C now.

Are you sure you want to continue? (Y or N) Y
18/11/01 14:56:49 WARN ha.HAAdmin: Proceeding with manual HA state management even though
automatic failover is enabled for NameNode at im-test-hadoop02/192.168.3.199:9000
18/11/01 14:56:49 WARN ha.HAAdmin: Proceeding with manual HA state management even though
automatic failover is enabled for NameNode at im-test-hadoop01/192.168.3.198:9000

Check the web page again; the NameNode has switched to the Active state.

Stop the HDFS-related processes:

[hadoop@im-test-hadoop01 ~]$ jps
13936 Jps
21699 QuorumPeerMain
12870 NameNode
12559 JournalNode
[hadoop@im-test-hadoop01 ~]$ /mnt/softwares/hadoop-2.8.5/sbin/hadoop-daemon.sh stop namenode
stopping namenode
[hadoop@im-test-hadoop01 ~]$ /mnt/softwares/hadoop-2.8.5/sbin/hadoop-daemon.sh stop journalnode
stopping journalnode
[hadoop@im-test-hadoop01 ~]$ jps
14033 Jps
21699 QuorumPeerMain
[hadoop@im-test-hadoop01 ~]$
[hadoop@im-test-hadoop02 ~]$ jps
8202 NameNode
9098 Jps
16348 QuorumPeerMain
7805 JournalNode
[hadoop@im-test-hadoop02 ~]$ /mnt/softwares/hadoop-2.8.5/sbin/hadoop-daemon.sh stop namenode
stopping namenode
[hadoop@im-test-hadoop02 ~]$ /mnt/softwares/hadoop-2.8.5/sbin/hadoop-daemon.sh stop journalnode
stopping journalnode
[hadoop@im-test-hadoop02 ~]$ jps
9179 Jps
16348 QuorumPeerMain
[hadoop@im-test-hadoop02 ~]$
[hadoop@im-test-hadoop03 ~]$ jps
24584 QuorumPeerMain
16505 Jps
15899 JournalNode
[hadoop@im-test-hadoop03 ~]$ /mnt/softwares/hadoop-2.8.5/sbin/hadoop-daemon.sh stop journalnode
stopping journalnode
[hadoop@im-test-hadoop03 ~]$ jps
16567 Jps
24584 QuorumPeerMain
[hadoop@im-test-hadoop03 ~]$ 

Start the cluster by running sbin/start-dfs.sh on the server hosting the Active NameNode:

[hadoop@im-test-hadoop01 ~]$ /mnt/softwares/hadoop-2.8.5/sbin/start-dfs.sh
Starting namenodes on [im-test-hadoop01 im-test-hadoop02]
im-test-hadoop01: starting namenode, logging to /mnt/softwares/hadoop-2.8.5/logs/hadoop-hadoop-namenode-im-test-hadoop01.out
im-test-hadoop02: starting namenode, logging to /mnt/softwares/hadoop-2.8.5/logs/hadoop-hadoop-namenode-im-test-hadoop02.out
im-test-hadoop01: starting datanode, logging to /mnt/softwares/hadoop-2.8.5/logs/hadoop-hadoop-datanode-im-test-hadoop01.out
im-test-hadoop02: starting datanode, logging to /mnt/softwares/hadoop-2.8.5/logs/hadoop-hadoop-datanode-im-test-hadoop02.out
im-test-hadoop03: starting datanode, logging to /mnt/softwares/hadoop-2.8.5/logs/hadoop-hadoop-datanode-im-test-hadoop03.out
Starting journal nodes [im-test-hadoop01 im-test-hadoop02 im-test-hadoop03]
im-test-hadoop01: starting journalnode, logging to /mnt/softwares/hadoop-2.8.5/logs/hadoop-hadoop-journalnode-im-test-hadoop01.out
im-test-hadoop02: starting journalnode, logging to /mnt/softwares/hadoop-2.8.5/logs/hadoop-hadoop-journalnode-im-test-hadoop02.out
im-test-hadoop03: starting journalnode, logging to /mnt/softwares/hadoop-2.8.5/logs/hadoop-hadoop-journalnode-im-test-hadoop03.out
Starting ZK Failover Controllers on NN hosts [im-test-hadoop01 im-test-hadoop02]
im-test-hadoop01: starting zkfc, logging to /mnt/softwares/hadoop-2.8.5/logs/hadoop-hadoop-zkfc-im-test-hadoop01.out
im-test-hadoop02: starting zkfc, logging to /mnt/softwares/hadoop-2.8.5/logs/hadoop-hadoop-zkfc-im-test-hadoop02.out
[hadoop@im-test-hadoop01 ~]$ jps
17904 JournalNode
21699 QuorumPeerMain
17496 NameNode
18330 Jps
18236 DFSZKFailoverController
17631 DataNode
[hadoop@im-test-hadoop01 ~]$

Check the processes on the other two servers:

[hadoop@im-test-hadoop02 ~]$ jps
11841 DataNode
12277 Jps
11739 NameNode
16348 QuorumPeerMain
12206 DFSZKFailoverController
11967 JournalNode
[hadoop@im-test-hadoop02 ~]$
[hadoop@im-test-hadoop03 ~]$ jps
18099 JournalNode
18195 Jps
17972 DataNode
24584 QuorumPeerMain
[hadoop@im-test-hadoop03 ~]$

The services started by the start-dfs.sh script include the NameNodes, DataNodes, JournalNodes and ZKFCs.

Because automatic failover is enabled, start-dfs.sh automatically starts a ZKFC process on every server that runs a NameNode. When the ZKFCs start, they elect one of the NameNodes as Active.
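
Once the ZKFCs are running, the election state can be inspected in ZooKeeper. A sketch (not captured from this cluster) of what the znodes created by the failover controller under the cluster1 nameservice are expected to look like:

$ZOOKEEPER_HOME/bin/zkCli.sh
[zk: localhost:2181(CONNECTED) 0] ls /hadoop-ha/cluster1
[ActiveBreadCrumb, ActiveStandbyElectorLock]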

If the cluster services are managed manually instead, a ZKFC process has to be started by hand on every server that runs a NameNode:
$HADOOP_PREFIX/sbin/hadoop-daemon.sh --script $HADOOP_PREFIX/bin/hdfs start zkfc

Test HDFS HA
Check the NameNode states: nn1 is active and nn2 is standby:

[hadoop@im-test-hadoop03 ~]$ /mnt/softwares/hadoop-2.8.5/bin/hdfs haadmin -getServiceState nn1
active
[hadoop@im-test-hadoop03 ~]$ /mnt/softwares/hadoop-2.8.5/bin/hdfs haadmin -getServiceState nn2
standby
[hadoop@im-test-hadoop03 ~]$

Upload a file to Hadoop:

[hadoop@im-test-hadoop01 ~]$ /mnt/softwares/hadoop-2.8.5/bin/hadoop fs -put /mnt/softwares/hadoop-2.8.5/NOTICE.txt /
[hadoop@im-test-hadoop01 ~]$ /mnt/softwares/hadoop-2.8.5/bin/hadoop fs -ls /
Found 1 items
-rw-r--r--   3 hadoop supergroup      15915 2018-11-01 17:25 /NOTICE.txt
[hadoop@im-test-hadoop01 ~]$

Now kill the nn1 process, then check nn2's state and the file uploaded above:

[hadoop@im-test-hadoop01 ~]$ jps
17904 JournalNode
21699 QuorumPeerMain
18933 Jps
17496 NameNode
18236 DFSZKFailoverController
17631 DataNode
[hadoop@im-test-hadoop01 ~]$ kill -9 17496
[hadoop@im-test-hadoop01 ~]$ jps
17904 JournalNode
21699 QuorumPeerMain
18983 Jps
18236 DFSZKFailoverController
17631 DataNode
[hadoop@im-test-hadoop01 ~]$
[hadoop@im-test-hadoop02 ~]$ /mnt/softwares/hadoop-2.8.5/bin/hadoop fs -ls /
18/11/01 17:37:23 WARN ipc.Client: Failed to connect to server: im-test-hadoop01/192.168.3.198:9000: try once and fail.
java.net.ConnectException: 拒絕連接
	at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
	at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
	at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
	at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:531)
	at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:685)
	at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:788)
	at org.apache.hadoop.ipc.Client$Connection.access$3500(Client.java:410)
	at org.apache.hadoop.ipc.Client.getConnection(Client.java:1550)
	at org.apache.hadoop.ipc.Client.call(Client.java:1381)
	at org.apache.hadoop.ipc.Client.call(Client.java:1345)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:227)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:116)
	at com.sun.proxy.$Proxy10.getFileInfo(Unknown Source)
	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:796)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:409)
	at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:163)
	at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:155)
	at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
	at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:346)
	at com.sun.proxy.$Proxy11.getFileInfo(Unknown Source)
	at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:1649)
	at org.apache.hadoop.hdfs.DistributedFileSystem$27.doCall(DistributedFileSystem.java:1440)
	at org.apache.hadoop.hdfs.DistributedFileSystem$27.doCall(DistributedFileSystem.java:1437)
	at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
	at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1437)
	at org.apache.hadoop.fs.Globber.getFileStatus(Globber.java:64)
	at org.apache.hadoop.fs.Globber.doGlob(Globber.java:282)
	at org.apache.hadoop.fs.Globber.glob(Globber.java:148)
	at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:1686)
	at org.apache.hadoop.fs.shell.PathData.expandAsGlob(PathData.java:326)
	at org.apache.hadoop.fs.shell.Command.expandArgument(Command.java:245)
	at org.apache.hadoop.fs.shell.Command.expandArguments(Command.java:228)
	at org.apache.hadoop.fs.shell.FsCommand.processRawArguments(FsCommand.java:103)
	at org.apache.hadoop.fs.shell.Command.run(Command.java:175)
	at org.apache.hadoop.fs.FsShell.run(FsShell.java:317)
	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:90)
	at org.apache.hadoop.fs.FsShell.main(FsShell.java:380)
Found 1 items
-rw-r--r--   3 hadoop supergroup      15915 2018-11-01 17:25 /NOTICE.txt

The steps above verify that data is kept in sync between the NameNodes and that failover happens automatically.

Then start nn1 again; it comes up in the standby state. Stopping nn2 afterwards makes nn1 active again, i.e. the NameNode roles can be switched back and forth.
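
A sketch of that failback check, reusing commands already shown above (start the NameNode on im-test-hadoop01):

/mnt/softwares/hadoop-2.8.5/sbin/hadoop-daemon.sh start namenode
/mnt/softwares/hadoop-2.8.5/bin/hdfs haadmin -getServiceState nn1
/mnt/softwares/hadoop-2.8.5/bin/hdfs haadmin -getServiceState nn2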

Step 4: start Yarn
Run on the im-test-hadoop01 server:

[hadoop@im-test-hadoop01 ~]$ /mnt/softwares/hadoop-2.8.5/sbin/start-yarn.sh
starting yarn daemons
starting resourcemanager, logging to /mnt/softwares/hadoop-2.8.5/logs/yarn-hadoop-resourcemanager-im-test-hadoop01.out
im-test-hadoop01: starting nodemanager, logging to /mnt/softwares/hadoop-2.8.5/logs/yarn-hadoop-nodemanager-im-test-hadoop01.out
im-test-hadoop03: starting nodemanager, logging to /mnt/softwares/hadoop-2.8.5/logs/yarn-hadoop-nodemanager-im-test-hadoop03.out
im-test-hadoop02: starting nodemanager, logging to /mnt/softwares/hadoop-2.8.5/logs/yarn-hadoop-nodemanager-im-test-hadoop02.out
[hadoop@im-test-hadoop01 ~]$ jps
17904 JournalNode
21699 QuorumPeerMain
21971 NodeManager
19428 NameNode
22133 Jps
18236 DFSZKFailoverController
17631 DataNode

Start the ResourceManager on im-test-hadoop02 and im-test-hadoop03:

[hadoop@im-test-hadoop02 ~]$ /mnt/softwares/hadoop-2.8.5/sbin/yarn-daemon.sh start resourcemanager
starting resourcemanager, logging to /mnt/softwares/hadoop-2.8.5/logs/yarn-hadoop-resourcemanager-im-test-hadoop02.out
[hadoop@im-test-hadoop02 ~]$ jps
11841 DataNode
17125 NodeManager
13990 NameNode
17623 Jps
16348 QuorumPeerMain
17357 ResourceManager
12206 DFSZKFailoverController
11967 JournalNode
[hadoop@im-test-hadoop02 ~]$
[hadoop@im-test-hadoop03 ~]$ /mnt/softwares/hadoop-2.8.5/sbin/yarn-daemon.sh start resourcemanager
starting resourcemanager, logging to /mnt/softwares/hadoop-2.8.5/logs/yarn-hadoop-resourcemanager-im-test-hadoop03.out
[hadoop@im-test-hadoop03 ~]$ jps
18099 JournalNode
17972 DataNode
20966 NodeManager
24584 QuorumPeerMain
21308 Jps
21231 ResourceManager
[hadoop@im-test-hadoop03 ~]$

Check the ResourceManager states:

[hadoop@im-test-hadoop02 ~]$ /mnt/softwares/hadoop-2.8.5/bin/yarn rmadmin -getServiceState rm1
active
[hadoop@im-test-hadoop02 ~]$ /mnt/softwares/hadoop-2.8.5/bin/yarn rmadmin -getServiceState rm2
standby
[hadoop@im-test-hadoop02 ~]$

Visit the Yarn web page http://im-test-hadoop02:8088/cluster; the cluster information is displayed normally.
Visiting http://im-test-hadoop03:8088/cluster automatically redirects to http://im-test-hadoop02:8088/cluster, because rm2 is in the standby state.
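
The ResourceManager HA state can also be read over its REST API; a hedged sketch (the cluster info response of Hadoop 2.x ResourceManagers is expected to contain an haState field):

curl -s http://im-test-hadoop02:8088/ws/v1/cluster/info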

Yarn HA test
Now kill the active ResourceManager, rm1, and check rm2's state:

[hadoop@im-test-hadoop02 ~]$ jps
11841 DataNode
18452 Jps
17125 NodeManager
13990 NameNode
16348 QuorumPeerMain
17357 ResourceManager
12206 DFSZKFailoverController
11967 JournalNode
[hadoop@im-test-hadoop02 ~]$ kill -9 17357
[hadoop@im-test-hadoop02 ~]$ jps
11841 DataNode
17125 NodeManager
18485 Jps
13990 NameNode
16348 QuorumPeerMain
12206 DFSZKFailoverController
11967 JournalNode
[hadoop@im-test-hadoop02 ~]$ /mnt/softwares/hadoop-2.8.5/bin/yarn rmadmin -getServiceState rm2
active
[hadoop@im-test-hadoop02 ~]$

rm2 has switched to the active state, so the Yarn HA configuration works.

Start rm1 again:

[hadoop@im-test-hadoop02 ~]$ /mnt/softwares/hadoop-2.8.5/sbin/yarn-daemon.sh start resourcemanager
starting resourcemanager, logging to /mnt/softwares/hadoop-2.8.5/logs/yarn-hadoop-resourcemanager-im-test-hadoop02.out
[hadoop@im-test-hadoop02 ~]$ jps
11841 DataNode
17125 NodeManager
13990 NameNode
18634 ResourceManager
16348 QuorumPeerMain
18701 Jps
12206 DFSZKFailoverController
11967 JournalNode
[hadoop@im-test-hadoop02 ~]$ /mnt/softwares/hadoop-2.8.5/bin/yarn rmadmin -getServiceState rm2
active
[hadoop@im-test-hadoop02 ~]$ /mnt/softwares/hadoop-2.8.5/bin/yarn rmadmin -getServiceState rm1
standby
[hadoop@im-test-hadoop02 ~]$

Kill rm2:

[hadoop@im-test-hadoop03 ~]$ jps
18099 JournalNode
17972 DataNode
21813 Jps
20966 NodeManager
24584 QuorumPeerMain
21231 ResourceManager
[hadoop@im-test-hadoop03 ~]$ kill -9 21231
[hadoop@im-test-hadoop03 ~]$ jps
21888 Jps
18099 JournalNode
17972 DataNode
20966 NodeManager
24584 QuorumPeerMain
[hadoop@im-test-hadoop03 ~]$

rm1 switches back to the active state:

[hadoop@im-test-hadoop02 ~]$ /mnt/softwares/hadoop-2.8.5/bin/yarn rmadmin -getServiceState rm1
active
[hadoop@im-test-hadoop02 ~]$