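5. Testing
5.1 Standby node failure
Kill the postgres database processes on node2 to simulate a database crash on the standby node:
[root@node2 ~]# killall -9 postgres
Check the cluster status at this point: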
[root@node1 ~]# crm_mon -Afr -1
Last updated: Wed Jan 22 02:15:06 2014
Last change: Wed Jan 22 02:15:33 2014 via crm_attribute on node1
Stack: classic openais (with plugin)
Current DC: node1 - partition with quorum
Version: 1.1.10-14.el6_5.2-368c726
2 Nodes configured, 2 expected votes
7 Resources configured
Online: [ node1 node2 ]
Full list of resources:
vip-slave (ocf::heartbeat:IPaddr2): Started node1
Resource Group: master-group
    vip-master (ocf::heartbeat:IPaddr2): Started node1
    vip-rep (ocf::heartbeat:IPaddr2): Started node1
Master/Slave Set: msPostgresql [pgsql]
    Masters: [ node1 ]
    Stopped: [ node2 ]
Clone Set: clnPingCheck [pingCheck]
    Started: [ node1 node2 ]
Node Attributes:
* Node node1:
    + default_ping_set : 100
    + master-pgsql : 1000
    + pgsql-data-status : LATEST
    + pgsql-master-baseline : 0000000006000078
    + pgsql-status : PRI
* Node node2:
    + default_ping_set : 100
    + master-pgsql : -INFINITY
    + pgsql-data-status : DISCONNECT
    + pgsql-status : STOP
Migration summary:
* Node node2:
    pgsql: migration-threshold=1 fail-count=1 last-failure='Wed Jan 22 02:15:35 2014'
* Node node1:
Failed actions:
    pgsql_monitor_7000 on node2 'not running' (7): call=42, status=complete, last-rc-change='Wed Jan 22 02:14:58 2014', queued=0ms, exec=0ms
{The vip-slave resource has been switched to node1 successfully}
Restart corosync on node2; the database is started again along with it:
[root@node2 ~]# service corosync restart
[root@node1 ~]# crm_mon -Afr -1
Last updated: Wed Jan 22 02:16:24 2014
Last change: Wed Jan 22 02:16:55 2014 via crm_attribute on node1
Stack: classic openais (with plugin)
Current DC: node1 - partition with quorum
Version: 1.1.10-14.el6_5.2-368c726
2 Nodes configured, 2 expected votes
7 Resources configured
Online: [ node1 node2 ]
Full list of resources:
vip-slave (ocf::heartbeat:IPaddr2): Started node2
Resource Group: master-group
    vip-master (ocf::heartbeat:IPaddr2): Started node1
    vip-rep (ocf::heartbeat:IPaddr2): Started node1
Master/Slave Set: msPostgresql [pgsql]
    Masters: [ node1 ]
    Slaves: [ node2 ]
Clone Set: clnPingCheck [pingCheck]
    Started: [ node1 node2 ]
Node Attributes:
* Node node1:
    + default_ping_set : 100
    + master-pgsql : 1000
    + pgsql-data-status : LATEST
    + pgsql-master-baseline : 0000000006000078
    + pgsql-status : PRI
* Node node2:
    + default_ping_set : 100
    + master-pgsql : 100
    + pgsql-data-status : STREAMING|SYNC
    + pgsql-status : HS:sync
Migration summary:
* Node node2:
* Node node1:
{vip-slave has moved back to node2}
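The check above is done by reading the crm_mon output by eye. As a rough sketch, the pgsql-status attribute of one node can also be pulled out of that output with a small shell helper. The function name pgsql_status_of is made up for this illustration, and the parsing assumes the "Node Attributes" layout shown in the transcripts above.

```shell
#!/bin/sh
# Hypothetical helper (not part of pacemaker or crmsh).
# Reads `crm_mon -Afr -1` output on stdin and prints the pgsql-status
# attribute of the node given as $1 (for example PRI, HS:sync or STOP).
pgsql_status_of() {
    awk -v node="$1" '
        # "* Node node2:" starts the attribute list of one node
        $1 == "*" && $2 == "Node" { current = $3; sub(/:$/, "", current) }
        # "+ pgsql-status : STOP" is the attribute line we are after
        current == node && $1 == "+" && $2 == "pgsql-status" { print $4 }
    '
}
```

It could be used as `crm_mon -Afr -1 | pgsql_status_of node2`. Nothing is printed when the node or the attribute is not present in the output.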
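5.2 Master node failover
Kill the postgres database processes on node1 to simulate a database crash on the master node:
[root@node1 ~]# killall -9 postgres
After a short wait, check the cluster status: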
[root@node2 ~]# crm_mon -Afr -1
Last updated: Wed Jan 22 02:17:50 2014
Last change: Wed Jan 22 02:18:16 2014 via crm_attribute on node2
Stack: classic openais (with plugin)
Current DC: node1 - partition with quorum
Version: 1.1.10-14.el6_5.2-368c726
2 Nodes configured, 2 expected votes
7 Resources configured
Online: [ node1 node2 ]
Full list of resources:
vip-slave (ocf::heartbeat:IPaddr2): Started node2
Resource Group: master-group
    vip-master (ocf::heartbeat:IPaddr2): Started node2
    vip-rep (ocf::heartbeat:IPaddr2): Started node2
Master/Slave Set: msPostgresql [pgsql]
    Masters: [ node2 ]
    Stopped: [ node1 ]
Clone Set: clnPingCheck [pingCheck]
    Started: [ node1 node2 ]
Node Attributes:
* Node node1:
    + default_ping_set : 100
    + master-pgsql : -INFINITY
    + pgsql-data-status : DISCONNECT
    + pgsql-status : STOP
* Node node2:
    + default_ping_set : 100
    + master-pgsql : 1000
    + pgsql-data-status : LATEST
    + pgsql-master-baseline : 0000000008014A70
    + pgsql-status : PRI
Migration summary:
* Node node2:
* Node node1:
    pgsql: migration-threshold=1 fail-count=1 last-failure='Wed Jan 22 02:18:11 2014'
Failed actions:
    pgsql_monitor_2000 on node1 'not running' (7): call=2435, status=complete, last-rc-change='Wed Jan 22 02:18:11 2014', queued=0ms, exec=0ms
{vip-master and vip-rep have both been switched to node2 successfully, node2 has become the master, and the status of the pg database on node2 has changed to PRI}
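The condition described in the note (the master role and vip-master end up on the same node) can be checked mechanically. The following is only a sketch: the three function names are invented for this example, and the parsing assumes the "Started <node>" and "Masters: [ <node> ]" lines exactly as they appear in the crm_mon output above.

```shell
#!/bin/sh
# Hypothetical helpers; all of them read `crm_mon -Afr -1` output on stdin.

# Print the node on which resource $1 is reported as Started.
resource_node() {
    awk -v res="$1" '$1 == res && $(NF-1) == "Started" { print $NF }'
}

# Print the node listed in the "Masters: [ ... ]" line.
current_master() {
    awk '$1 == "Masters:" { print $3 }'
}

# Succeed only if a master exists and vip-master runs on the same node.
failover_consistent() {
    status=$(cat)
    master=$(printf '%s\n' "$status" | current_master)
    vip=$(printf '%s\n' "$status" | resource_node vip-master)
    [ -n "$master" ] && [ "$master" = "$vip" ]
}
```

For example, `crm_mon -Afr -1 | failover_consistent && echo "failover looks complete"`. The check deliberately ignores vip-rep, which belongs to the same resource group as vip-master in this configuration.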
Stop corosync on node1:
[root@node1 ~]# service corosync stop
Perform a base synchronization:
[postgres@node1 data]$ pwd
/opt/pgsql/data
[postgres@node1 data]$ rm -rf *
[postgres@node1 data]$ pg_basebackup -h 192.168.1.3 -U postgres -D /opt/pgsql/data/ -P
19172/19172 kB (100%), 1/1 tablespace
NOTICE: pg_stop_backup complete, all required WAL segments have been archived
[postgres@node1 data]$ ls
backup_label base pg_clog pg_ident.conf pg_notify pg_stat_tmp pg_tblspc PG_VERSION postgresql.conf
backup_label.old global pg_hba.conf pg_multixact pg_serial pg_subtrans pg_twophase pg_xlog recovery.done
Start corosync on node1:
[root@node1 ~]# service corosync start
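The `rm -rf *` step in the base synchronization above depends entirely on the current directory being the data directory. As a cautious sketch, the cleanup could be wrapped in a guard that refuses to delete anything unless the target looks like a PostgreSQL data directory. The function name clear_pgdata and the PG_VERSION check are assumptions made for this example, not something the original procedure uses; `-mindepth` and `-delete` are extensions available in GNU find.

```shell
#!/bin/sh
# Hypothetical guard around the "rm -rf *" step (not from the original article).
# Empties the directory given as $1, but only if it contains a PG_VERSION file.
clear_pgdata() {
    dir=$1
    if [ -z "$dir" ] || [ ! -f "$dir/PG_VERSION" ]; then
        echo "refusing: '$dir' does not look like a PostgreSQL data directory" >&2
        return 1
    fi
    # Remove everything below $dir (dotfiles included), keep $dir itself.
    find "$dir" -mindepth 1 -delete
}
```

It would be called as `clear_pgdata /opt/pgsql/data` by the postgres user before running pg_basebackup. One consequence of the guard is that a directory that has already been emptied is rejected, because PG_VERSION is gone.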
5.3 Master node recovery
After the original master node has been repaired, bring it back as the current standby node.
Perform a base synchronization on node1:
[postgres@node1 data]$ pwd
/opt/pgsql/data
[postgres@node1 data]$ rm -rf *
[postgres@node1 data]$ pg_basebackup -h 192.168.2.3 -U postgres -D /opt/pgsql/data/ -P
19172/19172 kB (100%), 1/1 tablespace
NOTICE: pg_stop_backup complete, all required WAL segments have been archived
[postgres@node1 data]$ ls
backup_label base pg_clog pg_ident.conf pg_notify pg_stat_tmp pg_tblspc PG_VERSION postgresql.conf
backup_label.old global pg_hba.conf pg_multixact pg_serial pg_subtrans pg_twophase pg_xlog recovery.done
The resource lock must be deleted before heartbeat is started; otherwise the resource will not start along with heartbeat:
[root@node1 ~]# rm -rf /var/lib/pgsql/tmp/PGSQL.lock
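{This lock file is created when the node becomes the master. It is not removed automatically when heartbeat stops abnormally or when the database or the system terminates abnormally, so whenever a node that has acted as master is being recovered, the lock file has to be cleaned up manually}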
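Since the lock file only exists on a node that has been master, a recovery script may want to remove it conditionally and report what it did. A minimal sketch follows; the function name remove_pgsql_lock is hypothetical, and the default path is simply the one used in the command above.

```shell
#!/bin/sh
# Hypothetical helper: remove the pgsql resource lock if it is present.
# $1 is the lock path; it defaults to the path used in this article.
remove_pgsql_lock() {
    lock=${1:-/var/lib/pgsql/tmp/PGSQL.lock}
    if [ -e "$lock" ]; then
        rm -f -- "$lock" && echo "removed $lock"
    else
        echo "no lock file at $lock"
    fi
}
```

Running `remove_pgsql_lock` as root on the node being recovered has the same effect as the manual `rm` above, and does nothing on a node that never held the master role.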
Restart heartbeat on node1:
[root@node1 ~]# service heartbeat restart
After a while, check the cluster status:
[root@node2 ~]# crm_mon -Afr1
============
Last updated: Mon Jan 27 08:50:43 2014
Stack: Heartbeat
Current DC: node2 (f2dcd1df-7429-42f5-82e9-b73921f97cab) - partition with quorum
Version: 1.0.12-unknown
2 Nodes configured, unknown expected votes
4 Resources configured.
============
Online: [ node1 node2 ]
Full list of resources:
vip-slave (ocf::heartbeat:IPaddr2): Started node1
Resource Group: master-group
    vip-master (ocf::heartbeat:IPaddr2): Started node2
    vip-rep (ocf::heartbeat:IPaddr2): Started node2
Master/Slave Set: msPostgresql
    Masters: [ node2 ]
    Slaves: [ node1 ]
Clone Set: clnPingCheck
    Started: [ node1 node2 ]
Node Attributes:
* Node node1:
    + default_ping_set : 100
    + master-pgsql:0 : 100
    + pgsql-data-status : STREAMING|SYNC
    + pgsql-status : HS:sync
* Node node2:
    + default_ping_set : 100
    + master-pgsql:1 : 1000
    + pgsql-data-status : LATEST
    + pgsql-master-baseline : 00000000120000B0
    + pgsql-status : PRI
Migration summary:
* Node node1:
* Node node2:
{vip-slave has been switched to node1 successfully, and node1 has become the streaming replication standby node}
6. Administration
6.1 Starting and stopping corosync
[root@node1 ~]# service corosync start
[root@node1 ~]# service corosync stop
6.2 Viewing the HA status
[root@node1 ~]# crm status
Last updated: Tue Jan 21 23:55:13 2014
Last change: Tue Jan 21 23:37:36 2014 via crm_attribute on node1
Stack: classic openais (with plugin)
Current DC: node1 - partition with quorum
Version: 1.1.10-14.el6_5.2-368c726
2 Nodes configured, 2 expected votes
7 Resources configured
Online: [ node1 node2 ]
vip-slave (ocf::heartbeat:IPaddr2): Started node2
Resource Group: master-group
    vip-master (ocf::heartbeat:IPaddr2): Started node1
    vip-rep (ocf::heartbeat:IPaddr2): Started node1
Master/Slave Set: msPostgresql [pgsql]
    Masters: [ node1 ]
    Slaves: [ node2 ]
Clone Set: clnPingCheck [pingCheck]
    Started: [ node1 node2 ]
6.3 Viewing resource status and node attributes
[root@node1 ~]# crm_mon -Afr -1
Last updated: Tue Jan 21 23:37:20 2014
Last change: Tue Jan 21 23:37:36 2014 via crm_attribute on node1
Stack: classic openais (with plugin)
Current DC: node1 - partition with quorum
Version: 1.1.10-14.el6_5.2-368c726
2 Nodes configured, 2 expected votes
7 Resources configured
Online: [ node1 node2 ]
Full list of resources:
vip-slave (ocf::heartbeat:IPaddr2): Started node2
Resource Group: master-group
    vip-master (ocf::heartbeat:IPaddr2): Started node1
    vip-rep (ocf::heartbeat:IPaddr2): Started node1
Master/Slave Set: msPostgresql [pgsql]
    Masters: [ node1 ]
    Slaves: [ node2 ]
Clone Set: clnPingCheck [pingCheck]
    Started: [ node1 node2 ]
Node Attributes:
* Node node1:
    + default_ping_set : 100
    + master-pgsql : 1000
    + pgsql-data-status : LATEST
    + pgsql-master-baseline : 0000000006000078
    + pgsql-status : PRI
* Node node2:
    + default_ping_set : 100
    + master-pgsql : 100
    + pgsql-data-status : STREAMING|SYNC
    + pgsql-status : HS:sync
Migration summary:
* Node node2:
* Node node1:
6.4 Viewing the configuration
[root@node1 ~]# crm configure show
node node1 \
    attributes pgsql-data-status="LATEST"
node node2 \
    attributes pgsql-data-status="STREAMING|SYNC"
primitive pgsql ocf:heartbeat:pgsql \
    params pgctl="/opt/pgsql/bin/pg_ctl" psql="/opt/pgsql/bin/psql" pgdata="/opt/pgsql/data/" start_opt="-p 5432" rep_mode="sync" node_list="node1 node2" restore_command="cp /opt/archivelog/%f %p" primary_conninfo_opt="keepalives_idle=60 keepalives_interval=5 keepalives_count=5" master_ip="192.168.1.3" stop_escalate="0" \
    op start timeout="60s" interval="0s" on-fail="restart" \
    op monitor timeout="60s" interval="7s" on-fail="restart" \
    op monitor timeout="60s" interval="2s" on-fail="restart" role="Master" \
    op promote timeout="60s" interval="0s" on-fail="restart" \
    op demote timeout="60s" interval="0s" on-fail="stop" \
……
……
6.5 Monitoring HA in real time
[root@node1 ~]# crm_mon -Afr
Last updated: Wed Jan 22 00:40:12 2014
Last change: Tue Jan 21 23:37:36 2014 via crm_attribute on node1
Stack: classic openais (with plugin)
Current DC: node1 - partition with quorum
Version: 1.1.10-14.el6_5.2-368c726
2 Nodes configured, 2 expected votes
7 Resources configured
Online: [ node1 node2 ]
Full list of resources:
vip-slave (ocf::heartbeat:IPaddr2): Started node2
Resource Group: master-group
    vip-master (ocf::heartbeat:IPaddr2): Started node1
    vip-rep (ocf::heartbeat:IPaddr2): Started node1
Master/Slave Set: msPostgresql [pgsql]
    Masters: [ node1 ]
    Slaves: [ node2 ]
Clone Set: clnPingCheck [pingCheck]
    Started: [ node1 node2 ]
Node Attributes:
* Node node1:
    + default_ping_set : 100
    + master-pgsql : 1000
    + pgsql-data-status : LATEST
    + pgsql-master-baseline : 0000000006000078
    + pgsql-status : PRI
* Node node2:
    + default_ping_set : 100
    + master-pgsql : 100
    + pgsql-data-status : STREAMING|SYNC
    + pgsql-status : HS:sync
Migration summary:
* Node node2:
* Node node1:
6.6 The crm_resource command
Starting and stopping a resource:
[root@node1 ~]# crm_resource -r vip-master -v started
[root@node1 ~]# crm_resource -r vip-master -v stopped
Listing resources:
[root@node1 ~]# crm_resource -L
vip-slave (ocf::heartbeat:IPaddr2): Started
Resource Group: master-group
    vip-master (ocf::heartbeat:IPaddr2): Started
    vip-rep (ocf::heartbeat:IPaddr2): Started
Master/Slave Set: msPostgresql [pgsql]
    Masters: [ node1 ]
    Slaves: [ node2 ]
Clone Set: clnPingCheck [pingCheck]
    Started: [ node1 node2 ]
Locating a resource:
[root@node1 ~]# crm_resource -W -r pgsql
resource pgsql is running on: node2
Migrating a resource:
[root@node1 ~]# crm_resource -M -r vip-slave -N node2
Deleting a resource:
[root@node1 ~]# crm_resource -D -r vip-slave -t primitive
6.7 The crm command
Listing the resource agents (RAs) of a given class and provider:
[root@node1 ~]# crm ra list ocf pacemaker
ClusterMon Dummy HealthCPU HealthSMART Stateful SysInfo SystemHealth controld ping pingd remote
Deleting a node:
[root@node1 ~]# crm node delete node2
Putting a node into standby:
[root@node1 ~]# crm node standby node2
Bringing a node back online:
[root@node1 ~]# crm node online node2
Configuring pacemaker:
[root@node1 ~]# crm configure
crm(live)configure# ……
……
crm(live)configure# commit
crm(live)configure# quit
6.8 Resetting the failcount
[root@node1 ~]# crm resource
crm(live)resource# failcount pgsql set node1 0
crm(live)resource# failcount pgsql show node1
scope=status name=fail-count-pgsql value=0
[root@node1 ~]# crm resource cleanup pgsql
Cleaning up pgsql:0 on node1
Waiting for 1 replies from the CRMd. OK
[root@node1 ~]# crm_failcount -G -U node1 -r pgsql
scope=status name=fail-count-pgsql value=INFINITY
[root@node1 ~]# crm_failcount -D -U node1 -r pgsql
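The query and delete steps above can be combined so that the failcount is only cleared when it is actually non-zero. The sketch below only parses the `value=` field from the query output shown above. The helper names failcount_value and needs_failcount_reset are invented for this example.

```shell
#!/bin/sh
# Hypothetical helpers that read the output of `crm_failcount -G ...` on stdin,
# e.g. "scope=status name=fail-count-pgsql value=INFINITY".

# Print the text after "value=".
failcount_value() {
    sed -n 's/.*[[:space:]]value=\([^[:space:]]*\).*/\1/p'
}

# Succeed if a failcount was reported and it is not 0.
needs_failcount_reset() {
    value=$(failcount_value)
    [ -n "$value" ] && [ "$value" != "0" ]
}
```

A possible use, with the same options as in the transcript: `crm_failcount -G -U node1 -r pgsql | needs_failcount_reset && crm_failcount -D -U node1 -r pgsql`.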
7. Troubleshooting log
7.1 Q1
Symptom:
The corosync.log log reports the following errors:
Jan 15 10:23:57 node1 lrmd: [6327]: info: RA output: (pgsql:0:monitor:stderr) /usr/lib/ocf/resource.d//heartbeat/pgsql: line 1749: ocf_local_nodename: command not found
Jan 15 10:23:57 node1 crm_attribute: [11094]: info: Invoked: /usr/sbin/crm_attribute -l reboot -N -n -v 0000000006000090 pgsql-xlog-loc lrm_get_rsc_type_metadata(578)
Jan 15 10:23:57 node1 lrmd: [6327]: info: RA output: (pgsql:0:monitor:stderr) Could not map uname=-n to a UUID: The object/attribute does not exist
Solution:
Inspecting the pgsql script shows that it uses ocf_local_nodename. This function should be defined in ocf-shellfuncs.in, but it is missing there. A related forum thread:
http://www.gossamer-threads.com/lists/linuxha/users/89379?do=post_view_threaded
points out that a patch is required in this case. The patch that adds the ocf_local_nodename function:
https://github.com/ClusterLabs/resource-agents/commit/abc1c3f6464f6e5e7a1e41cd7c9b8179896c1903
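The latest version does not have the ocf_local_nodename function, so the following version is used:
{Note: make sure the pacemaker version is later than 1.1.8; otherwise the crm_node -n command cannot be used}
pgsql script that does not contain the ocf_local_nodename function: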
7.2 Q2
Symptom:
[root@node1 ~]# crm configure load update pgsql.crm
WARNING: pingCheck: specified timeout 60s for start is smaller than the advised 90
WARNING: pingCheck: specified timeout 60s for stop is smaller than the advised 100
WARNING: pgsql: specified timeout 60s for stop is smaller than the advised 120
WARNING: pgsql: specified timeout 60s for start is smaller than the advised 120
WARNING: pgsql: specified timeout 60s for notify is smaller than the advised 90
WARNING: pgsql: specified timeout 60s for demote is smaller than the advised 120
WARNING: pgsql: specified timeout 60s for promote is smaller than the advised 120
ERROR: master-group: attribute ordered does not exist
Do you still want to commit? no
Solution:
The error message says that the ordered attribute does not exist in the master-group definition.
(1) The problem is caused by the pacemaker version: pacemaker 1.1 does not support the ordered and colocated attributes. An attempt to work around this by replacing the current cibconfig.py with the one from version 1.0, as shown below, failed:
[root@node1 ~]# vim /usr/lib64/python2.6/site-packages/crmsh/cibconfig.py
[root@node1 ~]# cd /usr/lib64/python2.6/site-packages/crmsh/
[root@node1 crmsh]# mv cibconfig.py cibconfig.py.bak
[root@node1 crmsh]# wget https://github.com/ClusterLabs/pacemaker-1.0/blob/fa1a99ab36e0ed015f1bcbbb28f7db962a9d1abc/shell/modules/cibconfig.py
(2) Removing the ordered definition from the configuration script (this worked):
group master-group \
vip-master \
vip-rep \
meta \
ordered="false"
changed to:
group master-group \
vip-master \
vip-rep
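Before loading a configuration file it can be useful to see whether it still contains the attributes that triggered the error. The following sketch simply greps a crm configuration file for `ordered=` and `colocated=` and prints the matching lines with their line numbers. The function name find_unsupported_group_meta is hypothetical, and the check is purely textual; it does not understand crm syntax.

```shell
#!/bin/sh
# Hypothetical helper: list lines of a crm configuration file ($1) that set
# the ordered or colocated meta attributes. Exit status is that of grep:
# 0 when at least one such line exists, 1 when there is none.
find_unsupported_group_meta() {
    grep -nE '(ordered|colocated)=' "$1"
}
```

For instance, `find_unsupported_group_meta pgsql.crm` run against the original file would be expected to point at the `ordered="false"` line, and to print nothing after the change shown above.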
7.3 Q3
Symptom:
An error is reported while installing pacemaker:
# yum install pacemaker*
……
--> Processing Dependency: libesmtp.so.5()(64bit) for package: pacemaker
--> Finished Dependency Resolution
pacemaker-1.0.12-1.el5.centos.i386 from clusterlabs has depsolving problems
  --> Missing Dependency: libesmtp.so.5 is needed by package pacemaker-1.0.12-1.el5.centos.i386 (clusterlabs)
pacemaker-1.0.12-1.el5.centos.x86_64 from clusterlabs has depsolving problems
  --> Missing Dependency: libesmtp.so.5()(64bit) is needed by package pacemaker-1.0.12-1.el5.centos.x86_64 (clusterlabs)
Error: Missing Dependency: libesmtp.so.5 is needed by package pacemaker-1.0.12-1.el5.centos.i386 (clusterlabs)
Error: Missing Dependency: libesmtp.so.5()(64bit) is needed by package pacemaker-1.0.12-1.el5.centos.x86_64 (clusterlabs)
You could try using --skip-broken to work around the problem
You could try running: package-cleanup --problems
package-cleanup --dupes
rpm -Va --nofiles --nodigest
The program package-cleanup is found in the yum-utils package.
Solution:
The message indicates that libesmtp is missing; installing it resolves the problem:
# wget ftp://ftp.univie.ac.at/systems/linux/fedora/epel/5/x86_64/libesmtp-1.0.4-5.el5.x86_64.rpm
# wget ftp://ftp.univie.ac.at/systems/linux/fedora/epel/5/i386/libesmtp-1.0.4-5.el5.i386.rpm
# rpm -ivh libesmtp-1.0.4-5.el5.x86_64.rpm
# rpm -ivh libesmtp-1.0.4-5.el5.i386.rpm
7.4 Q4
Symptom:
An error is reported while loading the crm configuration:
[root@node1 ~]# crm configure load update pgsql.crm
ERROR: pgsql: parameter rep_mode does not exist
ERROR: pgsql: parameter node_list does not exist
ERROR: pgsql: parameter master_ip does not exist
ERROR: pgsql: parameter restore_command does not exist
ERROR: pgsql: parameter primary_conninfo_opt does not exist
WARNING: pgsql: specified timeout 60s for stop is smaller than the advised 120
WARNING: pgsql: action monitor_Master not advertised in meta-data, it may not be supported by the RA
WARNING: pgsql: specified timeout 60s for start is smaller than the advised 120
WARNING: pgsql: action notify not advertised in meta-data, it may not be supported by the RA
WARNING: pgsql: action demote not advertised in meta-data, it may not be supported by the RA
WARNING: pgsql: action promote not advertised in meta-data, it may not be supported by the RA
WARNING: pingCheck: specified timeout 60s for start is smaller than the advised 90
WARNING: pingCheck: specified timeout 60s for stop is smaller than the advised 100
Do you still want to commit? no
Solution:
The parameters are reported as nonexistent because the installed pgsql script is too old and needs to be replaced:
scp pgsql [email protected]:/usr/lib/ocf/resource.d/heartbeat/
scp ocf-shellfuncs.in [email protected]:/usr/lib/ocf/lib/heartbeat/
scp pgsql [email protected]:/usr/lib/ocf/resource.d/heartbeat/
scp ocf-shellfuncs.in [email protected]:/usr/lib/ocf/lib/heartbeat/
7.5 Q5
Symptom:
[root@node1 ~]# crm_mon -Afr -1
Last updated: Tue Jan 21 05:10:56 2014
Last change: Tue Jan 21 05:10:08 2014 via cibadmin on node1
Stack: classic openais (with plugin)
Current DC: node1 - partition with quorum
Version: 1.1.10-14.el6_5.2-368c726
2 Nodes configured, 2 expected votes
7 Resources configured
Online: [ node1 node2 ]
Full list of resources:
vip-slave (ocf::heartbeat:IPaddr2): Stopped
Resource Group: master-group
    vip-master (ocf::heartbeat:IPaddr2): Stopped
    vip-rep (ocf::heartbeat:IPaddr2): Stopped
Master/Slave Set: msPostgresql [pgsql]
    Stopped: [ node1 node2 ]
Clone Set: clnPingCheck [pingCheck]
    Stopped: [ node1 node2 ]
Node Attributes:
* Node node1:
* Node node2:
Migration summary:
* Node node1:
* Node node2:
Failed actions:
    pingCheck_monitor_0 on node1 'invalid parameter' (2): call=23, status=complete, last-rc-change='Tue Jan 21 05:10:10 2014', queued=200ms, exec=0ms
    pingCheck_monitor_0 on node2 'invalid parameter' (2): call=23, status=complete, last-rc-change='Tue Jan 21 05:09:36 2014', queued=281ms, exec=0ms
Solution:
This error occurs because the pingd script called by pingCheck in the configuration is given a parameter it does not know. On inspection, ocf/pacemaker/pingd has no multiplier parameter:
primitive pingCheck ocf:pacemaker:pingd \
params \
name="default_ping_set" \
host_list="192.168.100.1" \
multiplier="100" \
op start timeout="60s" interval="0s" on-fail="restart" \
op monitor timeout="60s" interval="10s" on-fail="restart" \
op stop timeout="60s" interval="0s" on-fail="ignore"
The call was therefore changed to ocf:heartbeat:pingd.
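An 'invalid parameter' failure like this one can be anticipated by checking which parameters a resource agent advertises in its OCF metadata. The sketch below assumes the metadata XML is supplied on stdin and that parameters appear as `<parameter name="...">` elements, as in standard OCF metadata. The function name ra_has_param is made up for this example.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the OCF metadata read from stdin declares
# a parameter named $1. The match is a plain text search, not an XML parse.
ra_has_param() {
    grep -q "<parameter name=\"$1\""
}

# Possible use (assumes the agent is installed under /usr/lib/ocf):
#   OCF_ROOT=/usr/lib/ocf /usr/lib/ocf/resource.d/pacemaker/pingd meta-data \
#       | ra_has_param multiplier || echo "multiplier is not supported"
```

Because the search includes the closing quote, a parameter such as host_list is not mistaken for a shorter name like host.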
7.6 Q6
Symptom:
The corosync log reports the following errors:
Jan 21 04:36:02 corosync [TOTEM ] Received message has invalid digest... ignoring.
Jan 21 04:36:02 corosync [TOTEM ] Invalid packet data
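Solution:
This indicates that the same multicast address is already in use on the network. Changing the multicast address resolves the problem.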
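To compare the multicast settings of different clusters on the same network, the relevant lines can be extracted from each node's corosync configuration. This is a sketch only: the function name mcast_settings is hypothetical, and the default path /etc/corosync/corosync.conf is an assumption about where the configuration lives on these nodes.

```shell
#!/bin/sh
# Hypothetical helper: print the mcastaddr and mcastport lines of a corosync
# configuration file. $1 is the file; the default path is an assumption.
mcast_settings() {
    awk '$1 == "mcastaddr:" || $1 == "mcastport:" { print $1, $2 }' \
        "${1:-/etc/corosync/corosync.conf}"
}
```

Running `mcast_settings` on a node of each cluster and comparing the results shows whether two clusters share the same address and port.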
8. References
Script:
https://github.com/ClusterLabs/resource-agents/blob/master/heartbeat/pgsql
Script usage notes:
The crm_resource command:
http://www.novell.com/zh-cn/documentation/sle_ha/book_sleha/data/man_crmresource.html
The crm_failcount command:
http://www.novell.com/zh-cn/documentation/sle_ha/book_sleha/data/man_crmfailcount.html