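5. Testing
5.1 Standby node failure
Kill the postgres processes on node2 to simulate a database crash on the standby node:
[root@node2 ~]# killall -9 postgres
Check the cluster status at this point: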
[root@node1 ~]# crm_mon -Afr1
============
Last updated: Mon Jan 27 08:36:49 2014
Stack: Heartbeat
Current DC: node1 (30b7dc95-25c5-40d7-b1e4-7eaf2d5cdf07) - partition with quorum
Version: 1.0.12-unknown
2 Nodes configured, unknown expected votes
4 Resources configured.
============

Online: [ node1 node2 ]

Full list of resources:

 vip-slave (ocf::heartbeat:IPaddr2): Started node1
 Resource Group: master-group
     vip-master (ocf::heartbeat:IPaddr2): Started node1
     vip-rep (ocf::heartbeat:IPaddr2): Started node1
 Master/Slave Set: msPostgresql
     Masters: [ node1 ]
     Stopped: [ pgsql:1 ]
 Clone Set: clnPingCheck
     Started: [ node1 node2 ]

Node Attributes:
* Node node1:
    + default_ping_set : 100
    + master-pgsql:0 : 1000
    + pgsql-data-status : LATEST
    + pgsql-master-baseline : 0000000010000000
    + pgsql-status : PRI
* Node node2:
    + default_ping_set : 100
    + master-pgsql:1 : -INFINITY
    + pgsql-data-status : DISCONNECT
    + pgsql-status : STOP

Migration summary:
* Node node1:
* Node node2:
   pgsql:1: migration-threshold=1 fail-count=1

Failed actions:
    pgsql:1_monitor_7000 (node=node2, call=11, rc=7, status=complete): not running
{The vip-slave resource has been moved to node1.}
Restart heartbeat on node2; the database is started again along with it:
[root@node2 ~]# service heartbeat restart
Check the status again after a short while:
[root@node1 ~]# crm_mon -Afr1
============
Last updated: Mon Jan 27 08:39:16 2014
Stack: Heartbeat
Current DC: node1 (30b7dc95-25c5-40d7-b1e4-7eaf2d5cdf07) - partition with quorum
Version: 1.0.12-unknown
2 Nodes configured, unknown expected votes
4 Resources configured.
============

Online: [ node1 node2 ]

Full list of resources:

 vip-slave (ocf::heartbeat:IPaddr2): Started node2
 Resource Group: master-group
     vip-master (ocf::heartbeat:IPaddr2): Started node1
     vip-rep (ocf::heartbeat:IPaddr2): Started node1
 Master/Slave Set: msPostgresql
     Masters: [ node1 ]
     Slaves: [ node2 ]
 Clone Set: clnPingCheck
     Started: [ node1 node2 ]

Node Attributes:
* Node node1:
    + default_ping_set : 100
    + master-pgsql:0 : 1000
    + pgsql-data-status : LATEST
    + pgsql-master-baseline : 0000000010000000
    + pgsql-status : PRI
* Node node2:
    + default_ping_set : 100
    + master-pgsql:1 : 100
    + pgsql-data-status : STREAMING|SYNC
    + pgsql-status : HS:sync

Migration summary:
* Node node1:
* Node node2:
{vip-slave has moved back to node2, and streaming replication has been re-established.}
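The checks in this test come down to reading a few node attributes out of the crm_mon output. The following is a minimal sketch of a helper that does this with awk. The function name node_attr is made up for illustration, and it assumes the "Node Attributes" layout shown in the outputs above.

```shell
#!/bin/sh
# node_attr NODE ATTR (hypothetical helper, not part of Pacemaker):
# read `crm_mon -Afr1` output on stdin and print the value of ATTR
# from the "Node Attributes" block that belongs to NODE.
node_attr() {
    awk -v node="$1" -v attr="$2" '
        /^Node Attributes:/   { in_attrs = 1; next }
        /^Migration summary:/ { in_attrs = 0 }
        # "* Node node1:" starts the attribute block of a node
        in_attrs && $1 == "*" && $2 == "Node" { cur = $3; sub(/:$/, "", cur); next }
        # "+ pgsql-status : HS:sync" is one attribute line
        in_attrs && cur == node && $1 == "+" && $2 == attr { print $4; exit }
    '
}

# Usage on a cluster node (as root):
#   crm_mon -Afr1 | node_attr node2 pgsql-status
```

After a successful standby recovery the helper should print HS:sync for node2, matching the output above.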
5.2 Master node failover
Kill the postgres processes on node1 to simulate a database crash on the master node:
[root@node1 ~]# killall -9 postgres
After a short wait, check the cluster status:
[root@node2 ~]# crm_mon -Afr -1
============
Last updated: Mon Jan 27 08:43:03 2014
Stack: Heartbeat
Current DC: node1 (30b7dc95-25c5-40d7-b1e4-7eaf2d5cdf07) - partition with quorum
Version: 1.0.12-unknown
2 Nodes configured, unknown expected votes
4 Resources configured.
============

Online: [ node1 node2 ]

Full list of resources:

 vip-slave (ocf::heartbeat:IPaddr2): Started node2
 Resource Group: master-group
     vip-master (ocf::heartbeat:IPaddr2): Started node2
     vip-rep (ocf::heartbeat:IPaddr2): Started node2
 Master/Slave Set: msPostgresql
     Masters: [ node2 ]
     Stopped: [ pgsql:0 ]
 Clone Set: clnPingCheck
     Started: [ node1 node2 ]

Node Attributes:
* Node node1:
    + default_ping_set : 100
    + master-pgsql:0 : -INFINITY
    + pgsql-data-status : DISCONNECT
    + pgsql-status : STOP
* Node node2:
    + default_ping_set : 100
    + master-pgsql:1 : 1000
    + pgsql-data-status : LATEST
    + pgsql-master-baseline : 00000000120000B0
    + pgsql-status : PRI

Migration summary:
* Node node1:
   pgsql:0: migration-threshold=1 fail-count=1
* Node node2:

Failed actions:
    pgsql:0_monitor_2000 (node=node1, call=25, rc=7, status=complete): not running
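{vip-master and vip-rep have both moved to node2, node2 has been promoted to master, and the PostgreSQL status on node2 is now PRI.}
5.3 Recovering the old master
Once the original master has been repaired, bring it back as the standby of the current master.
Take a fresh base backup on node1: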
[postgres@node1 data]$ pwd
/opt/pgsql/data
[postgres@node1 data]$ rm -rf *
[postgres@node1 data]$ pg_basebackup -h 192.168.2.3 -U postgres -D /opt/pgsql/data/ -P
19172/19172 kB (100%), 1/1 tablespace
NOTICE: pg_stop_backup complete, all required WAL segments have been archived
[postgres@node1 data]$ ls
backup_label      base    pg_clog      pg_ident.conf  pg_notify  pg_stat_tmp  pg_tblspc    PG_VERSION  postgresql.conf
backup_label.old  global  pg_hba.conf  pg_multixact   pg_serial  pg_subtrans  pg_twophase  pg_xlog     recovery.done
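Before starting heartbeat you must delete the resource lock file; otherwise the resource will not be started along with heartbeat:
[root@node1 ~]# rm -rf /var/lib/pgsql/tmp/PGSQL.lock
{The lock file is created when a node becomes master, but it is not removed automatically when heartbeat stops abnormally or when the database or the operating system crashes. Whenever you recover a node that has acted as master, you therefore have to remove this lock file by hand.}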
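Because the lock file only exists on a node that has been master, a recovery script can check for it before deleting. A minimal sketch follows. The function name clear_pgsql_lock is hypothetical, and the default path is simply the one used in the command above; adjust it if your pgsql resource is configured with a different temporary directory.

```shell
#!/bin/sh
# clear_pgsql_lock [LOCKFILE] (hypothetical helper):
# remove a stale pgsql lock file if it exists and report what was done.
# Intended to run on the node being recovered, after the base backup
# has completed and before heartbeat is started.
clear_pgsql_lock() {
    lock="${1:-/var/lib/pgsql/tmp/PGSQL.lock}"
    if [ -e "$lock" ]; then
        rm -f "$lock" && echo "removed $lock"
    else
        echo "no lock file at $lock"
    fi
}

# Usage (as root on the recovered node):
#   clear_pgsql_lock
```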
Restart heartbeat on node1:
[root@node1 ~]# service heartbeat restart
Check the cluster status after a short while:
[root@node2 ~]# crm_mon -Afr1
============
Last updated: Mon Jan 27 08:50:43 2014
Stack: Heartbeat
Current DC: node2 (f2dcd1df-7429-42f5-82e9-b73921f97cab) - partition with quorum
Version: 1.0.12-unknown
2 Nodes configured, unknown expected votes
4 Resources configured.
============

Online: [ node1 node2 ]

Full list of resources:

 vip-slave (ocf::heartbeat:IPaddr2): Started node1
 Resource Group: master-group
     vip-master (ocf::heartbeat:IPaddr2): Started node2
     vip-rep (ocf::heartbeat:IPaddr2): Started node2
 Master/Slave Set: msPostgresql
     Masters: [ node2 ]
     Slaves: [ node1 ]
 Clone Set: clnPingCheck
     Started: [ node1 node2 ]

Node Attributes:
* Node node1:
    + default_ping_set : 100
    + master-pgsql:0 : 100
    + pgsql-data-status : STREAMING|SYNC
    + pgsql-status : HS:sync
* Node node2:
    + default_ping_set : 100
    + master-pgsql:1 : 1000
    + pgsql-data-status : LATEST
    + pgsql-master-baseline : 00000000120000B0
    + pgsql-status : PRI

Migration summary:
* Node node1:
* Node node2:
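{vip-slave has moved to node1, and node1 is now the streaming-replication standby.}
6. Administration
6.1 Starting and stopping heartbeat
[root@node1 ~]# service heartbeat start
[root@node1 ~]# service heartbeat stop
6.2 Checking the HA status
[root@node1 ~]# crm status
6.3 Checking resource status and node attributes
[root@node1 ~]# crm_mon -Afr -1
6.4 Showing the configuration
[root@node1 ~]# crm configure show
6.5 Monitoring the cluster in real time
[root@node1 ~]# crm_mon -Afr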
6.6 The crm_resource command
Start or stop a resource:
[root@node1 ~]# crm_resource -r vip-master -v started
[root@node1 ~]# crm_resource -r vip-master -v stopped
List resources:
[root@node1 ~]# crm_resource -L
 vip-slave (ocf::heartbeat:IPaddr2): Started
 Resource Group: master-group
     vip-master (ocf::heartbeat:IPaddr2): Started
     vip-rep (ocf::heartbeat:IPaddr2): Started
 Master/Slave Set: msPostgresql [pgsql]
     Masters: [ node1 ]
     Slaves: [ node2 ]
 Clone Set: clnPingCheck [pingCheck]
     Started: [ node1 node2 ]
Locate a resource:
[root@node1 ~]# crm_resource -W -r pgsql
resource pgsql is running on: node2
Migrate a resource:
[root@node1 ~]# crm_resource -M -r vip-slave -N node2
Delete a resource:
[root@node1 ~]# crm_resource -D -r vip-slave -t primitive
6.7 The crm command
List the resource agents of a given class and provider:
[root@node1 ~]# crm ra list ocf pacemaker
ClusterMon Dummy HealthCPU HealthSMART Stateful SysInfo SystemHealth controld ping pingd remote
Delete a node:
[root@node1 ~]# crm node delete node2
Put a node into standby:
[root@node1 ~]# crm node standby node2
Bring a node back online:
[root@node1 ~]# crm node online node2
Configure pacemaker:
[root@node1 ~]# crm configure
crm(live)configure# ……
……
crm(live)configure# commit
crm(live)configure# quit
6.8 Resetting the failcount
[root@node1 ~]# crm resource
crm(live)resource# failcount pgsql set node1 0
crm(live)resource# failcount pgsql show node1
scope=status name=fail-count-pgsql value=0
[root@node1 ~]# crm resource cleanup pgsql
Cleaning up pgsql:0 on node1
Waiting for 1 replies from the CRMd. OK
[root@node1 ~]# crm_failcount -G -U node1 -r pgsql
scope=status name=fail-count-pgsql value=INFINITY
[root@node1 ~]# crm_failcount -D -U node1 -r pgsql
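When scripting this, it helps to extract the value from the query output before deciding whether a reset is needed. The sketch below assumes the single-line "scope=... name=... value=..." format shown above. Both function names are made up for illustration.

```shell
#!/bin/sh
# failcount_value (hypothetical helper): read one line of
# `crm_failcount -G` output on stdin and print the part after "value=".
failcount_value() {
    sed -n 's/.*value=\([^ ]*\).*/\1/p'
}

# failcount_is_clear (hypothetical helper): succeed only when the
# fail count read from stdin is exactly 0.
failcount_is_clear() {
    [ "$(failcount_value)" = "0" ]
}

# Usage (as root on a cluster node):
#   crm_failcount -G -U node1 -r pgsql | failcount_is_clear \
#       || crm_failcount -D -U node1 -r pgsql
```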
7. Troubleshooting notes
7.1 Q1
Symptom:
The heartbeat log keeps reporting the following warnings:
Jan 24 07:47:36 node1 heartbeat: [2515]: WARN: nodename node1 uuid changed to node2
Jan 24 07:47:38 node1 heartbeat: [2515]: WARN: nodename node2 uuid changed to node1
Jan 24 07:47:38 node1 heartbeat: [2515]: WARN: nodename node1 uuid changed to node2
Jan 24 07:47:40 node1 heartbeat: [2515]: WARN: nodename node2 uuid changed to node1
Jan 24 07:47:40 node1 heartbeat: [2515]: WARN: nodename node1 uuid changed to node2
Jan 24 07:47:42 node1 heartbeat: [2515]: WARN: nodename node2 uuid changed to node1
Jan 24 07:47:42 node1 heartbeat: [2515]: WARN: nodename node1 uuid changed to node2
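Solution:
node2 was created by cloning a virtual machine, so both nodes ended up with the same hb_uuid. Delete the file on node2 and let heartbeat regenerate it:
[root@node2 ~]# rm -rf /var/lib/heartbeat/hb_uuid
[root@node2 ~]# service heartbeat restart
A new hb_uuid is generated when heartbeat restarts.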
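To confirm the diagnosis before deleting anything, you can compare the two hb_uuid files and count the warnings in the log. This is a minimal sketch: both function names are hypothetical, and it assumes you have already copied the peer's hb_uuid to a local path of your choosing (for example with scp).

```shell
#!/bin/sh
# same_hb_uuid FILE_A FILE_B (hypothetical helper):
# succeed (exit 0) when the two files are byte-for-byte identical,
# which is what a cloned VM would produce.
same_hb_uuid() {
    cmp -s "$1" "$2"
}

# uuid_flap_count (hypothetical helper): count the
# "uuid changed to" warnings in a heartbeat log read from stdin.
uuid_flap_count() {
    grep -c 'uuid changed to'
}

# Usage, with /tmp/hb_uuid.node2 being a copy fetched from node2:
#   same_hb_uuid /var/lib/heartbeat/hb_uuid /tmp/hb_uuid.node2 && echo "duplicate hb_uuid"
```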
7.2 Q2
Symptom:
Loading the configuration fails with:
[root@node1 ~]# crm configure load update pgsql.crm
ERROR: pgsql: parameter rep_mode does not exist
ERROR: pgsql: parameter node_list does not exist
ERROR: pgsql: parameter master_ip does not exist
ERROR: pgsql: parameter restore_command does not exist
ERROR: pgsql: parameter primary_conninfo_opt does not exist
WARNING: pgsql: specified timeout 60s for stop is smaller than the advised 120
WARNING: pgsql: action monitor_Master not advertised in meta-data, it may not be supported by the RA
WARNING: pgsql: specified timeout 60s for start is smaller than the advised 120
WARNING: pgsql: action notify not advertised in meta-data, it may not be supported by the RA
WARNING: pgsql: action demote not advertised in meta-data, it may not be supported by the RA
WARNING: pgsql: action promote not advertised in meta-data, it may not be supported by the RA
WARNING: pingCheck: specified timeout 60s for start is smaller than the advised 90
WARNING: pingCheck: specified timeout 60s for stop is smaller than the advised 100
Do you still want to commit?
Solution:
The installed pgsql resource agent script is too old and does not support some of the parameters set in pgsql.crm. Download a newer pgsql script and replace the installed one:
https://raw.github.com/ClusterLabs/resource-agents
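Before loading pgsql.crm you can check whether a given copy of the script knows the parameters you intend to set. The sketch below rests on one assumption: that the resource agent embeds its meta-data as XML containing `<parameter name="...">` elements. The function name ra_missing_params is made up for illustration.

```shell
#!/bin/sh
# ra_missing_params RA_SCRIPT PARAM... (hypothetical helper):
# print every PARAM that has no <parameter name="PARAM"> entry
# in the meta-data embedded in RA_SCRIPT (assumed layout).
ra_missing_params() {
    ra="$1"
    shift
    for p in "$@"; do
        grep -q "<parameter name=\"$p\"" "$ra" || echo "$p"
    done
}

# Usage against the installed script:
#   ra_missing_params /usr/lib/ocf/resource.d/heartbeat/pgsql \
#       rep_mode node_list master_ip restore_command primary_conninfo_opt
```

Empty output means all listed parameters are declared. Any name printed is one the script does not advertise.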
7.3 Q3
Symptom:
Loading the configuration fails with:
[root@node1 ~]# crm configure load update pgsql.crm
lrmadmin[15368]: 2014/01/24_09:18:44 ERROR: lrm_get_rsc_type_metadata(578): got a return code HA_FAIL from a reply message of rmetadata with function get_ret_from_msg.
ERROR: ocf:heartbeat:pgsql: could not parse meta-data:
ERROR: ocf:heartbeat:pgsql: could not parse meta-data:
ERROR: ocf:heartbeat:pgsql: no such resource agent
WARNING: pingCheck: specified timeout 60s for start is smaller than the advised 90
WARNING: pingCheck: specified timeout 60s for stop is smaller than the advised 100
Do you still want to commit?
Solution:
The pgsql script has the wrong permissions (it is not executable). Fix them with:
# chmod 755 /usr/lib/ocf/resource.d/heartbeat/pgsql
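The same fix can be scripted with a check in front, so it is safe to re-run after replacing the script. A minimal sketch; the function name ensure_executable is hypothetical.

```shell
#!/bin/sh
# ensure_executable FILE (hypothetical helper):
# report a missing file, otherwise set mode 755 if the file
# is not yet executable.
ensure_executable() {
    ra="$1"
    if [ ! -f "$ra" ]; then
        echo "missing: $ra" >&2
        return 1
    fi
    if [ ! -x "$ra" ]; then
        chmod 755 "$ra" && echo "fixed mode: $ra"
    else
        echo "ok: $ra"
    fi
}

# Usage (as root):
#   ensure_executable /usr/lib/ocf/resource.d/heartbeat/pgsql
```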
7.4 Q4
Symptom:
Starting heartbeat fails with:
[root@node1 ~]# service heartbeat start
/usr/lib/ocf/lib//heartbeat/ocf-shellfuncs: line 56: @OCF_ROOT_DIR@/lib/heartbeat/ocf-binaries: No such file or directory
Solution:
On CentOS 5.5 the @OCF_ROOT_DIR@ placeholder in the script was never replaced with the real path. Fix this by editing the script by hand.
Edit ocf-shellfuncs and change the following lines:
if [ -z "$OCF_ROOT" ]; then
# : ${OCF_ROOT=@OCF_ROOT_DIR@}
: ${OCF_ROOT=/usr/lib/ocf}
fi
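Other scripts copied from the source tree may carry the same kind of unreplaced @NAME@ placeholder. The following sketch lists such lines in a file while ignoring commented-out lines, so the edited ocf-shellfuncs above comes out clean. The function name find_placeholders is hypothetical.

```shell
#!/bin/sh
# find_placeholders FILE (hypothetical helper):
# print "LINENO:text" for every non-comment line in FILE that still
# contains an @UPPER_CASE@ style placeholder.
find_placeholders() {
    grep -n '@[A-Z_][A-Z_]*@' "$1" | grep -v '^[0-9]*:[[:space:]]*#'
}

# Usage:
#   find_placeholders /usr/lib/ocf/lib/heartbeat/ocf-shellfuncs
```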
7.5 Q5
Symptom:
Starting heartbeat fails with:
# service heartbeat start
/usr/lib/ocf/lib//heartbeat/ocf-shellfuncs: line 60: /usr/lib/ocf/lib/heartbeat/ocf-rarun: No such file or directory
Solution:
The ocf-rarun script is missing.
Download it and place it in the path named in the error message:
Download from: https://raw.github.com/ClusterLabs/resource-agents
7.6 Q6
Symptom:
Starting heartbeat fails because the daemon executables cannot be found:
[root@db1 ~]# service heartbeat start
Starting High-Availability services: Heartbeat failure [rc=6]. Failed.
heartbeat[2074]: 2014/01/23_09:06:59 info: Pacemaker support: yes
heartbeat[2074]: 2014/01/23_09:06:59 ERROR: Client child command [/usr/lib64/heartbeat/cib] is not executable
heartbeat[2074]: 2014/01/23_09:06:59 ERROR: Directive failfast hacluster /usr/lib64/heartbeat/cib failed
heartbeat[2074]: 2014/01/23_09:06:59 ERROR: Client child command [/usr/lib64/heartbeat/stonithd] is not executable
heartbeat[2074]: 2014/01/23_09:06:59 ERROR: Directive respawn root /usr/lib64/heartbeat/stonithd failed
heartbeat[2074]: 2014/01/23_09:06:59 ERROR: Client child command [/usr/lib64/heartbeat/attrd] is not executable
heartbeat[2074]: 2014/01/23_09:06:59 ERROR: Directive respawn hacluster /usr/lib64/heartbeat/attrd failed
heartbeat[2074]: 2014/01/23_09:06:59 ERROR: Client child command [/usr/lib64/heartbeat/crmd] is not executable
heartbeat[2074]: 2014/01/23_09:06:59 ERROR: Directive failfast hacluster /usr/lib64/heartbeat/crmd failed
heartbeat[2074]: 2014/01/23_09:06:59 ERROR: Heartbeat not started: configuration error.
heartbeat[2074]: 2014/01/23_09:06:59 ERROR: Configuration error, heartbeat not started.
Solution:
ln -s /usr/libexec/pacemaker/cib /usr/lib64/heartbeat/cib
ln -s /usr/libexec/pacemaker/stonithd /usr/lib64/heartbeat/stonithd
ln -s /usr/libexec/pacemaker/attrd /usr/lib64/heartbeat/attrd
ln -s /usr/libexec/pacemaker/crmd /usr/lib64/heartbeat/crmd
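The four links can also be created in a loop that skips daemons missing from the source directory and leaves existing entries alone. A minimal sketch; the function name link_pacemaker_daemons is hypothetical, and the default directories are the ones from the commands above.

```shell
#!/bin/sh
# link_pacemaker_daemons [SRC_DIR] [DST_DIR] (hypothetical helper):
# symlink the Pacemaker daemons that heartbeat expects from SRC_DIR
# into DST_DIR, without overwriting anything that is already there.
link_pacemaker_daemons() {
    src="${1:-/usr/libexec/pacemaker}"
    dst="${2:-/usr/lib64/heartbeat}"
    for d in cib stonithd attrd crmd; do
        if [ ! -x "$src/$d" ]; then
            echo "skip $d: $src/$d not found or not executable" >&2
            continue
        fi
        if [ -e "$dst/$d" ] || [ -L "$dst/$d" ]; then
            echo "keep $dst/$d (already exists)"
        else
            ln -s "$src/$d" "$dst/$d" && echo "linked $dst/$d"
        fi
    done
}

# Usage (as root):
#   link_pacemaker_daemons
```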
7.7 Q7
Symptom:
Starting heartbeat fails with:
Jan 23 09:10:15 db1 heartbeat: [2129]: info: Heartbeat generation: 1390439416
Jan 23 09:10:15 db1 heartbeat: [2129]: info: No uuid found for current node - generating a new uuid.
Jan 23 09:10:15 db1 heartbeat: [2129]: info: Creating FIFO /var/lib/heartbeat/fifo.
Jan 23 09:10:15 db1 heartbeat: [2129]: info: glib: ucast: write socket priority set to IPTOS_LOWDELAY on eth1
Jan 23 09:10:15 db1 heartbeat: [2129]: info: glib: ucast: bound send socket to device: eth1
Jan 23 09:10:15 db1 heartbeat: [2129]: ERROR: glib: ucast: error setting option SO_REUSEPORT(w): Protocol not available
Jan 23 09:10:15 db1 heartbeat: [2129]: ERROR: make_io_childpair: cannot open ucast eth1
Jan 23 09:10:16 db1 heartbeat: [2132]: CRIT: Emergency Shutdown: Master Control process died.
Jan 23 09:10:16 db1 heartbeat: [2132]: CRIT: Killing pid 2129 with SIGTERM
Jan 23 09:10:16 db1 heartbeat: [2132]: CRIT: Emergency Shutdown(MCP dead): Killing ourselves.
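Solution:
1. Upgrade the kernel: as the SO_REUSEPORT error above shows, the running kernel does not support the socket option that the ucast transport sets; or
2. Switch to a different heartbeat transport, such as mcast or bcast.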
7.8 Q8
Symptom:
With the bcast heartbeat transport, heartbeat fails with:
Jan 24 01:30:20 db2 heartbeat: [29856]: ERROR: glib: Error binding socket (Address already in use). Retrying.
Jan 24 01:30:21 db2 heartbeat: [29856]: ERROR: glib: Error binding socket (Address already in use). Retrying.
Jan 24 01:30:22 db2 heartbeat: [29856]: ERROR: glib: Error binding socket (Address already in use). Retrying.
Jan 24 01:30:23 db2 heartbeat: [29856]: ERROR: glib: Error binding socket (Address already in use). Retrying.
Jan 24 01:30:24 db2 heartbeat: [29856]: ERROR: glib: Unable to bind socket (Address already in use). Giving up.
Jan 24 01:30:24 db2 heartbeat: [29856]: info: glib: UDP Broadcast heartbeat closed on port 694 interface eth1 - Status: 1
Jan 24 01:30:24 db2 heartbeat: [29856]: ERROR: make_io_childpair: cannot open bcast eth1
Jan 24 01:30:25 db2 heartbeat: [29859]: CRIT: Emergency Shutdown: Master Control process died.
Jan 24 01:30:25 db2 heartbeat: [29859]: CRIT: Killing pid 29856 with SIGTERM
Jan 24 01:30:25 db2 heartbeat: [29859]: CRIT: Emergency Shutdown(MCP dead): Killing ourselves.
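Solution:
UDP port 694 is already in use. Check which process holds it:
[root@db1 ~]# netstat -nlp | grep 694
udp 0 0 0.0.0.0:694 0.0.0.0:* 1367/rpcbind
udp 0 0 :::694 :::* 1367/rpcbind
Here rpcbind has taken the port. Use a different UDP port, for example by setting udpport 692 in ha.cf.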
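A plain grep for 694 also matches ports such as 6940, and process IDs that contain 694. The sketch below compares the port number exactly and prints only the owning process. It assumes the `netstat -nlp` column layout shown above, and the function name udp_port_owner is hypothetical.

```shell
#!/bin/sh
# udp_port_owner PORT (hypothetical helper):
# read `netstat -nlp` output on stdin and print the PID/program of
# every UDP socket whose local port is exactly PORT.
udp_port_owner() {
    awk -v port="$1" '
        $1 ~ /^udp/ {
            # the local address is field 4; the port follows the last ":"
            n = split($4, a, ":")
            if (a[n] == port) print $NF
        }
    ' | sort -u
}

# Usage (as root, so that netstat can show program names):
#   netstat -nlp | udp_port_owner 694
```

Running the same check for the replacement port (692 in the example above) before editing ha.cf shows whether that port is free.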
8. References
The pgsql resource agent script:
https://github.com/ClusterLabs/resource-agents/blob/master/heartbeat/pgsql
Script usage notes:
The crm_resource command:
http://www.novell.com/zh-cn/documentation/sle_ha/book_sleha/data/man_crmresource.html
The crm_failcount command:
http://www.novell.com/zh-cn/documentation/sle_ha/book_sleha/data/man_crmfailcount.html