MHA實施參考

MHA(Master High Availability)目前在MySQL高可用方面是一個相對成熟的解決方案,它由日本DeNA公司youshimaton(現就職於Facebook公司)開發,是一套優秀的作爲MySQL高可用性環境下故障切換和主從提升的高可用軟件。在MySQL故障切換過程中,MHA能做到在0~30秒之內自動完成數據庫的故障切換操作,並且在進行故障切換的過程中,MHA能在最大程度上保證數據的一致性,以達到真正意義上的高可用。

該軟件由兩部分組成:MHA Manager(管理節點)和MHA Node(數據節點)。MHA Manager可以單獨部署在一臺獨立的機器上管理多個master-slave集羣,也可以部署在一臺slave節點上。MHA Node運行在每臺MySQL服務器上,MHA Manager會定時探測集羣中的master節點,當master出現故障時,它可以自動將最新數據的slave提升爲新的master,然後將所有其他的slave重新指向新的master。整個故障轉移過程對應用程序完全透明。

在MHA自動故障切換過程中,MHA試圖從宕機的主服務器上保存二進制日誌,最大程度的保證數據的不丟失,但這並不總是可行的。例如,如果主服務器硬件故障或無法通過ssh訪問,MHA沒法保存二進制日誌,只進行故障轉移而丟失了最新的數據。使用MySQL 5.5的半同步複製,可以大大降低數據丟失的風險。MHA可以與半同步複製結合起來。如果只有一個slave已經收到了最新的二進制日誌,MHA可以將最新的二進制日誌應用於其他所有的slave服務器上,因此可以保證所有節點的數據一致性。

目前MHA主要支持一主多從的架構,要搭建MHA,要求一個複製集羣中必須最少有三臺數據庫服務器,一主二從,即一臺充當master,一臺充當備用master,另外一臺充當從庫,因爲至少需要三臺服務器,出於機器成本的考慮,淘寶也在該基礎上進行了改造,目前淘寶TMHA已經支持一主一從。

我們自己使用其實也可以使用1主1從,但是master主機宕機後無法切換,以及無法補全binlog。master的mysqld進程crash後,還是可以切換成功,以及補全binlog的。


圖01展示瞭如何通過MHA Manager管理多組主從複製。可以將MHA工作原理總結爲如下:

 

                                 ( 圖01 )

(1)從宕機崩潰的master保存二進制日誌事件(binlog events);

(2)識別含有最新更新的slave;

(3)應用差異的中繼日誌(relay log)到其他的slave;

(4)應用從master保存的二進制日誌事件(binlog events);

(5)提升一個slave爲新的master;

(6)使其他的slave連接新的master進行復制;

MHA軟件由兩部分組成,Manager工具包和Node工具包,具體的說明如下。

Manager工具包主要包括以下幾個工具:

複製代碼
masterha_check_ssh              檢查MHA的SSH配置狀況
masterha_check_repl             檢查MySQL複製狀況
masterha_manger                 啓動MHA
masterha_check_status           檢測當前MHA運行狀態
masterha_master_monitor         檢測master是否宕機
masterha_master_switch          控制故障轉移(自動或者手動)
masterha_conf_host              添加或刪除配置的server信息
複製代碼

Node工具包(這些工具通常由MHA Manager的腳本觸發,無需人爲操作)主要包括以下幾個工具:

save_binary_logs                保存和複製master的二進制日誌
apply_diff_relay_logs           識別差異的中繼日誌事件並將其差異的事件應用於其他的slave
filter_mysqlbinlog              去除不必要的ROLLBACK事件(MHA已不再使用這個工具)
purge_relay_logs                清除中繼日誌(不會阻塞SQL線程)

注意:

爲了儘可能的減少主庫硬件損壞宕機造成的數據丟失,因此在配置MHA的同時建議配置成MySQL 5.5的半同步複製。關於半同步複製原理各位自己進行查閱。(不是必須)

1.部署MHA

接下來部署MHA,具體的搭建環境如下(所有操作系統均爲centos 6.2 64bit,不是必須,server03和server04是server02的從,複製環境搭建後面會簡單演示,但是相關的安全複製不會詳細說明,需要的童鞋請參考前面的文章,MySQL Replication需要注意的問題):

角色                    ip地址          主機名          server_id                  類型
Monitor host            192.168.0.20    server01            -                      監控複製組
Master                  192.168.0.50    server02            1                      寫入
Candicate master        192.168.0.60    server03            2                      讀
Slave                   192.168.0.70    server04            3

其中master對外提供寫服務,備選master(實際的slave,主機名server03)提供讀服務,slave也提供相關的讀服務,一旦master宕機,將會把備選master提升爲新的master,slave指向新的master

(1)在所有節點安裝MHA node所需的perl模塊(DBD:mysql),安裝腳本如下:

複製代碼
[root@192.168.0.50 ~]# cat install.sh 
#!/bin/bash
wget http://xrl.us/cpanm --no-check-certificate
mv cpanm /usr/bin
chmod 755 /usr/bin/cpanm
cat > /root/list << EOF
install DBD::mysql
EOF
for package in `cat /root/list`
do
    cpanm $package
done
[root@192.168.0.50 ~]# 
複製代碼

如果有安裝epel源,也可以使用yum安裝

yum install perl-DBD-MySQL -y

(2)在所有的節點安裝mha node:

wget http://mysql-master-ha.googlecode.com/files/mha4mysql-node-0.53.tar.gz
tar xf mha4mysql-node-0.53.tar.gz
cd mha4mysql-node-0.53
perl Makefile.PL
make && make install

安裝完成後會在/usr/local/bin目錄下生成以下腳本文件:

複製代碼
[root@192.168.0.50 bin]# pwd
/usr/local/bin
[root@192.168.0.50 bin]# ll
total 40
-r-xr-xr-x 1 root root 15498 Apr 20 10:05 apply_diff_relay_logs
-r-xr-xr-x 1 root root  4807 Apr 20 10:05 filter_mysqlbinlog
-r-xr-xr-x 1 root root  7401 Apr 20 10:05 purge_relay_logs
-r-xr-xr-x 1 root root  7263 Apr 20 10:05 save_binary_logs
[root@192.168.0.50 bin]# 
複製代碼

關於上面腳本的功能,上面已經介紹過了,這裏不再重複了。

2.安裝MHA Manager

MHA Manager中主要包括了幾個管理員的命令行工具,例如master_manger,master_master_switch等。MHA Manger也依賴於perl模塊,具體如下:

(1)安裝MHA Node軟件包之前需要安裝依賴。我這裏使用yum完成,沒有epel源的可以使用上面提到的腳本(epel源安裝也簡單)。注意:在MHA Manager的主機也是需要安裝MHA Node。

rpm -ivh http://dl.fedoraproject.org/pub/epel/6/x86_64/epel-release-6-8.noarch.rpm
yum install perl-DBD-MySQL -y

安裝MHA Node軟件包,和上面的方法一樣,如下:

wget http://mysql-master-ha.googlecode.com/files/mha4mysql-node-0.53.tar.gz
tar xf mha4mysql-node-0.53.tar.gz
cd mha4mysql-node-0.53
perl Makefile.PL
make && make install

(2)安裝MHA Manager。首先安裝MHA Manger依賴的perl模塊(我這裏使用yum安裝):

yum install perl-DBD-MySQL perl-Config-Tiny perl-Log-Dispatch perl-Parallel-ForkManager perl-Time-HiRes -y

安裝MHA Manager軟件包:

wget http://mysql-master-ha.googlecode.com/files/mha4mysql-manager-0.53.tar.gz
tar xf mha4mysql-manager-0.53.tar.gz 
cd mha4mysql-manager-0.53
perl Makefile.PL
make && make install

安裝完成後會在/usr/local/bin目錄下面生成以下腳本文件,前面已經說過這些腳本的作用,這裏不再重複

複製代碼
[root@192.168.0.20 bin]# pwd
/usr/local/bin
[root@192.168.0.20 bin]# ll
total 76
-r-xr-xr-x 1 root root 15498 Apr 20 10:58 apply_diff_relay_logs
-r-xr-xr-x 1 root root  4807 Apr 20 10:58 filter_mysqlbinlog
-r-xr-xr-x 1 root root  1995 Apr 20 11:33 masterha_check_repl
-r-xr-xr-x 1 root root  1779 Apr 20 11:33 masterha_check_ssh
-r-xr-xr-x 1 root root  1865 Apr 20 11:33 masterha_check_status
-r-xr-xr-x 1 root root  3201 Apr 20 11:33 masterha_conf_host
-r-xr-xr-x 1 root root  2517 Apr 20 11:33 masterha_manager
-r-xr-xr-x 1 root root  2165 Apr 20 11:33 masterha_master_monitor
-r-xr-xr-x 1 root root  2373 Apr 20 11:33 masterha_master_switch
-r-xr-xr-x 1 root root  3749 Apr 20 11:33 masterha_secondary_check
-r-xr-xr-x 1 root root  1739 Apr 20 11:33 masterha_stop
-r-xr-xr-x 1 root root  7401 Apr 20 10:58 purge_relay_logs
-r-xr-xr-x 1 root root  7263 Apr 20 10:58 save_binary_logs
[root@192.168.0.20 bin]# 
複製代碼

複製相關腳本到/usr/local/bin目錄(軟件包解壓縮後就有了,不是必須,因爲這些腳本不完整,需要自己修改,這是軟件開發着留給我們自己發揮的,如果開啓下面的任何一個腳本對應的參數,而對應這裏的腳本又沒有修改,則會拋錯,自己被坑的很慘)

複製代碼
[root@192.168.0.20 scripts]# pwd
/root/mha4mysql-manager-0.53/samples/scripts
[root@192.168.0.20 scripts]# ll
total 32
-rwxr-xr-x 1 root root  3443 Jan  8  2012 master_ip_failover                #自動切換時vip管理的腳本,不是必須,如果我們使用keepalived的,我們可以自己編寫腳本完成對vip的管理,比如監控mysql,如果mysql異常,我們停止keepalived就行,這樣vip就會自動漂移
-rwxr-xr-x 1 root root  9186 Jan  8  2012 master_ip_online_change           #在線切換時vip的管理,不是必須,同樣可以可以自行編寫簡單的shell完成
-rwxr-xr-x 1 root root 11867 Jan  8  2012 power_manager                     #故障發生後關閉主機的腳本,不是必須
-rwxr-xr-x 1 root root  1360 Jan  8  2012 send_report                       #因故障切換後發送報警的腳本,不是必須,可自行編寫簡單的shell完成。
[root@192.168.0.20 scripts]# cp * /usr/local/bin/
[root@192.168.0.20 scripts]# 
複製代碼

3.配置SSH登錄無密碼驗證(使用key登錄,工作中常用)我的測試環境已經是使用key登錄,服務器之間無需密碼驗證的。關於配置使用key登錄,我想我不再重複。但是有一點需要注意:不能禁止 password 登陸,否則會出現錯誤

4.搭建主從複製環境

注意:binlog-do-db 和 replicate-ignore-db 設置必須相同。 MHA 在啓動時候會檢測過濾規則,如果過濾規則不同,MHA 不啓動監控和故障轉移。

(1)在server02上執行備份(192.168.0.50)

[root@192.168.0.50 ~]# mysqldump --master-data=2 --single-transaction -R --triggers -A > all.sql

其中--master-data=2代表備份時刻記錄master的Binlog位置和Position,--single-transaction意思是獲取一致性快照,-R意思是備份存儲過程和函數,--triggres的意思是備份觸發器,-A代表備份所有的庫。更多信息請自行mysqldump --help查看。

(2)在server02上創建複製用戶:

複製代碼
mysql> grant replication slave on *.* to 'repl'@'192.168.0.%' identified by '123456';
Query OK, 0 rows affected (0.00 sec)

mysql> flush privileges;
Query OK, 0 rows affected (0.00 sec)

mysql> 
複製代碼

(3)查看主庫備份時的binlog名稱和位置,MASTER_LOG_FILE和MASTER_LOG_POS:

[root@192.168.0.50 ~]# head -n 30 all.sql | grep 'CHANGE MASTER TO'
-- CHANGE MASTER TO MASTER_LOG_FILE='mysql-bin.000010', MASTER_LOG_POS=112;
[root@192.168.0.50 ~]# 

(4)把備份複製到server03和server04,也就是192.168.0.60和192.168.0.70

scp all.sql server03:/data/
scp all.sql server04:/data/

(5)導入備份到server03,執行復制相關命令

mysql < /data/all.sql
複製代碼
mysql> CHANGE MASTER TO MASTER_HOST='192.168.0.50',MASTER_USER='repl', MASTER_PASSWORD='123456',MASTER_LOG_FILE='mysql-bin.000010',MASTER_LOG_POS=112;
Query OK, 0 rows affected (0.02 sec)

mysql> start slave;
Query OK, 0 rows affected (0.01 sec)

mysql> 
複製代碼

查看複製狀態(可以看見覆製成功):

[root@192.168.0.60 ~]# mysql -e 'show slave status\G' | egrep 'Slave_IO|Slave_SQL'
               Slave_IO_State: Waiting for master to send event
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
[root@192.168.0.60 ~]# 

(6)在server04(192.168.0.70)上搭建複製環境,操作和上面一樣。

mysql < /data/all.sql
複製代碼
mysql> CHANGE MASTER TO MASTER_HOST='192.168.0.50',MASTER_USER='repl', MASTER_PASSWORD='123456',MASTER_LOG_FILE='mysql-bin.000010',MASTER_LOG_POS=112;
Query OK, 0 rows affected (0.07 sec)

mysql> start slave;
Query OK, 0 rows affected (0.00 sec)

mysql> 
複製代碼

查看複製狀態:

[root@192.168.0.70 ~]# mysql -e 'show slave status\G' | egrep 'Slave_IO|Slave_SQL'
               Slave_IO_State: Waiting for master to send event
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
[root@192.168.0.70 ~]# 

(7)兩臺slave服務器設置read_only(從庫對外提供讀服務,只所以沒有寫進配置文件,是因爲隨時slave會提升爲master

[root@192.168.0.60 ~]# mysql -e 'set global read_only=1'
[root@192.168.0.60 ~]# 
[root@192.168.0.70 ~]# mysql -e 'set global read_only=1'
[root@192.168.0.70 ~]# 

(8)創建監控用戶(在master上執行,也就是192.168.0.50):

複製代碼
mysql> grant all privileges on *.* to 'root'@'192.168.0.%' identified  by '123456';
Query OK, 0 rows affected (0.00 sec)

mysql> flush  privileges;
Query OK, 0 rows affected (0.01 sec)

mysql> 
複製代碼

到這裏整個集羣環境已經搭建完畢,剩下的就是配置MHA軟件了。

5.配置MHA

(1)創建MHA的工作目錄,並且創建相關配置文件(在軟件包解壓後的目錄裏面有樣例配置文件)。

[root@192.168.0.20 ~]# mkdir -p /etc/masterha
[root@192.168.0.20 ~]# cp mha4mysql-manager-0.53/samples/conf/app1.cnf /etc/masterha/
[root@192.168.0.20 ~]# 

修改app1.cnf配置文件,修改後的文件內容如下(注意,配置文件中的註釋需要去掉,我這裏是爲了解釋清楚):

複製代碼
[root@192.168.0.20 ~]# cat /etc/masterha/app1.cnf 
[server default]
manager_workdir=/var/log/masterha/app1.log              //設置manager的工作目錄
manager_log=/var/log/masterha/app1/manager.log          //設置manager的日誌
master_binlog_dir=/data/mysql                         //設置master 保存binlog的位置,以便MHA可以找到master的日誌,我這裏的也就是mysql的數據目錄
master_ip_failover_script= /usr/local/bin/master_ip_failover    //設置自動failover時候的切換腳本
master_ip_online_change_script= /usr/local/bin/master_ip_online_change  //設置手動切換時候的切換腳本
password=123456         //設置mysql中root用戶的密碼,這個密碼是前文中創建監控用戶的那個密碼
user=root               設置監控用戶root
ping_interval=1         //設置監控主庫,發送ping包的時間間隔,默認是3秒,嘗試三次沒有迴應的時候自動進行railover
remote_workdir=/tmp     //設置遠端mysql在發生切換時binlog的保存位置
repl_password=123456    //設置複製用戶的密碼
repl_user=repl          //設置複製環境中的複製用戶名
report_script=/usr/local/send_report    //設置發生切換後發送的報警的腳本
secondary_check_script= /usr/local/bin/masterha_secondary_check -s server03 -s server02            
shutdown_script="" //設置故障發生後關閉故障主機腳本(該腳本的主要作用是關閉主機放在發生腦裂,這裏沒有使用) ssh_user=root //設置ssh的登錄用戶名 [server1] hostname=192.168.0.50 port=3306 [server2] hostname=192.168.0.60 port=3306 candidate_master=1 //設置爲候選master,如果設置該參數以後,發生主從切換以後將會將此從庫提升爲主庫,即使這個主庫不是集羣中事件最新的slave check_repl_delay=0 //默認情況下如果一個slave落後master 100M的relay logs的話,MHA將不會選擇該slave作爲一個新的master,因爲對於這個slave的恢復需要花費很長時間,通過設置check_repl_delay=0,MHA觸發切換在選擇一個新的master的時候將會忽略複製延時,這個參數對於設置了candidate_master=1的主機非常有用,因爲這個候選主在切換的過程中一定是新的master [server3] hostname=192.168.0.70 port=3306 [root@192.168.0.20 ~]#
複製代碼

(2)設置relay log的清除方式(在每個slave節點上):

[root@192.168.0.60 ~]# mysql -e 'set global relay_log_purge=0'
[root@192.168.0.70 ~]# mysql -e 'set global relay_log_purge=0'

注意:

MHA在發生切換的過程中,從庫的恢復過程中依賴於relay log的相關信息,所以這裏要將relay log的自動清除設置爲OFF,採用手動清除relay log的方式。在默認情況下,從服務器上的中繼日誌會在SQL線程執行完畢後被自動刪除。但是在MHA環境中,這些中繼日誌在恢復其他從服務器時可能會被用到,因此需要禁用中繼日誌的自動刪除功能。定期清除中繼日誌需要考慮到複製延時的問題。在ext3的文件系統下,刪除大的文件需要一定的時間,會導致嚴重的複製延時。爲了避免複製延時,需要暫時爲中繼日誌創建硬鏈接,因爲在linux系統中通過硬鏈接刪除大文件速度會很快。(在mysql數據庫中,刪除大表時,通常也採用建立硬鏈接的方式)

MHA節點中包含了pure_relay_logs命令工具,它可以爲中繼日誌創建硬鏈接,執行SET GLOBAL relay_log_purge=1,等待幾秒鐘以便SQL線程切換到新的中繼日誌,再執行SET GLOBAL relay_log_purge=0。

pure_relay_logs腳本參數如下所示:

複製代碼
--user mysql                      用戶名
--password mysql                  密碼
--port                            端口號
--workdir                         指定創建relay log的硬鏈接的位置,默認是/var/tmp,由於系統不同分區創建硬鏈接文件會失敗,故需要執行硬鏈接具體位置,成功執行腳本後,硬鏈接的中繼日誌文件被刪除
--disable_relay_log_purge         默認情況下,如果relay_log_purge=1,腳本會什麼都不清理,自動退出,通過設定這個參數,當relay_log_purge=1的情況下會將relay_log_purge設置爲0。清理relay log之後,最後將參數設置爲OFF。
複製代碼

(3)設置定期清理relay腳本(兩臺slave服務器)

複製代碼
[root@192.168.0.60 ~]# cat purge_relay_log.sh 
#!/bin/bash
user=root
passwd=123456
port=3306
log_dir='/data/masterha/log'
work_dir='/data'
purge='/usr/local/bin/purge_relay_logs'

if [ ! -d $log_dir ]
then
   mkdir $log_dir -p
fi

$purge --user=$user --password=$passwd --disable_relay_log_purge --port=$port --workdir=$work_dir >> $log_dir/purge_relay_logs.log 2>&1
[root@192.168.0.60 ~]# 
複製代碼

添加到crontab定期執行

[root@192.168.0.60 ~]# crontab -l
0 4 * * * /bin/bash /root/purge_relay_log.sh
[root@192.168.0.60 ~]# 

purge_relay_logs腳本刪除中繼日誌不會阻塞SQL線程。下面我們手動執行看看什麼情況。

複製代碼
[root@192.168.0.60 ~]# purge_relay_logs --user=root --password=123456 --port=3306 -disable_relay_log_purge --workdir=/data/
2014-04-20 15:47:24: purge_relay_logs script started.
 Found relay_log.info: /data/mysql/relay-log.info
 Removing hard linked relay log files server03-relay-bin* under /data/.. done.
 Current relay log file: /data/mysql/server03-relay-bin.000002
 Archiving unused relay log files (up to /data/mysql/server03-relay-bin.000001) ...
 Creating hard link for /data/mysql/server03-relay-bin.000001 under /data//server03-relay-bin.000001 .. ok.
 Creating hard links for unused relay log files completed.
 Executing SET GLOBAL relay_log_purge=1; FLUSH LOGS; sleeping a few seconds so that SQL thread can delete older relay log files (if it keeps up); SET GLOBAL relay_log_purge=0; .. ok.
 Removing hard linked relay log files server03-relay-bin* under /data/.. done.
2014-04-20 15:47:27: All relay log purging operations succeeded.
[root@192.168.0.60 ~]# 
複製代碼

6.檢查SSH配置

檢查MHA Manger到所有MHA Node的SSH連接狀態:

複製代碼
[root@192.168.0.20 ~]# masterha_check_ssh --conf=/etc/masterha/app1.cnf 
Sun Apr 20 17:17:39 2014 - [warning] Global configuration file /etc/masterha_default.cnf not found. Skipping.
Sun Apr 20 17:17:39 2014 - [info] Reading application default configurations from /etc/masterha/app1.cnf..
Sun Apr 20 17:17:39 2014 - [info] Reading server configurations from /etc/masterha/app1.cnf..
Sun Apr 20 17:17:39 2014 - [info] Starting SSH connection tests..
Sun Apr 20 17:17:40 2014 - [debug] 
Sun Apr 20 17:17:39 2014 - [debug]  Connecting via SSH from root@192.168.0.50(192.168.0.50:22) to root@192.168.0.60(192.168.0.60:22)..
Sun Apr 20 17:17:39 2014 - [debug]   ok.
Sun Apr 20 17:17:39 2014 - [debug]  Connecting via SSH from root@192.168.0.50(192.168.0.50:22) to root@192.168.0.70(192.168.0.70:22)..
Sun Apr 20 17:17:39 2014 - [debug]   ok.
Sun Apr 20 17:17:40 2014 - [debug] 
Sun Apr 20 17:17:40 2014 - [debug]  Connecting via SSH from root@192.168.0.60(192.168.0.60:22) to root@192.168.0.50(192.168.0.50:22)..
Sun Apr 20 17:17:40 2014 - [debug]   ok.
Sun Apr 20 17:17:40 2014 - [debug]  Connecting via SSH from root@192.168.0.60(192.168.0.60:22) to root@192.168.0.70(192.168.0.70:22)..
Sun Apr 20 17:17:40 2014 - [debug]   ok.
Sun Apr 20 17:17:41 2014 - [debug] 
Sun Apr 20 17:17:40 2014 - [debug]  Connecting via SSH from root@192.168.0.70(192.168.0.70:22) to root@192.168.0.50(192.168.0.50:22)..
Sun Apr 20 17:17:40 2014 - [debug]   ok.
Sun Apr 20 17:17:40 2014 - [debug]  Connecting via SSH from root@192.168.0.70(192.168.0.70:22) to root@192.168.0.60(192.168.0.60:22)..
Sun Apr 20 17:17:41 2014 - [debug]   ok.
Sun Apr 20 17:17:41 2014 - [info] All SSH connection tests passed successfully.
複製代碼

可以看見各個節點ssh驗證都是ok的。

7.檢查整個複製環境狀況。

通過masterha_check_repl腳本查看整個集羣的狀態

複製代碼
[root@192.168.0.20 ~]# masterha_check_repl --conf=/etc/masterha/app1.cnf
Sun Apr 20 18:36:55 2014 - [info] Checking replication health on 192.168.0.60..
Sun Apr 20 18:36:55 2014 - [info]  ok.
Sun Apr 20 18:36:55 2014 - [info] Checking replication health on 192.168.0.70..
Sun Apr 20 18:36:55 2014 - [info]  ok.
Sun Apr 20 18:36:55 2014 - [info] Checking master_ip_failover_script status:
Sun Apr 20 18:36:55 2014 - [info]   /usr/local/bin/master_ip_failover --command=status --ssh_user=root --orig_master_host=192.168.0.50 --orig_master_ip=192.168.0.50 --orig_master_port=3306 
Bareword "FIXME_xxx" not allowed while "strict subs" in use at /usr/local/bin/master_ip_failover line 88.
Execution of /usr/local/bin/master_ip_failover aborted due to compilation errors.
Sun Apr 20 18:36:55 2014 - [error][/usr/local/share/perl5/MHA/MasterMonitor.pm, ln214]  Failed to get master_ip_failover_script status with return code 255:0.
Sun Apr 20 18:36:55 2014 - [error][/usr/local/share/perl5/MHA/MasterMonitor.pm, ln383] Error happend on checking configurations.  at /usr/local/bin/masterha_check_repl line 48
Sun Apr 20 18:36:55 2014 - [error][/usr/local/share/perl5/MHA/MasterMonitor.pm, ln478] Error happened on monitoring servers.
Sun Apr 20 18:36:55 2014 - [info] Got exit code 1 (Not master dead).

MySQL Replication Health is NOT OK!
複製代碼

發現最後的結論說我的複製不是ok的。但是上面的信息明明說是正常的,自己也進數據庫查看了。這裏一直踩坑。一直糾結,後來無意中發現火丁筆記的博客,這才知道了原因,原來Failover兩種方式:一種是虛擬IP地址,一種是全局配置文件。MHA並沒有限定使用哪一種方式,而是讓用戶自己選擇,虛擬IP地址的方式會牽扯到其它的軟件,比如keepalive軟件,而且還要修改腳本master_ip_failover。(最後修改腳本後纔沒有這個報錯,自己不懂perl也是折騰的半死,去年買了塊表)

如果發現如下錯誤:

Can't exec "mysqlbinlog": No such file or directory at /usr/local/share/perl5/MHA/BinlogManager.pm line 99.
mysqlbinlog version not found!
Testing mysql connection and privileges..sh: mysql: command not found

解決方法如下,添加軟連接(所有節點)

ln -s /usr/local/mysql/bin/mysqlbinlog /usr/local/bin/mysqlbinlog
ln -s /usr/local/mysql/bin/mysql /usr/local/bin/mysql

所以先暫時註釋master_ip_failover_script= /usr/local/bin/master_ip_failover這個選項。後面引入keepalived後和修改該腳本以後再開啓該選項。

[root@192.168.0.20 ~]# grep master_ip_failover /etc/masterha/app1.cnf
#master_ip_failover_script= /usr/local/bin/master_ip_failover
[root@192.168.0.20 ~]# 

再次進行狀態查看:

複製代碼
Sun Apr 20 18:46:08 2014 - [info] Checking replication health on 192.168.0.60..
Sun Apr 20 18:46:08 2014 - [info]  ok.
Sun Apr 20 18:46:08 2014 - [info] Checking replication health on 192.168.0.70..
Sun Apr 20 18:46:08 2014 - [info]  ok.
Sun Apr 20 18:46:08 2014 - [warning] master_ip_failover_script is not defined.
Sun Apr 20 18:46:08 2014 - [warning] shutdown_script is not defined.
Sun Apr 20 18:46:08 2014 - [info] Got exit code 0 (Not master dead).

MySQL Replication Health is OK.
複製代碼

已經沒有明顯報錯,只有兩個警告而已,複製也顯示正常了。
8.檢查MHA Manager的狀態:

通過master_check_status腳本查看Manager的狀態:

[root@192.168.0.20 ~]# masterha_check_status --conf=/etc/masterha/app1.cnf
app1 is stopped(2:NOT_RUNNING).
[root@192.168.0.20 ~]# 

注意:如果正常,會顯示"PING_OK",否則會顯示"NOT_RUNNING",這代表MHA監控沒有開啓。
9.開啓MHA Manager監控

[root@192.168.0.20 ~]# nohup masterha_manager --conf=/etc/masterha/app1.cnf --remove_dead_master_conf --ignore_last_failover < /dev/null > /var/log/masterha/app1/manager.log 2>&1 &  
[1] 30867
[root@192.168.0.20 ~]# 

啓動參數介紹:

--remove_dead_master_conf      該參數代表當發生主從切換後,老的主庫的ip將會從配置文件中移除。

--manger_log                            日誌存放位置

--ignore_last_failover                 在缺省情況下,如果MHA檢測到連續發生宕機,且兩次宕機間隔不足8小時的話,則不會進行Failover,之所以這樣限制是爲了避免ping-pong效應。該參數代表忽略上次MHA觸發切換產生的文件,默認情況下,MHA發生切換後會在日誌目錄,也就是上面我設置的/data產生app1.failover.complete文件,下次再次切換的時候如果發現該目錄下存在該文件將不允許觸發切換,除非在第一次切換後收到刪除該文件,爲了方便,這裏設置爲--ignore_last_failover。

查看MHA Manager監控是否正常:

[root@192.168.0.20 ~]# masterha_check_status --conf=/etc/masterha/app1.cnf
app1 (pid:20386) is running(0:PING_OK), master:192.168.0.50
[root@192.168.0.20 ~]# 

可以看見已經在監控了,而且master的主機爲192.168.0.50

10.查看啓動日誌

複製代碼
[root@192.168.0.20 ~]# tail -n20 /var/log/masterha/app1/manager.log
Sun Apr 20 19:12:01 2014 - [info]   Connecting to root@192.168.0.70(192.168.0.70:22).. 
  Checking slave recovery environment settings..
    Opening /data/mysql/relay-log.info ... ok.
    Relay log found at /data/mysql, up to server04-relay-bin.000002
    Temporary relay log file is /data/mysql/server04-relay-bin.000002
    Testing mysql connection and privileges.. done.
    Testing mysqlbinlog output.. done.
    Cleaning up test file(s).. done.
Sun Apr 20 19:12:01 2014 - [info] Slaves settings check done.
Sun Apr 20 19:12:01 2014 - [info] 
192.168.0.50 (current master)
 +--192.168.0.60
 +--192.168.0.70

Sun Apr 20 19:12:01 2014 - [warning] master_ip_failover_script is not defined.
Sun Apr 20 19:12:01 2014 - [warning] shutdown_script is not defined.
Sun Apr 20 19:12:01 2014 - [info] Set master ping interval 1 seconds.
Sun Apr 20 19:12:01 2014 - [info] Set secondary check script: /usr/local/bin/masterha_secondary_check -s server03 -s server02 --user=root --master_host=server02 --master_ip=192.168.0.50 --master_port=3306
Sun Apr 20 19:12:01 2014 - [info] Starting ping health check on 192.168.0.50(192.168.0.50:3306)..
Sun Apr 20 19:12:01 2014 - [info] Ping(SELECT) succeeded, waiting until MySQL doesn't respond..
[root@192.168.0.20 ~]# 
複製代碼

其中"Ping(SELECT) succeeded, waiting until MySQL doesn't respond.."說明整個系統已經開始監控了。
11.關閉MHA Manage監控

關閉很簡單,使用masterha_stop命令完成。

[root@192.168.0.20 ~]# masterha_stop --conf=/etc/masterha/app1.cnf
Stopped app1 successfully.
[1]+  Exit 1                  nohup masterha_manager --conf=/etc/masterha/app1.cnf --remove_dead_master_conf --ignore_last_failover --manager_log=/data/mamanager.log
[root@192.168.0.20 ~]# 

12.配置VIP
vip配置可以採用兩種方式,一種通過keepalived的方式管理虛擬ip的浮動;另外一種通過腳本方式啓動虛擬ip的方式(即不需要keepalived或者heartbeat類似的軟件)。

1.keepalived方式管理虛擬ip,keepalived配置方法如下:

(1)下載軟件進行並進行安裝(兩臺master,準確的說一臺是master,另外一臺是備選master,在沒有切換以前是slave):

[root@192.168.0.50 ~]# wget http://www.keepalived.org/software/keepalived-1.2.12.tar.gz
複製代碼
tar xf keepalived-1.2.12.tar.gz           
cd keepalived-1.2.12
./configure --prefix=/usr/local/keepalived
make &&  make install
cp /usr/local/keepalived/etc/rc.d/init.d/keepalived /etc/init.d/
cp /usr/local/keepalived/etc/sysconfig/keepalived /etc/sysconfig/
mkdir /etc/keepalived
cp /usr/local/keepalived/etc/keepalived/keepalived.conf /etc/keepalived/
cp /usr/local/keepalived/sbin/keepalived /usr/sbin/
複製代碼

(2)配置keepalived的配置文件,在master上配置(192.168.0.50)

複製代碼
[root@192.168.0.50 ~]# cat /etc/keepalived/keepalived.conf
! Configuration File for keepalived

global_defs {
     notification_email {
     saltstack@163.com
   }
   notification_email_from [email protected]
   smtp_server 127.0.0.1
   smtp_connect_timeout 30
   router_id MySQL-HA
}

vrrp_instance VI_1 {
    state BACKUP
    interface eth1
    virtual_router_id 51
    priority 150
    advert_int 1
    nopreempt

    authentication {
    auth_type PASS
    auth_pass 1111
    }

    virtual_ipaddress {
        192.168.0.88
    }
}

[root@192.168.0.50 ~]# 
複製代碼

其中router_id MySQL HA表示設定keepalived組的名稱,將192.168.0.88這個虛擬ip綁定到該主機的eth1網卡上,並且設置了狀態爲backup模式,將keepalived的模式設置爲非搶佔模式(nopreempt),priority 150表示設置的優先級爲150。下面的配置略有不同,但是都是一個意思。
在候選master上配置(192.168.0.60)

複製代碼
[root@192.168.0.60 ~]# cat /etc/keepalived/keepalived.conf 
! Configuration File for keepalived

global_defs {
     notification_email {
     saltstack@163.com
   }
   notification_email_from [email protected]
   smtp_server 127.0.0.1
   smtp_connect_timeout 30
   router_id MySQL-HA
}

vrrp_instance VI_1 {
    state BACKUP
    interface eth1
    virtual_router_id 51
    priority 120
    advert_int 1
    nopreempt

    authentication {
    auth_type PASS
    auth_pass 1111
    }

    virtual_ipaddress {
        192.168.0.88
    }
}

[root@192.168.0.60 ~]# 
複製代碼

(3)啓動keepalived服務,在master上啓動並查看日誌

複製代碼
[root@192.168.0.50 ~]# /etc/init.d/keepalived start
Starting keepalived:                                       [  OK  ]
[root@192.168.0.50 ~]# tail -f /var/log/messages
Apr 20 20:22:16 192 Keepalived_healthcheckers[15334]: Opening file '/etc/keepalived/keepalived.conf'.
Apr 20 20:22:16 192 Keepalived_healthcheckers[15334]: Configuration is using : 7231 Bytes
Apr 20 20:22:16 192 kernel: IPVS: Connection hash table configured (size=4096, memory=64Kbytes)
Apr 20 20:22:16 192 kernel: IPVS: ipvs loaded.
Apr 20 20:22:16 192 Keepalived_healthcheckers[15334]: Using LinkWatch kernel netlink reflector...
Apr 20 20:22:19 192 Keepalived_vrrp[15335]: VRRP_Instance(VI_1) Transition to MASTER STATE
Apr 20 20:22:20 192 Keepalived_vrrp[15335]: VRRP_Instance(VI_1) Entering MASTER STATE
Apr 20 20:22:20 192 Keepalived_vrrp[15335]: VRRP_Instance(VI_1) setting protocol VIPs.
Apr 20 20:22:20 192 Keepalived_vrrp[15335]: VRRP_Instance(VI_1) Sending gratuitous ARPs on eth1 for 192.168.0.88
Apr 20 20:22:20 192 Keepalived_healthcheckers[15334]: Netlink reflector reports IP 192.168.0.88 added
Apr 20 20:22:25 192 Keepalived_vrrp[15335]: VRRP_Instance(VI_1) Sending gratuitous ARPs on eth1 for 192.168.0.88
複製代碼

發現已經將虛擬ip 192.168.0.88綁定了網卡eth1上。
(4)查看綁定情況

[root@192.168.0.50 ~]# ip addr | grep eth1
3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
    inet 192.168.0.50/24 brd 192.168.0.255 scope global eth1
    inet 192.168.0.88/32 scope global eth1
[root@192.168.0.50 ~]# 

在另外一臺服務器,候選master上啓動keepalived服務,並觀察

複製代碼
[root@192.168.0.60 ~]# /etc/init.d/keepalived start ; tail -f /var/log/messages
Starting keepalived:                                       [  OK  ]
Apr 20 20:26:18 192 Keepalived_vrrp[9472]: Registering gratuitous ARP shared channel
Apr 20 20:26:18 192 Keepalived_vrrp[9472]: Opening file '/etc/keepalived/keepalived.conf'.
Apr 20 20:26:18 192 Keepalived_vrrp[9472]: Configuration is using : 62976 Bytes
Apr 20 20:26:18 192 Keepalived_vrrp[9472]: Using LinkWatch kernel netlink reflector...
Apr 20 20:26:18 192 Keepalived_vrrp[9472]: VRRP_Instance(VI_1) Entering BACKUP STATE
Apr 20 20:26:18 192 Keepalived_vrrp[9472]: VRRP sockpool: [ifindex(3), proto(112), unicast(0), fd(10,11)]
Apr 20 20:26:18 192 Keepalived_healthcheckers[9471]: Netlink reflector reports IP 192.168.80.138 added
Apr 20 20:26:18 192 Keepalived_healthcheckers[9471]: Netlink reflector reports IP 192.168.0.60 added
Apr 20 20:26:18 192 Keepalived_healthcheckers[9471]: Netlink reflector reports IP fe80::20c:29ff:fe9d:6a9e added
Apr 20 20:26:18 192 Keepalived_healthcheckers[9471]: Netlink reflector reports IP fe80::20c:29ff:fe9d:6aa8 added
Apr 20 20:26:18 192 Keepalived_healthcheckers[9471]: Registering Kernel netlink reflector
Apr 20 20:26:18 192 Keepalived_healthcheckers[9471]: Registering Kernel netlink command channel
Apr 20 20:26:18 192 Keepalived_healthcheckers[9471]: Opening file '/etc/keepalived/keepalived.conf'.
Apr 20 20:26:18 192 Keepalived_healthcheckers[9471]: Configuration is using : 7231 Bytes
Apr 20 20:26:18 192 kernel: IPVS: Registered protocols (TCP, UDP, AH, ESP)
Apr 20 20:26:18 192 kernel: IPVS: Connection hash table configured (size=4096, memory=64Kbytes)
Apr 20 20:26:18 192 kernel: IPVS: ipvs loaded.
Apr 20 20:26:18 192 Keepalived_healthcheckers[9471]: Using LinkWatch kernel netlink reflector...
複製代碼

從上面的信息可以看到keepalived已經配置成功。
注意:

上面兩臺服務器的keepalived都設置爲了BACKUP模式,在keepalived中2種模式,分別是master->backup模式和backup->backup模式。這兩種模式有很大區別。在master->backup模式下,一旦主庫宕機,虛擬ip會自動漂移到從庫,當主庫修復後,keepalived啓動後,還會把虛擬ip搶佔過來,即使設置了非搶佔模式(nopreempt)搶佔ip的動作也會發生。在backup->backup模式下,當主庫宕機後虛擬ip會自動漂移到從庫上,當原主庫恢復和keepalived服務啓動後,並不會搶佔新主的虛擬ip,即使是優先級高於從庫的優先級別,也不會發生搶佔。爲了減少ip漂移次數,通常是把修復好的主庫當做新的備庫。

(5)MHA引入keepalived(MySQL服務進程掛掉時通過MHA 停止keepalived):

要想把keepalived服務引入MHA,我們只需要修改切換是觸發的腳本文件master_ip_failover即可,在該腳本中添加在master發生宕機時對keepalived的處理。

編輯腳本/usr/local/bin/master_ip_failover,修改後如下,我對perl不熟悉,所以我這裏完整貼出該腳本(主庫上操作,192.168.0.50)。

在MHA Manager修改腳本修改後的內容如下(參考資料比較少):

複製代碼
#!/usr/bin/env perl

use strict;
use warnings FATAL => 'all';

use Getopt::Long;

my (
    $command,          $ssh_user,        $orig_master_host, $orig_master_ip,
    $orig_master_port, $new_master_host, $new_master_ip,    $new_master_port
);

my $vip = '192.168.0.88';
my $ssh_start_vip = "/etc/init.d/keepalived start";
my $ssh_stop_vip = "/etc/init.d/keepalived stop";

GetOptions(
    'command=s'          => \$command,
    'ssh_user=s'         => \$ssh_user,
    'orig_master_host=s' => \$orig_master_host,
    'orig_master_ip=s'   => \$orig_master_ip,
    'orig_master_port=i' => \$orig_master_port,
    'new_master_host=s'  => \$new_master_host,
    'new_master_ip=s'    => \$new_master_ip,
    'new_master_port=i'  => \$new_master_port,
);

exit &main();

sub main {

    print "\n\nIN SCRIPT TEST====$ssh_stop_vip==$ssh_start_vip===\n\n";

    if ( $command eq "stop" || $command eq "stopssh" ) {

        my $exit_code = 1;
        eval {
            print "Disabling the VIP on old master: $orig_master_host \n";
            &stop_vip();
            $exit_code = 0;
        };
        if ($@) {
            warn "Got Error: $@\n";
            exit $exit_code;
        }
        exit $exit_code;
    }
    elsif ( $command eq "start" ) {

        my $exit_code = 10;
        eval {
            print "Enabling the VIP - $vip on the new master - $new_master_host \n";
            &start_vip();
            $exit_code = 0;
        };
        if ($@) {
            warn $@;
            exit $exit_code;
        }
        exit $exit_code;
    }
    elsif ( $command eq "status" ) {
        print "Checking the Status of the script.. OK \n";
        #`ssh $ssh_user\@cluster1 \" $ssh_start_vip \"`;
        exit 0;
    }
    else {
        &usage();
        exit 1;
    }
}

# A simple system call that enable the VIP on the new master
sub start_vip() {
    `ssh $ssh_user\@$new_master_host \" $ssh_start_vip \"`;
}
# A simple system call that disable the VIP on the old_master
sub stop_vip() {
return 0  unless  ($ssh_user); `ssh $ssh_user\@$orig_master_host \" $ssh_stop_vip \"`; } sub usage { print
"Usage: master_ip_failover --command=start|stop|stopssh|status --orig_master_host=host --orig_master_ip=ip --orig_master_port=port --new_master_host=host --new_master_ip=ip --new_master_port=port\n"; }
複製代碼

現在已經修改這個腳本了,我們現在打開在上面提到過的參數,再檢查集羣狀態,看是否會報錯。

[root@192.168.0.20 ~]# grep 'master_ip_failover_script' /etc/masterha/app1.cnf
master_ip_failover_script= /usr/local/bin/master_ip_failover
[root@192.168.0.20 ~]# 
複製代碼
[root@192.168.0.20 ~]# masterha_check_repl --conf=/etc/masterha/app1.cnf  
Sun Apr 20 23:10:01 2014 - [info] Slaves settings check done.
Sun Apr 20 23:10:01 2014 - [info] 
192.168.0.50 (current master)
 +--192.168.0.60
 +--192.168.0.70

Sun Apr 20 23:10:01 2014 - [info] Checking replication health on 192.168.0.60..
Sun Apr 20 23:10:01 2014 - [info]  ok.
Sun Apr 20 23:10:01 2014 - [info] Checking replication health on 192.168.0.70..
Sun Apr 20 23:10:01 2014 - [info]  ok.
Sun Apr 20 23:10:01 2014 - [info] Checking master_ip_failover_script status:
Sun Apr 20 23:10:01 2014 - [info]   /usr/local/bin/master_ip_failover --command=status --ssh_user=root --orig_master_host=192.168.0.50 --orig_master_ip=192.168.0.50 --orig_master_port=3306 
Sun Apr 20 23:10:01 2014 - [info]  OK.
Sun Apr 20 23:10:01 2014 - [warning] shutdown_script is not defined.
Sun Apr 20 23:10:01 2014 - [info] Got exit code 0 (Not master dead).

MySQL Replication Health is OK.
複製代碼

可以看見已經沒有報錯了。哈哈
 /usr/local/bin/master_ip_failover添加或者修改的內容意思是當主庫數據庫發生故障時,會觸發MHA切換,MHA Manager會停掉主庫上的keepalived服務,觸發虛擬ip漂移到備選從庫,從而完成切換。當然可以在keepalived裏面引入腳本,這個腳本監控mysql是否正常運行,如果不正常,則調用該腳本殺掉keepalived進程。

2.通過腳本的方式管理VIP。這裏是修改/usr/local/bin/master_ip_failover,也可以使用其他的語言完成,比如php語言。使用php腳本編寫的failover這裏就不介紹了。修改完成後內容如下,而且如果使用腳本管理vip的話,需要手動在master服務器上綁定一個vip(發現修改修改對perl竟然有感覺了。難道我適合學Perl?^_^)

[root@192.168.0.50 ~]# /sbin/ifconfig eth1:1 192.168.0.88/24

通過腳本來維護vip的測試我這裏就不說明了,童鞋們自行測試,腳本如下(測試通過)

複製代碼
#!/usr/bin/env perl

use strict;
use warnings FATAL => 'all';

use Getopt::Long;

my (
    $command,          $ssh_user,        $orig_master_host, $orig_master_ip,
    $orig_master_port, $new_master_host, $new_master_ip,    $new_master_port
);

my $vip = '192.168.0.88/24';
my $key = '1';
my $ssh_start_vip = "/sbin/ifconfig eth1:$key $vip";
my $ssh_stop_vip = "/sbin/ifconfig eth1:$key down";

GetOptions(
    'command=s'          => \$command,
    'ssh_user=s'         => \$ssh_user,
    'orig_master_host=s' => \$orig_master_host,
    'orig_master_ip=s'   => \$orig_master_ip,
    'orig_master_port=i' => \$orig_master_port,
    'new_master_host=s'  => \$new_master_host,
    'new_master_ip=s'    => \$new_master_ip,
    'new_master_port=i'  => \$new_master_port,
);

exit &main();

sub main {

    print "\n\nIN SCRIPT TEST====$ssh_stop_vip==$ssh_start_vip===\n\n";

    if ( $command eq "stop" || $command eq "stopssh" ) {

        my $exit_code = 1;
        eval {
            print "Disabling the VIP on old master: $orig_master_host \n";
            &stop_vip();
            $exit_code = 0;
        };
        if ($@) {
            warn "Got Error: $@\n";
            exit $exit_code;
        }
        exit $exit_code;
    }
    elsif ( $command eq "start" ) {

        my $exit_code = 10;
        eval {
            print "Enabling the VIP - $vip on the new master - $new_master_host \n";
            &start_vip();
            $exit_code = 0;
        };
        if ($@) {
            warn $@;
            exit $exit_code;
        }
        exit $exit_code;
    }
    elsif ( $command eq "status" ) {
        print "Checking the Status of the script.. OK \n";
        exit 0;
    }
    else {
        &usage();
        exit 1;
    }
}

sub start_vip() {
    `ssh $ssh_user\@$new_master_host \" $ssh_start_vip \"`;
}
sub stop_vip() {
 return 0  unless  ($ssh_user); `ssh $ssh_user\@$orig_master_host \" $ssh_stop_vip \"`; } sub usage { print
"Usage: master_ip_failover --command=start|stop|stopssh|status --orig_master_host=host --orig_master_ip=ip --orig_master_port=port --new_master_host=host --new_master_ip=ip --new_master_port=port\n"; }
複製代碼

爲了防止腦裂發生,推薦生產環境採用腳本的方式來管理虛擬ip,而不是使用keepalived來完成。到此爲止,基本MHA集羣已經配置完畢。接下來就是實際的測試環節了。通過一些測試來看一下MHA到底是如何進行工作的。下面將從MHA自動failover,我們手動failover,在線切換三種方式來介紹MHA的工作情況。

一.自動Failover(必須先啓動MHA Manager,否則無法自動切換,當然手動切換不需要開啓MHA Manager監控。各位童鞋請參考前面啓動MHA Manager)

測試環境再次貼一下,文章太長,自己都搞暈了。

角色                    ip地址          主機名          server_id               類型
Monitor host            192.168.0.20    server01            -                   監控複製組
Master                  192.168.0.50    server02            1                   寫入
Candicate master        192.168.0.60    server03            2                   讀
Slave                   192.168.0.70    server04            3

自動failover模擬測試的操作步驟如下。
(1)使用sysbench生成測試數據(使用yum快速安裝)

 yum install sysbench -y

在主庫(192.168.0.50)上進行sysbench數據生成,在sbtest庫下生成sbtest表,共100W記錄。

[root@192.168.0.50 ~]# sysbench --test=oltp --oltp-table-size=1000000 --oltp-read-only=off --init-rng=on --num-threads=16 --max-requests=0 --oltp-dist-type=uniform --max-time=1800 --mysql-user=root --mysql-socket=/tmp/mysql.sock --mysql-password=123456 --db-driver=mysql --mysql-table-engine=innodb --oltp-test-mode=complex prepare

(2)停掉slave sql線程,模擬主從延時。(192.168.0.60)

mysql> stop slave io_thread;
Query OK, 0 rows affected (0.08 sec)

mysql> 

另外一臺slave我們沒有停止io線程,所以還在繼續接收日誌。

(3)模擬sysbench壓力測試。

在主庫上(192.168.0.50)進行壓力測試,持續時間爲3分鐘,產生大量的binlog。

複製代碼
[root@192.168.0.50 ~]# sysbench --test=oltp --oltp-table-size=1000000 --oltp-read-only=off --init-rng=on --num-threads=16 --max-requests=0 --oltp-dist-type=uniform --max-time=180 --mysql-user=root --mysql-socket=/tmp/mysql.sock --mysql-password=123456 --db-driver=mysql --mysql-table-engine=innodb --oltp-test-mode=complex run 
sysbench 0.4.12:  multi-threaded system evaluation benchmark

Running the test with following options:
Number of threads: 16
Initializing random number generator from timer.


Doing OLTP test.
Running mixed OLTP test
Using Uniform distribution
Using "BEGIN" for starting transactions
Using auto_inc on the id column
Threads started!
Time limit exceeded, exiting...
(last message repeated 15 times)
Done.

OLTP test statistics:
    queries performed:
        read:                            15092
        write:                           5390
        other:                           2156
        total:                           22638
    transactions:                        1078   (5.92 per sec.)
    deadlocks:                           0      (0.00 per sec.)
    read/write requests:                 20482  (112.56 per sec.)
    other operations:                    2156   (11.85 per sec.)

Test execution summary:
    total time:                          181.9728s
    total number of events:              1078
    total time taken by event execution: 2910.4518
    per-request statistics:
         min:                                934.29ms
         avg:                               2699.86ms
         max:                               7679.95ms
         approx.  95 percentile:            4441.47ms

Threads fairness:
    events (avg/stddev):           67.3750/1.49
    execution time (avg/stddev):   181.9032/0.11
複製代碼

(4)開啓slave(192.168.0.60)上的IO線程,追趕落後於master的binlog。

mysql> start slave io_thread;     
Query OK, 0 rows affected (0.00 sec)

mysql> 

(5)殺掉主庫mysql進程,模擬主庫發生故障,進行自動failover操作。

[root@192.168.0.50 ~]# pkill -9 mysqld

(6)查看MHA切換日誌,瞭解整個切換過程,在192.168.0.20上查看日誌:

複製代碼
[root@192.168.0.20 ~]# cat /var/log/masterha/app1/manager.log 
Mon Apr 21 20:15:45 2014 - [warning] Got error on MySQL select ping: 2006 (MySQL server has gone away)
Mon Apr 21 20:15:45 2014 - [info] Executing seconary network check script: /usr/local/bin/masterha_secondary_check -s server03 -s server02 --user=root --master_host=server02 --master_ip=192.168.0.50 --master_  Creating /tmp if not exists..    ok.
  Checking output directory is accessible or not..
   ok.
  Binlog found at /data/mysql, up to mysql-bin.000018
Mon Apr 21 20:15:48 2014 - [warning] Global configuration file /etc/masterha_default.cnf not found. Skipping.
Mon Apr 21 20:15:48 2014 - [info] Reading application default configurations from /etc/masterha/app1.cnf..
Mon Apr 21 20:15:48 2014 - [info] Reading server configurations from /etc/masterha/app1.cnf..
ble from server03. OK.
Monitoring server server02 is reachable, Master is not reachable from server02. OK.
Mon Apr 21 20:15:46 2014 - [info] Master is not reachable from all other monitoring servers. Failover should start.
Mon Apr 21 20:15:46 2014 - [warning] Got error on MySQL connect: 2013 (Lost connection to MySQL server at 'reading initial communication packet', system error: 111)
Mon Apr 21 20:15:46 2014 - [warning] Connection failed 1 time(s)..
Mon Apr 21 20:15:47 2014 - [warning] Got error on MySQL connect: 2013 (Lost connection to MySQL server at 'reading initial communication packet', system error: 111)
Mon Apr 21 20:15:47 2014 - [warning] Connection failed 2 time(s)..
Mon Apr 21 20:15:48 2014 - [warning] Got error on MySQL connect: 2013 (Lost connection to MySQL server at 'reading initial communication packet', system error: 111)
Mon Apr 21 20:15:48 2014 - [warning] Connection failed 3 time(s)..
Mon Apr 21 20:15:48 2014 - [warning] Master is not reachable from health checker!
Mon Apr 21 20:15:48 2014 - [warning] Master 192.168.0.50(192.168.0.50:3306) is not reachable!
Mon Apr 21 20:15:48 2014 - [warning] SSH is reachable.
Mon Apr 21 20:15:48 2014 - [info] Connecting to a master server failed. Reading configuration file /etc/masterha_default.cnf and /etc/masterha/app1.cnf again, and trying to connect to all servers to check server status..
Mon Apr 21 20:15:48 2014 - [warning] Global configuration file /etc/masterha_default.cnf not found. Skipping.
Mon Apr 21 20:15:48 2014 - [info] Reading application default configurations from /etc/masterha/app1.cnf..
Mon Apr 21 20:15:48 2014 - [info] Reading server configurations from /etc/masterha/app1.cnf..
Mon Apr 21 20:15:48 2014 - [info] Dead Servers:
Mon Apr 21 20:15:48 2014 - [info]   192.168.0.50(192.168.0.50:3306)
Mon Apr 21 20:15:48 2014 - [info] Alive Servers:
Mon Apr 21 20:15:48 2014 - [info]   192.168.0.60(192.168.0.60:3306)
Mon Apr 21 20:15:48 2014 - [info]   192.168.0.70(192.168.0.70:3306)
Mon Apr 21 20:15:48 2014 - [info] Alive Slaves:
Mon Apr 21 20:15:48 2014 - [info]   192.168.0.60(192.168.0.60:3306)  Version=5.5.19-ndb-7.2.4-gpl-log (oldest major version between slaves) log-bin:enabled
Mon Apr 21 20:15:48 2014 - [info]     Replicating from 192.168.0.50(192.168.0.50:3306)
Mon Apr 21 20:15:48 2014 - [info]     Primary candidate for the new Master (candidate_master is set)
Mon Apr 21 20:15:48 2014 - [info]   192.168.0.70(192.168.0.70:3306)  Version=5.5.19-ndb-7.2.4-gpl-log (oldest major version between slaves) log-bin:enabled
Mon Apr 21 20:15:48 2014 - [info]     Replicating from 192.168.0.50(192.168.0.50:3306)
Mon Apr 21 20:15:48 2014 - [info] Checking slave configurations..
Mon Apr 21 20:15:48 2014 - [info] Checking replication filtering settings..
Mon Apr 21 20:15:48 2014 - [info]  Replication filtering check ok.
Mon Apr 21 20:15:48 2014 - [info] Master is down!
Mon Apr 21 20:15:48 2014 - [info] Terminating monitoring script.
Mon Apr 21 20:15:48 2014 - [info] Got exit code 20 (Master dead).
Mon Apr 21 20:15:48 2014 - [info] MHA::MasterFailover version 0.53.
Mon Apr 21 20:15:48 2014 - [info] Starting master failover.
Mon Apr 21 20:15:48 2014 - [info] 
Mon Apr 21 20:15:48 2014 - [info] * Phase 1: Configuration Check Phase..
Mon Apr 21 20:15:48 2014 - [info] 
Mon Apr 21 20:15:48 2014 - [info] Dead Servers:
Mon Apr 21 20:15:48 2014 - [info]   192.168.0.50(192.168.0.50:3306)
Mon Apr 21 20:15:48 2014 - [info] Checking master reachability via mysql(double check)..
Mon Apr 21 20:15:48 2014 - [info]  ok.
Mon Apr 21 20:15:48 2014 - [info] Alive Servers:
Mon Apr 21 20:15:48 2014 - [info]   192.168.0.60(192.168.0.60:3306)
Mon Apr 21 20:15:48 2014 - [info]   192.168.0.70(192.168.0.70:3306)
Mon Apr 21 20:15:48 2014 - [info] Alive Slaves:
Mon Apr 21 20:15:48 2014 - [info]   192.168.0.60(192.168.0.60:3306)  Version=5.5.19-ndb-7.2.4-gpl-log (oldest major version between slaves) log-bin:enabled
Mon Apr 21 20:15:48 2014 - [info]     Replicating from 192.168.0.50(192.168.0.50:3306)
Mon Apr 21 20:15:48 2014 - [info]     Primary candidate for the new Master (candidate_master is set)
Mon Apr 21 20:15:48 2014 - [info]   192.168.0.70(192.168.0.70:3306)  Version=5.5.19-ndb-7.2.4-gpl-log (oldest major version between slaves) log-bin:enabled
Mon Apr 21 20:15:48 2014 - [info]     Replicating from 192.168.0.50(192.168.0.50:3306)
Mon Apr 21 20:15:49 2014 - [info] ** Phase 1: Configuration Check Phase completed.
Mon Apr 21 20:15:49 2014 - [info] 
Mon Apr 21 20:15:49 2014 - [info] * Phase 2: Dead Master Shutdown Phase..
Mon Apr 21 20:15:49 2014 - [info] 
Mon Apr 21 20:15:49 2014 - [info] Forcing shutdown so that applications never connect to the current master..
Mon Apr 21 20:15:49 2014 - [info] Executing master IP deactivatation script:
Mon Apr 21 20:15:49 2014 - [info]   /usr/local/bin/master_ip_failover --orig_master_host=192.168.0.50 --orig_master_ip=192.168.0.50 --orig_master_port=3306 --command=stopssh --ssh_user=root  


IN SCRIPT TEST====/etc/init.d/keepalived stop==/etc/init.d/keepalived start===

Disabling the VIP on old master: 192.168.0.50 
Mon Apr 21 20:15:49 2014 - [info]  done.
Mon Apr 21 20:15:49 2014 - [warning] shutdown_script is not set. Skipping explicit shutting down of the dead master.
Mon Apr 21 20:15:49 2014 - [info] * Phase 2: Dead Master Shutdown Phase completed.
Mon Apr 21 20:15:49 2014 - [info] 
Mon Apr 21 20:15:49 2014 - [info] * Phase 3: Master Recovery Phase..
Mon Apr 21 20:15:49 2014 - [info] 
Mon Apr 21 20:15:49 2014 - [info] * Phase 3.1: Getting Latest Slaves Phase..
Mon Apr 21 20:15:49 2014 - [info] 
Mon Apr 21 20:15:49 2014 - [info] The latest binary log file/position on all slaves is mysql-bin.000018:112
Mon Apr 21 20:15:49 2014 - [info] Latest slaves (Slaves that received relay log files to the latest):
Mon Apr 21 20:15:49 2014 - [info]   192.168.0.60(192.168.0.60:3306)  Version=5.5.19-ndb-7.2.4-gpl-log (oldest major version between slaves) log-bin:enabled
Mon Apr 21 20:15:49 2014 - [info]     Replicating from 192.168.0.50(192.168.0.50:3306)
Mon Apr 21 20:15:49 2014 - [info]     Primary candidate for the new Master (candidate_master is set)
Mon Apr 21 20:15:49 2014 - [info]   192.168.0.70(192.168.0.70:3306)  Version=5.5.19-ndb-7.2.4-gpl-log (oldest major version between slaves) log-bin:enabled
Mon Apr 21 20:15:49 2014 - [info]     Replicating from 192.168.0.50(192.168.0.50:3306)
Mon Apr 21 20:15:49 2014 - [info] The oldest binary log file/position on all slaves is mysql-bin.000018:112
Mon Apr 21 20:15:49 2014 - [info] Oldest slaves:
Mon Apr 21 20:15:49 2014 - [info]   192.168.0.60(192.168.0.60:3306)  Version=5.5.19-ndb-7.2.4-gpl-log (oldest major version between slaves) log-bin:enabled
Mon Apr 21 20:15:49 2014 - [info]     Replicating from 192.168.0.50(192.168.0.50:3306)
Mon Apr 21 20:15:49 2014 - [info]     Primary candidate for the new Master (candidate_master is set)
Mon Apr 21 20:15:49 2014 - [info]   192.168.0.70(192.168.0.70:3306)  Version=5.5.19-ndb-7.2.4-gpl-log (oldest major version between slaves) log-bin:enabled
Mon Apr 21 20:15:49 2014 - [info]     Replicating from 192.168.0.50(192.168.0.50:3306)
Mon Apr 21 20:15:49 2014 - [info] 
Mon Apr 21 20:15:49 2014 - [info] * Phase 3.2: Saving Dead Master's Binlog Phase..
Mon Apr 21 20:15:49 2014 - [info] 
Mon Apr 21 20:15:49 2014 - [info] Fetching dead master's binary logs..
Mon Apr 21 20:15:49 2014 - [info] Executing command on the dead master 192.168.0.50(192.168.0.50:3306): save_binary_logs --command=save --start_file=mysql-bin.000018  --start_pos=112 --binlog_dir=/data/mysql --output_file=/tmp/saved_master_binlog_from_192.168.0.50_3306_20140421201548.binlog --handle_raw_binlog=1 --disable_log_bin=0 --manager_version=0.53
  Creating /tmp if not exists..    ok.
 Concat binary/relay logs from mysql-bin.000018 pos 112 to mysql-bin.000018 EOF into /tmp/saved_master_binlog_from_192.168.0.50_3306_20140421201548.binlog ..
  Dumping binlog format description event, from position 0 to 112.. ok.
  Dumping effective binlog data from /data/mysql/mysql-bin.000018 position 112 to tail(131).. ok.
 Concat succeeded.
Mon Apr 21 20:15:50 2014 - [info] scp from root@192.168.0.50:/tmp/saved_master_binlog_from_192.168.0.50_3306_20140421201548.binlog to local:/var/log/masterha/app1.log/saved_master_binlog_from_192.168.0.50_3306_20140421201548.binlog succeeded.
Mon Apr 21 20:15:50 2014 - [info] HealthCheck: SSH to 192.168.0.60 is reachable.
Mon Apr 21 20:15:50 2014 - [info] HealthCheck: SSH to 192.168.0.70 is reachable.
Mon Apr 21 20:15:50 2014 - [info] 
Mon Apr 21 20:15:50 2014 - [info] * Phase 3.3: Determining New Master Phase..
Mon Apr 21 20:15:50 2014 - [info] 
Mon Apr 21 20:15:50 2014 - [info] Finding the latest slave that has all relay logs for recovering other slaves..
Mon Apr 21 20:15:50 2014 - [info] All slaves received relay logs to the same position. No need to resync each other.
Mon Apr 21 20:15:50 2014 - [info] Searching new master from slaves..
Mon Apr 21 20:15:50 2014 - [info]  Candidate masters from the configuration file:
Mon Apr 21 20:15:50 2014 - [info]   192.168.0.60(192.168.0.60:3306)  Version=5.5.19-ndb-7.2.4-gpl-log (oldest major version between slaves) log-bin:enabled
Mon Apr 21 20:15:50 2014 - [info]     Replicating from 192.168.0.50(192.168.0.50:3306)
Mon Apr 21 20:15:50 2014 - [info]     Primary candidate for the new Master (candidate_master is set)
Mon Apr 21 20:15:50 2014 - [info]  Non-candidate masters:
Mon Apr 21 20:15:50 2014 - [info]  Searching from candidate_master slaves which have received the latest relay log events..
Mon Apr 21 20:15:50 2014 - [info] New master is 192.168.0.60(192.168.0.60:3306)
Mon Apr 21 20:15:50 2014 - [info] Starting master failover..
Mon Apr 21 20:15:50 2014 - [info] 
From:
192.168.0.50 (current master)
 +--192.168.0.60
 +--192.168.0.70

To:
192.168.0.60 (new master)
 +--192.168.0.70
Mon Apr 21 20:15:50 2014 - [info] 
Mon Apr 21 20:15:50 2014 - [info] * Phase 3.3: New Master Diff Log Generation Phase..
Mon Apr 21 20:15:50 2014 - [info] 
Mon Apr 21 20:15:50 2014 - [info]  This server has all relay logs. No need to generate diff files from the latest slave.
Mon Apr 21 20:15:50 2014 - [info] Sending binlog..
Mon Apr 21 20:15:51 2014 - [info] scp from local:/var/log/masterha/app1.log/saved_master_binlog_from_192.168.0.50_3306_20140421201548.binlog to root@192.168.0.60:/tmp/saved_master_binlog_from_192.168.0.50_3306_20140421201548.binlog succeeded.
Mon Apr 21 20:15:51 2014 - [info] 
Mon Apr 21 20:15:51 2014 - [info] * Phase 3.4: Master Log Apply Phase..
Mon Apr 21 20:15:51 2014 - [info] 
Mon Apr 21 20:15:51 2014 - [info] *NOTICE: If any error happens from this phase, manual recovery is needed.
Mon Apr 21 20:15:51 2014 - [info] Starting recovery on 192.168.0.60(192.168.0.60:3306)..
Mon Apr 21 20:15:51 2014 - [info]  Generating diffs succeeded.
Mon Apr 21 20:15:51 2014 - [info] Waiting until all relay logs are applied.
Mon Apr 21 20:15:51 2014 - [info]  done.
Mon Apr 21 20:15:51 2014 - [info] Getting slave status..
Mon Apr 21 20:15:51 2014 - [info] This slave(192.168.0.60)'s Exec_Master_Log_Pos equals to Read_Master_Log_Pos(mysql-bin.000018:112). No need to recover from Exec_Master_Log_Pos.
Mon Apr 21 20:15:51 2014 - [info] Connecting to the target slave host 192.168.0.60, running recover script..
Mon Apr 21 20:15:51 2014 - [info] Executing command: apply_diff_relay_logs --command=apply --slave_user=root --slave_host=192.168.0.60 --slave_ip=192.168.0.60  --slave_port=3306 --apply_files=/tmp/saved_master_binlog_from_192.168.0.50_3306_20140421201548.binlog --workdir=/tmp --target_version=5.5.19-ndb-7.2.4-gpl-log --timestamp=20140421201548 --handle_raw_binlog=1 --disable_log_bin=0 --manager_version=0.53 --slave_pass=xxx
Mon Apr 21 20:15:51 2014 - [info] 
Applying differential binary/relay log files /tmp/saved_master_binlog_from_192.168.0.50_3306_20140421201548.binlog on 192.168.0.60:3306. This may take long time...
Applying log files succeeded.
Mon Apr 21 20:15:51 2014 - [info]  All relay logs were successfully applied.
Mon Apr 21 20:15:51 2014 - [info] Getting new master's binlog name and position..
Mon Apr 21 20:15:51 2014 - [info]  mysql-bin.000022:506716
Mon Apr 21 20:15:51 2014 - [info]  All other slaves should start replication from here. Statement should be: CHANGE MASTER TO MASTER_HOST='192.168.0.60', MASTER_PORT=3306, MASTER_LOG_FILE='mysql-bin.000022', MASTER_LOG_POS=506716, MASTER_USER='repl', MASTER_PASSWORD='xxx';
Mon Apr 21 20:15:51 2014 - [info] Executing master IP activate script:
Mon Apr 21 20:15:51 2014 - [info]   /usr/local/bin/master_ip_failover --command=start --ssh_user=root --orig_master_host=192.168.0.50 --orig_master_ip=192.168.0.50 --orig_master_port=3306 --new_master_host=192.168.0.60 --new_master_ip=192.168.0.60 --new_master_port=3306  


IN SCRIPT TEST====/etc/init.d/keepalived stop==/etc/init.d/keepalived start===

Enabling the VIP - 192.168.0.88 on the new master - 192.168.0.60 
Mon Apr 21 20:15:52 2014 - [info]  OK.
Mon Apr 21 20:15:52 2014 - [info] Setting read_only=0 on 192.168.0.60(192.168.0.60:3306)..
Mon Apr 21 20:15:52 2014 - [info]  ok.
Mon Apr 21 20:15:52 2014 - [info] ** Finished master recovery successfully.
Mon Apr 21 20:15:52 2014 - [info] * Phase 3: Master Recovery Phase completed.
Mon Apr 21 20:15:52 2014 - [info] 
Mon Apr 21 20:15:52 2014 - [info] * Phase 4: Slaves Recovery Phase..
Mon Apr 21 20:15:52 2014 - [info] 
Mon Apr 21 20:15:52 2014 - [info] * Phase 4.1: Starting Parallel Slave Diff Log Generation Phase..
Mon Apr 21 20:15:52 2014 - [info] 
Mon Apr 21 20:15:52 2014 - [info] -- Slave diff file generation on host 192.168.0.70(192.168.0.70:3306) started, pid: 31321. Check tmp log /var/log/masterha/app1.log/192.168.0.70_3306_20140421201548.log if it takes time..
Mon Apr 21 20:15:52 2014 - [info] 
Mon Apr 21 20:15:52 2014 - [info] Log messages from 192.168.0.70 ...
Mon Apr 21 20:15:52 2014 - [info] 
Mon Apr 21 20:15:52 2014 - [info]  This server has all relay logs. No need to generate diff files from the latest slave.
Mon Apr 21 20:15:52 2014 - [info] End of log messages from 192.168.0.70.
Mon Apr 21 20:15:52 2014 - [info] -- 192.168.0.70(192.168.0.70:3306) has the latest relay log events.
Mon Apr 21 20:15:52 2014 - [info] Generating relay diff files from the latest slave succeeded.
Mon Apr 21 20:15:52 2014 - [info] 
Mon Apr 21 20:15:52 2014 - [info] * Phase 4.2: Starting Parallel Slave Log Apply Phase..
Mon Apr 21 20:15:52 2014 - [info] 
Mon Apr 21 20:15:52 2014 - [info] -- Slave recovery on host 192.168.0.70(192.168.0.70:3306) started, pid: 31323. Check tmp log /var/log/masterha/app1.log/192.168.0.70_3306_20140421201548.log if it takes time..
Mon Apr 21 20:15:52 2014 - [info] 
Mon Apr 21 20:15:52 2014 - [info] Log messages from 192.168.0.70 ...
Mon Apr 21 20:15:52 2014 - [info] 
Mon Apr 21 20:15:52 2014 - [info] Sending binlog..
Mon Apr 21 20:15:52 2014 - [info] scp from local:/var/log/masterha/app1.log/saved_master_binlog_from_192.168.0.50_3306_20140421201548.binlog to root@192.168.0.70:/tmp/saved_master_binlog_from_192.168.0.50_3306_20140421201548.binlog succeeded.
Mon Apr 21 20:15:52 2014 - [info] Starting recovery on 192.168.0.70(192.168.0.70:3306)..
Mon Apr 21 20:15:52 2014 - [info]  Generating diffs succeeded.
Mon Apr 21 20:15:52 2014 - [info] Waiting until all relay logs are applied.
Mon Apr 21 20:15:52 2014 - [info]  done.
Mon Apr 21 20:15:52 2014 - [info] Getting slave status..
Mon Apr 21 20:15:52 2014 - [info] This slave(192.168.0.70)'s Exec_Master_Log_Pos equals to Read_Master_Log_Pos(mysql-bin.000018:112). No need to recover from Exec_Master_Log_Pos.
Mon Apr 21 20:15:52 2014 - [info] Connecting to the target slave host 192.168.0.70, running recover script..
Mon Apr 21 20:15:52 2014 - [info] Executing command: apply_diff_relay_logs --command=apply --slave_user=root --slave_host=192.168.0.70 --slave_ip=192.168.0.70  --slave_port=3306 --apply_files=/tmp/saved_master_binlog_from_192.168.0.50_3306_20140421201548.binlog --workdir=/tmp --target_version=5.5.19-ndb-7.2.4-gpl-log --timestamp=20140421201548 --handle_raw_binlog=1 --disable_log_bin=0 --manager_version=0.53 --slave_pass=xxx
Mon Apr 21 20:15:52 2014 - [info] 
Applying differential binary/relay log files /tmp/saved_master_binlog_from_192.168.0.50_3306_20140421201548.binlog on 192.168.0.70:3306. This may take long time...
Applying log files succeeded.
Mon Apr 21 20:15:52 2014 - [info]  All relay logs were successfully applied.
Mon Apr 21 20:15:52 2014 - [info]  Resetting slave 192.168.0.70(192.168.0.70:3306) and starting replication from the new master 192.168.0.60(192.168.0.60:3306)..
Mon Apr 21 20:15:52 2014 - [info]  Executed CHANGE MASTER.
Mon Apr 21 20:15:52 2014 - [info]  Slave started.
Mon Apr 21 20:15:52 2014 - [info] End of log messages from 192.168.0.70.
Mon Apr 21 20:15:52 2014 - [info] -- Slave recovery on host 192.168.0.70(192.168.0.70:3306) succeeded.
Mon Apr 21 20:15:52 2014 - [info] All new slave servers recovered successfully.
Mon Apr 21 20:15:52 2014 - [info] 
Mon Apr 21 20:15:52 2014 - [info] * Phase 5: New master cleanup phease..
Mon Apr 21 20:15:52 2014 - [info] 
Mon Apr 21 20:15:52 2014 - [info] Resetting slave info on the new master..
Mon Apr 21 20:15:53 2014 - [info]  192.168.0.60: Resetting slave info succeeded.
Mon Apr 21 20:15:53 2014 - [info] Master failover to 192.168.0.60(192.168.0.60:3306) completed successfully.
Mon Apr 21 20:15:53 2014 - [info] Deleted server1 entry from /etc/masterha/app1.cnf .
Mon Apr 21 20:15:53 2014 - [info] 

----- Failover Report -----

app1: MySQL Master failover 192.168.0.50 to 192.168.0.60 succeeded

Master 192.168.0.50 is down!

Check MHA Manager logs at server01:/var/log/masterha/app1/manager.log for details.

Started automated(non-interactive) failover.
Invalidated master IP address on 192.168.0.50.
The latest slave 192.168.0.60(192.168.0.60:3306) has all relay logs for recovery.
Selected 192.168.0.60 as a new master.
192.168.0.60: OK: Applying all logs succeeded.
192.168.0.60: OK: Activated master IP address.
192.168.0.70: This host has the latest relay log events.
Generating relay diff files from the latest slave succeeded.
192.168.0.70: OK: Applying all logs succeeded. Slave started, replicating from 192.168.0.60.
192.168.0.60: Resetting slave info succeeded.
Master failover to 192.168.0.60(192.168.0.60:3306) completed successfully.
[root@192.168.0.20 ~]# 
複製代碼

看到最後的Master failover to 192.168.0.60(192.168.0.60:3306) completed successfully.說明備選master現在已經上位了。

從上面的輸出可以看出整個MHA的切換過程,共包括以下的步驟:

1.配置文件檢查階段,這個階段會檢查整個集羣配置文件配置

2.宕機的master處理,這個階段包括虛擬ip摘除操作,主機關機操作(這個我這裏還沒有實現,需要研究)

3.複製dead maste和最新slave相差的relay log,並保存到MHA Manger具體的目錄下

4.識別含有最新更新的slave

5.應用從master保存的二進制日誌事件(binlog events)

6.提升一個slave爲新的master進行復制

7.使其他的slave連接新的master進行復制

最後啓動MHA Manger監控,查看集羣裏面現在誰是master(在切換後監控就停止了。。。還有東西沒搞對?)後來在官方網站看到這句話就明白了 。

Running MHA Manager from daemontools

Currently MHA Manager process does not run as a daemon. If failover completed successfully or the master process was killed by accident, the manager stops working. To run as a daemon, daemontool. or any external daemon program can be used. Here is an example to run from daemontools.
[root@192.168.0.20 ~]# masterha_check_status --conf=/etc/masterha/app1.cnf
app1 (pid:23971) is running(0:PING_OK), master:192.168.0.60
[root@192.168.0.20 ~]# 

二.手動Failover(MHA Manager必須沒有運行

手動failover,這種場景意味着在業務上沒有啓用MHA自動切換功能,當主服務器故障時,人工手動調用MHA來進行故障切換操作,具體命令如下:

注意:如果,MHA manager檢測到沒有dead的server,將報錯,並結束failover: 

Mon Apr 21 21:23:33 2014 - [info] Dead Servers:
Mon Apr 21 21:23:33 2014 - [error][/usr/local/share/perl5/MHA/MasterFailover.pm, ln181] None of server is dead. Stop failover.
Mon Apr 21 21:23:33 2014 - [error][/usr/local/share/perl5/MHA/ManagerUtil.pm, ln178] Got ERROR:  at /usr/local/bin/masterha_master_switch line 53

進行手動切換命令如下:

[root@192.168.0.20 ~]# masterha_master_switch --master_state=dead --conf=/etc/masterha/app1.cnf --dead_master_host=192.168.0.50 --dead_master_port=3306 --new_master_host=192.168.0.60 --new_master_port=3306 --ignore_last_failover

輸出的信息會詢問你是否進行切換:

複製代碼
Mon Apr 21 21:28:00 2014 - [warning] Global configuration file /etc/masterha_default.cnf not found. Skipping.
Mon Apr 21 21:28:00 2014 - [info] Reading application default configurations from /etc/masterha/app1.cnf..
Mon Apr 21 21:28:00 2014 - [info] Reading server configurations from /etc/masterha/app1.cnf..
Mon Apr 21 21:28:00 2014 - [info] MHA::MasterFailover version 0.53.
Mon Apr 21 21:28:00 2014 - [info] Starting master failover.
Mon Apr 21 21:28:00 2014 - [info] 
Mon Apr 21 21:28:00 2014 - [info] * Phase 1: Configuration Check Phase..
Mon Apr 21 21:28:00 2014 - [info] 
Mon Apr 21 21:28:00 2014 - [info] Dead Servers:
Mon Apr 21 21:28:00 2014 - [info]   192.168.0.50(192.168.0.50:3306)
Mon Apr 21 21:28:00 2014 - [info] Checking master reachability via mysql(double check)..
Mon Apr 21 21:28:00 2014 - [info]  ok.
Mon Apr 21 21:28:00 2014 - [info] Alive Servers:
Mon Apr 21 21:28:00 2014 - [info]   192.168.0.60(192.168.0.60:3306)
Mon Apr 21 21:28:00 2014 - [info]   192.168.0.70(192.168.0.70:3306)
Mon Apr 21 21:28:00 2014 - [info] Alive Slaves:
Mon Apr 21 21:28:00 2014 - [info]   192.168.0.60(192.168.0.60:3306)  Version=5.5.19-ndb-7.2.4-gpl-log (oldest major version between slaves) log-bin:enabled
Mon Apr 21 21:28:00 2014 - [info]     Replicating from 192.168.0.50(192.168.0.50:3306)
Mon Apr 21 21:28:00 2014 - [info]     Primary candidate for the new Master (candidate_master is set)
Mon Apr 21 21:28:00 2014 - [info]   192.168.0.70(192.168.0.70:3306)  Version=5.5.19-ndb-7.2.4-gpl-log (oldest major version between slaves) log-bin:enabled
Mon Apr 21 21:28:00 2014 - [info]     Replicating from 192.168.0.50(192.168.0.50:3306)
Master 192.168.0.50 is dead. Proceed? (yes/NO): yes
Mon Apr 21 21:36:01 2014 - [info] ** Phase 1: Configuration Check Phase completed.
Mon Apr 21 21:36:01 2014 - [info] 
Mon Apr 21 21:36:01 2014 - [info] * Phase 2: Dead Master Shutdown Phase..
Mon Apr 21 21:36:01 2014 - [info] 
Mon Apr 21 21:36:01 2014 - [info] HealthCheck: SSH to 192.168.0.50 is reachable.
Mon Apr 21 21:36:01 2014 - [info] Forcing shutdown so that applications never connect to the current master..
Mon Apr 21 21:36:01 2014 - [info] Executing master IP deactivatation script:
Mon Apr 21 21:36:01 2014 - [info]   /usr/local/bin/master_ip_failover --orig_master_host=192.168.0.50 --orig_master_ip=192.168.0.50 --orig_master_port=3306 --command=stopssh --ssh_user=root  


IN SCRIPT TEST====/sbin/ifconfig eth1:1 down==/sbin/ifconfig eth1:1 192.168.0.88/24===

Disabling the VIP on old master: 192.168.0.50 
Mon Apr 21 21:36:02 2014 - [info]  done.
Mon Apr 21 21:36:02 2014 - [warning] shutdown_script is not set. Skipping explicit shutting down of the dead master.
Mon Apr 21 21:36:02 2014 - [info] * Phase 2: Dead Master Shutdown Phase completed.
Mon Apr 21 21:36:02 2014 - [info] 
Mon Apr 21 21:36:02 2014 - [info] * Phase 3: Master Recovery Phase..
Mon Apr 21 21:36:02 2014 - [info] 
Mon Apr 21 21:36:02 2014 - [info] * Phase 3.1: Getting Latest Slaves Phase..
Mon Apr 21 21:36:02 2014 - [info] 
Mon Apr 21 21:36:02 2014 - [info] The latest binary log file/position on all slaves is mysql-bin.000020:112
Mon Apr 21 21:36:02 2014 - [info] Latest slaves (Slaves that received relay log files to the latest):
Mon Apr 21 21:36:02 2014 - [info]   192.168.0.60(192.168.0.60:3306)  Version=5.5.19-ndb-7.2.4-gpl-log (oldest major version between slaves) log-bin:enabled
Mon Apr 21 21:36:02 2014 - [info]     Replicating from 192.168.0.50(192.168.0.50:3306)
Mon Apr 21 21:36:02 2014 - [info]     Primary candidate for the new Master (candidate_master is set)
Mon Apr 21 21:36:02 2014 - [info]   192.168.0.70(192.168.0.70:3306)  Version=5.5.19-ndb-7.2.4-gpl-log (oldest major version between slaves) log-bin:enabled
Mon Apr 21 21:36:02 2014 - [info]     Replicating from 192.168.0.50(192.168.0.50:3306)
Mon Apr 21 21:36:02 2014 - [info] The oldest binary log file/position on all slaves is mysql-bin.000020:112
Mon Apr 21 21:36:02 2014 - [info] Oldest slaves:
Mon Apr 21 21:36:02 2014 - [info]   192.168.0.60(192.168.0.60:3306)  Version=5.5.19-ndb-7.2.4-gpl-log (oldest major version between slaves) log-bin:enabled
Mon Apr 21 21:36:02 2014 - [info]     Replicating from 192.168.0.50(192.168.0.50:3306)
Mon Apr 21 21:36:02 2014 - [info]     Primary candidate for the new Master (candidate_master is set)
Mon Apr 21 21:36:02 2014 - [info]   192.168.0.70(192.168.0.70:3306)  Version=5.5.19-ndb-7.2.4-gpl-log (oldest major version between slaves) log-bin:enabled
Mon Apr 21 21:36:02 2014 - [info]     Replicating from 192.168.0.50(192.168.0.50:3306)
Mon Apr 21 21:36:02 2014 - [info] 
Mon Apr 21 21:36:02 2014 - [info] * Phase 3.2: Saving Dead Master's Binlog Phase..
Mon Apr 21 21:36:02 2014 - [info] 
Mon Apr 21 21:36:02 2014 - [info] Fetching dead master's binary logs..
Mon Apr 21 21:36:02 2014 - [info] Executing command on the dead master 192.168.0.50(192.168.0.50:3306): save_binary_logs --command=save --start_file=mysql-bin.000020  --start_pos=112 --binlog_dir=/data/mysql --output_file=/tmp/saved_master_binlog_from_192.168.0.50_3306_20140421212800.binlog --handle_raw_binlog=1 --disable_log_bin=0 --manager_version=0.53
  Creating /tmp if not exists..    ok.
 Concat binary/relay logs from mysql-bin.000020 pos 112 to mysql-bin.000020 EOF into /tmp/saved_master_binlog_from_192.168.0.50_3306_20140421212800.binlog ..
  Dumping binlog format description event, from position 0 to 112.. ok.
  Dumping effective binlog data from /data/mysql/mysql-bin.000020 position 112 to tail(131).. ok.
 Concat succeeded.
saved_master_binlog_from_192.168.0.50_3306_20140421212800.binlog                                                   100%  131     0.1KB/s   00:00    
Mon Apr 21 21:36:02 2014 - [info] scp from root@192.168.0.50:/tmp/saved_master_binlog_from_192.168.0.50_3306_20140421212800.binlog to local:/var/log/masterha/app1.log/saved_master_binlog_from_192.168.0.50_3306_20140421212800.binlog succeeded.
Mon Apr 21 21:36:02 2014 - [info] HealthCheck: SSH to 192.168.0.60 is reachable.
Mon Apr 21 21:36:03 2014 - [info] HealthCheck: SSH to 192.168.0.70 is reachable.
Mon Apr 21 21:36:03 2014 - [info] 
Mon Apr 21 21:36:03 2014 - [info] * Phase 3.3: Determining New Master Phase..
Mon Apr 21 21:36:03 2014 - [info] 
Mon Apr 21 21:36:03 2014 - [info] Finding the latest slave that has all relay logs for recovering other slaves..
Mon Apr 21 21:36:03 2014 - [info] All slaves received relay logs to the same position. No need to resync each other.
Mon Apr 21 21:36:03 2014 - [info] 192.168.0.60 can be new master.
Mon Apr 21 21:36:03 2014 - [info] New master is 192.168.0.60(192.168.0.60:3306)
Mon Apr 21 21:36:03 2014 - [info] Starting master failover..
Mon Apr 21 21:36:03 2014 - [info] 
From:
192.168.0.50 (current master)
 +--192.168.0.60
 +--192.168.0.70

To:
192.168.0.60 (new master)
 +--192.168.0.70

Starting master switch from 192.168.0.50(192.168.0.50:3306) to 192.168.0.60(192.168.0.60:3306)? (yes/NO): yes
Mon Apr 21 21:36:06 2014 - [info] New master decided manually is 192.168.0.60(192.168.0.60:3306)
Mon Apr 21 21:36:06 2014 - [info] 
Mon Apr 21 21:36:06 2014 - [info] * Phase 3.3: New Master Diff Log Generation Phase..
Mon Apr 21 21:36:06 2014 - [info] 
Mon Apr 21 21:36:06 2014 - [info]  This server has all relay logs. No need to generate diff files from the latest slave.
Mon Apr 21 21:36:06 2014 - [info] Sending binlog..
saved_master_binlog_from_192.168.0.50_3306_20140421212800.binlog                                                   100%  131     0.1KB/s   00:00    
Mon Apr 21 21:36:07 2014 - [info] scp from local:/var/log/masterha/app1.log/saved_master_binlog_from_192.168.0.50_3306_20140421212800.binlog to root@192.168.0.60:/tmp/saved_master_binlog_from_192.168.0.50_3306_20140421212800.binlog succeeded.
Mon Apr 21 21:36:07 2014 - [info] 
Mon Apr 21 21:36:07 2014 - [info] * Phase 3.4: Master Log Apply Phase..
Mon Apr 21 21:36:07 2014 - [info] 
Mon Apr 21 21:36:07 2014 - [info] *NOTICE: If any error happens from this phase, manual recovery is needed.
Mon Apr 21 21:36:07 2014 - [info] Starting recovery on 192.168.0.60(192.168.0.60:3306)..
Mon Apr 21 21:36:07 2014 - [info]  Generating diffs succeeded.
Mon Apr 21 21:36:07 2014 - [info] Waiting until all relay logs are applied.
Mon Apr 21 21:36:07 2014 - [info]  done.
Mon Apr 21 21:36:07 2014 - [info] Getting slave status..
Mon Apr 21 21:36:07 2014 - [info] This slave(192.168.0.60)'s Exec_Master_Log_Pos equals to Read_Master_Log_Pos(mysql-bin.000020:112). No need to recover from Exec_Master_Log_Pos.
Mon Apr 21 21:36:07 2014 - [info] Connecting to the target slave host 192.168.0.60, running recover script..
Mon Apr 21 21:36:07 2014 - [info] Executing command: apply_diff_relay_logs --command=apply --slave_user=root --slave_host=192.168.0.60 --slave_ip=192.168.0.60  --slave_port=3306 --apply_files=/tmp/saved_master_binlog_from_192.168.0.50_3306_20140421212800.binlog --workdir=/tmp --target_version=5.5.19-ndb-7.2.4-gpl-log --timestamp=20140421212800 --handle_raw_binlog=1 --disable_log_bin=0 --manager_version=0.53 --slave_pass=xxx
Mon Apr 21 21:36:07 2014 - [info] 
Applying differential binary/relay log files /tmp/saved_master_binlog_from_192.168.0.50_3306_20140421212800.binlog on 192.168.0.60:3306. This may take long time...
Applying log files succeeded.
Mon Apr 21 21:36:07 2014 - [info]  All relay logs were successfully applied.
Mon Apr 21 21:36:07 2014 - [info] Getting new master's binlog name and position..
Mon Apr 21 21:36:07 2014 - [info]  mysql-bin.000022:506716
Mon Apr 21 21:36:07 2014 - [info]  All other slaves should start replication from here. Statement should be: CHANGE MASTER TO MASTER_HOST='192.168.0.60', MASTER_PORT=3306, MASTER_LOG_FILE='mysql-bin.000022', MASTER_LOG_POS=506716, MASTER_USER='repl', MASTER_PASSWORD='xxx';
Mon Apr 21 21:36:07 2014 - [info] Executing master IP activate script:
Mon Apr 21 21:36:07 2014 - [info]   /usr/local/bin/master_ip_failover --command=start --ssh_user=root --orig_master_host=192.168.0.50 --orig_master_ip=192.168.0.50 --orig_master_port=3306 --new_master_host=192.168.0.60 --new_master_ip=192.168.0.60 --new_master_port=3306  


IN SCRIPT TEST====/sbin/ifconfig eth1:1 down==/sbin/ifconfig eth1:1 192.168.0.88/24===

Enabling the VIP - 192.168.0.88/24 on the new master - 192.168.0.60 
Mon Apr 21 21:36:08 2014 - [info]  OK.
Mon Apr 21 21:36:08 2014 - [info] Setting read_only=0 on 192.168.0.60(192.168.0.60:3306)..
Mon Apr 21 21:36:08 2014 - [info]  ok.
Mon Apr 21 21:36:08 2014 - [info] ** Finished master recovery successfully.
Mon Apr 21 21:36:08 2014 - [info] * Phase 3: Master Recovery Phase completed.
Mon Apr 21 21:36:08 2014 - [info] 
Mon Apr 21 21:36:08 2014 - [info] * Phase 4: Slaves Recovery Phase..
Mon Apr 21 21:36:08 2014 - [info] 
Mon Apr 21 21:36:08 2014 - [info] * Phase 4.1: Starting Parallel Slave Diff Log Generation Phase..
Mon Apr 21 21:36:08 2014 - [info] 
Mon Apr 21 21:36:08 2014 - [info] -- Slave diff file generation on host 192.168.0.70(192.168.0.70:3306) started, pid: 33518. Check tmp log /var/log/masterha/app1.log/192.168.0.70_3306_20140421212800.log if it takes time..
Mon Apr 21 21:36:08 2014 - [info] 
Mon Apr 21 21:36:08 2014 - [info] Log messages from 192.168.0.70 ...
Mon Apr 21 21:36:08 2014 - [info] 
Mon Apr 21 21:36:08 2014 - [info]  This server has all relay logs. No need to generate diff files from the latest slave.
Mon Apr 21 21:36:08 2014 - [info] End of log messages from 192.168.0.70.
Mon Apr 21 21:36:08 2014 - [info] -- 192.168.0.70(192.168.0.70:3306) has the latest relay log events.
Mon Apr 21 21:36:08 2014 - [info] Generating relay diff files from the latest slave succeeded.
Mon Apr 21 21:36:08 2014 - [info] 
Mon Apr 21 21:36:08 2014 - [info] * Phase 4.2: Starting Parallel Slave Log Apply Phase..
Mon Apr 21 21:36:08 2014 - [info] 
Mon Apr 21 21:36:08 2014 - [info] -- Slave recovery on host 192.168.0.70(192.168.0.70:3306) started, pid: 33520. Check tmp log /var/log/masterha/app1.log/192.168.0.70_3306_20140421212800.log if it takes time..
saved_master_binlog_from_192.168.0.50_3306_20140421212800.binlog                                                   100%  131     0.1KB/s   00:00    
Mon Apr 21 21:36:09 2014 - [info] 
Mon Apr 21 21:36:09 2014 - [info] Log messages from 192.168.0.70 ...
Mon Apr 21 21:36:09 2014 - [info] 
Mon Apr 21 21:36:08 2014 - [info] Sending binlog..
Mon Apr 21 21:36:08 2014 - [info] scp from local:/var/log/masterha/app1.log/saved_master_binlog_from_192.168.0.50_3306_20140421212800.binlog to root@192.168.0.70:/tmp/saved_master_binlog_from_192.168.0.50_3306_20140421212800.binlog succeeded.
Mon Apr 21 21:36:08 2014 - [info] Starting recovery on 192.168.0.70(192.168.0.70:3306)..
Mon Apr 21 21:36:08 2014 - [info]  Generating diffs succeeded.
Mon Apr 21 21:36:08 2014 - [info] Waiting until all relay logs are applied.
Mon Apr 21 21:36:08 2014 - [info]  done.
Mon Apr 21 21:36:08 2014 - [info] Getting slave status..
Mon Apr 21 21:36:08 2014 - [info] This slave(192.168.0.70)'s Exec_Master_Log_Pos equals to Read_Master_Log_Pos(mysql-bin.000020:112). No need to recover from Exec_Master_Log_Pos.
Mon Apr 21 21:36:08 2014 - [info] Connecting to the target slave host 192.168.0.70, running recover script..
Mon Apr 21 21:36:08 2014 - [info] Executing command: apply_diff_relay_logs --command=apply --slave_user=root --slave_host=192.168.0.70 --slave_ip=192.168.0.70  --slave_port=3306 --apply_files=/tmp/saved_master_binlog_from_192.168.0.50_3306_20140421212800.binlog --workdir=/tmp --target_version=5.5.19-ndb-7.2.4-gpl-log --timestamp=20140421212800 --handle_raw_binlog=1 --disable_log_bin=0 --manager_version=0.53 --slave_pass=xxx
Mon Apr 21 21:36:09 2014 - [info] 
Applying differential binary/relay log files /tmp/saved_master_binlog_from_192.168.0.50_3306_20140421212800.binlog on 192.168.0.70:3306. This may take long time...
Applying log files succeeded.
Mon Apr 21 21:36:09 2014 - [info]  All relay logs were successfully applied.
Mon Apr 21 21:36:09 2014 - [info]  Resetting slave 192.168.0.70(192.168.0.70:3306) and starting replication from the new master 192.168.0.60(192.168.0.60:3306)..
Mon Apr 21 21:36:09 2014 - [info]  Executed CHANGE MASTER.
Mon Apr 21 21:36:09 2014 - [info]  Slave started.
Mon Apr 21 21:36:09 2014 - [info] End of log messages from 192.168.0.70.
Mon Apr 21 21:36:09 2014 - [info] -- Slave recovery on host 192.168.0.70(192.168.0.70:3306) succeeded.
Mon Apr 21 21:36:09 2014 - [info] All new slave servers recovered successfully.
Mon Apr 21 21:36:09 2014 - [info] 
Mon Apr 21 21:36:09 2014 - [info] * Phase 5: New master cleanup phease..
Mon Apr 21 21:36:09 2014 - [info] 
Mon Apr 21 21:36:09 2014 - [info] Resetting slave info on the new master..
Mon Apr 21 21:36:09 2014 - [info]  192.168.0.60: Resetting slave info succeeded.
Mon Apr 21 21:36:09 2014 - [info] Master failover to 192.168.0.60(192.168.0.60:3306) completed successfully.
Mon Apr 21 21:36:09 2014 - [info] 

----- Failover Report -----

app1: MySQL Master failover 192.168.0.50 to 192.168.0.60 succeeded

Master 192.168.0.50 is down!

Check MHA Manager logs at server01 for details.

Started manual(interactive) failover.
Invalidated master IP address on 192.168.0.50.
The latest slave 192.168.0.60(192.168.0.60:3306) has all relay logs for recovery.
Selected 192.168.0.60 as a new master.
192.168.0.60: OK: Applying all logs succeeded.
192.168.0.60: OK: Activated master IP address.
192.168.0.70: This host has the latest relay log events.
Generating relay diff files from the latest slave succeeded.
192.168.0.70: OK: Applying all logs succeeded. Slave started, replicating from 192.168.0.60.
192.168.0.60: Resetting slave info succeeded.
Master failover to 192.168.0.60(192.168.0.60:3306) completed successfully.
複製代碼

上述模擬了master宕機的情況下手動把192.168.0.60提升爲主庫的操作過程。

三.在線進行切換

 在許多情況下, 需要將現有的主服務器遷移到另外一臺服務器上。 比如主服務器硬件故障,RAID 控制卡需要重建,將主服務器移到性能更好的服務器上等等。維護主服務器引起性能下降, 導致停機時間至少無法寫入數據。 另外, 阻塞或殺掉當前運行的會話會導致主主之間數據不一致的問題發生。 MHA 提供快速切換和優雅的阻塞寫入,這個切換過程只需要 0.5-2s 的時間,這段時間內數據是無法寫入的。在很多情況下,0.5-2s 的阻塞寫入是可以接受的。因此切換主服務器不需要計劃分配維護時間窗口。

MHA在線切換的大概過程:
1.檢測複製設置和確定當前主服務器
2.確定新的主服務器
3.阻塞寫入到當前主服務器
4.等待所有從服務器趕上覆制
5.授予寫入到新的主服務器
6.重新設置從服務器 

注意,在線切換的時候應用架構需要考慮以下兩個問題:

1.自動識別master和slave的問題(master的機器可能會切換),如果採用了vip的方式,基本可以解決這個問題。

2.負載均衡的問題(可以定義大概的讀寫比例,每臺機器可承擔的負載比例,當有機器離開集羣時,需要考慮這個問題)

爲了保證數據完全一致性,在最快的時間內完成切換,MHA的在線切換必須滿足以下條件纔會切換成功,否則會切換失敗。

1.所有slave的IO線程都在運行

2.所有slave的SQL線程都在運行

3.所有的show slave status的輸出中Seconds_Behind_Master參數小於或者等於running_updates_limit秒,如果在切換過程中不指定running_updates_limit,那麼默認情況下running_updates_limit爲1秒。

4.在master端,通過show processlist輸出,沒有一個更新花費的時間大於running_updates_limit秒。

在線切換步驟如下:

首先,停掉MHA監控:

[root@192.168.0.20 ~]# masterha_stop --conf=/etc/masterha/app1.cnf

其次,進行在線切換操作(模擬在線切換主庫操作,原主庫192.168.0.50變爲slave,192.168.0.60提升爲新的主庫)

[root@192.168.0.20 ~]# masterha_master_switch --conf=/etc/masterha/app1.cnf --master_state=alive --new_master_host=192.168.0.60 --new_master_port=3306 --orig_master_is_new_slave --running_updates_limit=10000

最後查看日誌,瞭解切換過程,輸出信息如下:

複製代碼
[root@192.168.0.20 ~]#  masterha_master_switch --conf=/etc/masterha/app1.cnf --master_state=alive --new_master_host=192.168.0.60 --new_master_port=3306 --orig_master_is_new_slave --running_updates_limit=10000
Wed Apr 23 00:27:39 2014 - [info] MHA::MasterRotate version 0.53.
Wed Apr 23 00:27:39 2014 - [info] Starting online master switch..
Wed Apr 23 00:27:39 2014 - [info] 
Wed Apr 23 00:27:39 2014 - [info] * Phase 1: Configuration Check Phase..
Wed Apr 23 00:27:39 2014 - [info] 
Wed Apr 23 00:27:39 2014 - [info] Reading default configuratoins from /etc/masterha_default.cnf..
Wed Apr 23 00:27:39 2014 - [info] Reading application default configurations from /etc/masterha/app1.cnf..
Wed Apr 23 00:27:39 2014 - [info] Reading server configurations from /etc/masterha/app1.cnf..
Wed Apr 23 00:27:39 2014 - [info] Multi-master configuration is detected. Current primary(writable) master is 192.168.0.50(192.168.0.50:3306)
Wed Apr 23 00:27:39 2014 - [info] Master configurations are as below: 
Master 192.168.0.60(192.168.0.60:3306), replicating from 192.168.0.50(192.168.0.50:3306), read-only
Master 192.168.0.50(192.168.0.50:3306), replicating from 192.168.0.60(192.168.0.60:3306)

Wed Apr 23 00:27:39 2014 - [info] Current Alive Master: 192.168.0.50(192.168.0.50:3306)
Wed Apr 23 00:27:39 2014 - [info] Alive Slaves:
Wed Apr 23 00:27:39 2014 - [info]   192.168.0.60(192.168.0.60:3306)  Version=5.5.19-ndb-7.2.4-gpl-log (oldest major version between slaves) log-bin:enabled
Wed Apr 23 00:27:39 2014 - [info]     Replicating from 192.168.0.50(192.168.0.50:3306)
Wed Apr 23 00:27:39 2014 - [info]     Primary candidate for the new Master (candidate_master is set)
Wed Apr 23 00:27:39 2014 - [info]   192.168.0.70(192.168.0.70:3306)  Version=5.5.19-ndb-7.2.4-gpl-log (oldest major version between slaves) log-bin:enabled
Wed Apr 23 00:27:39 2014 - [info]     Replicating from 192.168.0.50(192.168.0.50:3306)

It is better to execute FLUSH NO_WRITE_TO_BINLOG TABLES on the master before switching. Is it ok to execute on 192.168.0.50(192.168.0.50:3306)? (YES/no): yes
Wed Apr 23 00:27:40 2014 - [info] Executing FLUSH NO_WRITE_TO_BINLOG TABLES. This may take long time..
Wed Apr 23 00:27:40 2014 - [info]  ok.
Wed Apr 23 00:27:40 2014 - [info] Checking MHA is not monitoring or doing failover..
Wed Apr 23 00:27:40 2014 - [info] Checking replication health on 192.168.0.60..
Wed Apr 23 00:27:40 2014 - [info]  ok.
Wed Apr 23 00:27:40 2014 - [info] Checking replication health on 192.168.0.70..
Wed Apr 23 00:27:40 2014 - [info]  ok.
Wed Apr 23 00:27:40 2014 - [info] 192.168.0.60 can be new master.
Wed Apr 23 00:27:40 2014 - [info] 
From:
192.168.0.50 (current master)
 +--192.168.0.60
 +--192.168.0.70

To:
192.168.0.60 (new master)
 +--192.168.0.70
 +--192.168.0.50

Starting master switch from 192.168.0.50(192.168.0.50:3306) to 192.168.0.60(192.168.0.60:3306)? (yes/NO): yes
Wed Apr 23 00:27:41 2014 - [info] Checking whether 192.168.0.60(192.168.0.60:3306) is ok for the new master..
Wed Apr 23 00:27:41 2014 - [info]  ok.
Wed Apr 23 00:27:41 2014 - [info] ** Phase 1: Configuration Check Phase completed.
Wed Apr 23 00:27:41 2014 - [info] 
Wed Apr 23 00:27:41 2014 - [info] * Phase 2: Rejecting updates Phase..
Wed Apr 23 00:27:41 2014 - [info] 
Wed Apr 23 00:27:41 2014 - [info] Executing master ip online change script to disable write on the current master:
Wed Apr 23 00:27:41 2014 - [info]   /usr/local/bin/master_ip_online_change.pl --command=stop --orig_master_host=192.168.0.50 --orig_master_ip=192.168.0.50 --orig_master_port=3306 --new_master_host=192.168.0.60 --new_master_ip=192.168.0.60 --new_master_port=3306  
Wed Apr 23 00:27:41 2014 714804 Set read_only on the new master.. ok.
Wed Apr 23 00:27:41 2014 719969 Set read_only=1 on the orig master.. ok.
Disabling the VIP on old master: 192.168.0.50 
reverse mapping checking getaddrinfo for bogon [192.168.0.50] failed - POSSIBLE BREAK-IN ATTEMPT!
Wed Apr 23 00:27:51 2014 963762 Killing all application threads..
Wed Apr 23 00:27:51 2014 963869 done.
Wed Apr 23 00:27:51 2014 - [info]  ok.
Wed Apr 23 00:27:51 2014 - [info] Locking all tables on the orig master to reject updates from everybody (including root):
Wed Apr 23 00:27:51 2014 - [info] Executing FLUSH TABLES WITH READ LOCK..
Wed Apr 23 00:27:51 2014 - [info]  ok.
Wed Apr 23 00:27:51 2014 - [info] Orig master binlog:pos is mysql-bin.000028:112.
Wed Apr 23 00:27:51 2014 - [info]  Waiting to execute all relay logs on 192.168.0.60(192.168.0.60:3306)..
Wed Apr 23 00:27:51 2014 - [info]  master_pos_wait(mysql-bin.000028:112) completed on 192.168.0.60(192.168.0.60:3306). Executed 0 events.
Wed Apr 23 00:27:51 2014 - [info]   done.
Wed Apr 23 00:27:51 2014 - [info] Getting new master's binlog name and position..
Wed Apr 23 00:27:51 2014 - [info]  mysql-bin.000023:1550
Wed Apr 23 00:27:51 2014 - [info]  All other slaves should start replication from here. Statement should be: CHANGE MASTER TO MASTER_HOST='192.168.0.60', MASTER_PORT=3306, MASTER_LOG_FILE='mysql-bin.000023', MASTER_LOG_POS=1550, MASTER_USER='repl', MASTER_PASSWORD='xxx';
Wed Apr 23 00:27:51 2014 - [info] Executing master ip online change script to allow write on the new master:
Wed Apr 23 00:27:51 2014 - [info]   /usr/local/bin/master_ip_online_change.pl --command=start --orig_master_host=192.168.0.50 --orig_master_ip=192.168.0.50 --orig_master_port=3306 --new_master_host=192.168.0.60 --new_master_ip=192.168.0.60 --new_master_port=3306  
Wed Apr 23 00:27:52 2014 077334 Set read_only=0 on the new master.
Enabling the VIP - 192.168.0.88/24 on the new master - 192.168.0.60 
reverse mapping checking getaddrinfo for bogon [192.168.0.60] failed - POSSIBLE BREAK-IN ATTEMPT!
Wed Apr 23 00:28:02 2014 - [info]  ok.
Wed Apr 23 00:28:02 2014 - [info] 
Wed Apr 23 00:28:02 2014 - [info] * Switching slaves in parallel..
Wed Apr 23 00:28:02 2014 - [info] 
Wed Apr 23 00:28:02 2014 - [info] -- Slave switch on host 192.168.0.70(192.168.0.70:3306) started, pid: 3036
Wed Apr 23 00:28:02 2014 - [info] 
Wed Apr 23 00:28:02 2014 - [info] Log messages from 192.168.0.70 ...
Wed Apr 23 00:28:02 2014 - [info] 
Wed Apr 23 00:28:02 2014 - [info]  Waiting to execute all relay logs on 192.168.0.70(192.168.0.70:3306)..
Wed Apr 23 00:28:02 2014 - [info]  master_pos_wait(mysql-bin.000028:112) completed on 192.168.0.70(192.168.0.70:3306). Executed 0 events.
Wed Apr 23 00:28:02 2014 - [info]   done.
Wed Apr 23 00:28:02 2014 - [info]  Resetting slave 192.168.0.70(192.168.0.70:3306) and starting replication from the new master 192.168.0.60(192.168.0.60:3306)..
Wed Apr 23 00:28:02 2014 - [info]  Executed CHANGE MASTER.
Wed Apr 23 00:28:02 2014 - [info]  Slave started.
Wed Apr 23 00:28:02 2014 - [info] End of log messages from 192.168.0.70 ...
Wed Apr 23 00:28:02 2014 - [info] 
Wed Apr 23 00:28:02 2014 - [info] -- Slave switch on host 192.168.0.70(192.168.0.70:3306) succeeded.
Wed Apr 23 00:28:02 2014 - [info] Unlocking all tables on the orig master:
Wed Apr 23 00:28:02 2014 - [info] Executing UNLOCK TABLES..
Wed Apr 23 00:28:02 2014 - [info]  ok.
Wed Apr 23 00:28:02 2014 - [info] Starting orig master as a new slave..
Wed Apr 23 00:28:02 2014 - [info]  Resetting slave 192.168.0.50(192.168.0.50:3306) and starting replication from the new master 192.168.0.60(192.168.0.60:3306)..
Wed Apr 23 00:28:02 2014 - [info]  Executed CHANGE MASTER.
Wed Apr 23 00:28:02 2014 - [info]  Slave started.
Wed Apr 23 00:28:02 2014 - [info] All new slave servers switched successfully.
Wed Apr 23 00:28:02 2014 - [info] 
Wed Apr 23 00:28:02 2014 - [info] * Phase 5: New master cleanup phease..
Wed Apr 23 00:28:02 2014 - [info] 
Wed Apr 23 00:28:02 2014 - [info]  192.168.0.60: Resetting slave info succeeded.
Wed Apr 23 00:28:02 2014 - [info] Switching master to 192.168.0.60(192.168.0.60:3306) completed successfully.
複製代碼

其中參數的意思:

--orig_master_is_new_slave 切換時加上此參數是將原 master 變爲 slave 節點,如果不加此參數,原來的 master 將不啓動

--running_updates_limit=10000,故障切換時,候選master 如果有延遲的話, mha 切換不能成功,加上此參數表示延遲在此時間範圍內都可切換(單位爲s),但是切換的時間長短是由recover 時relay 日誌的大小決定 

注意:由於在線進行切換需要調用到master_ip_online_change這個腳本,但是由於該腳本不完整,需要自己進行相應的修改,我google到後發現還是有問題,腳本中new_master_password這個變量獲取不到,導致在線切換失敗,所以進行了相關的硬編碼,直接把mysql的root用戶密碼賦值給變量new_master_password,如果有哪位大牛知道原因,請指點指點。這個腳本還可以管理vip。下面貼出腳本:

複製代碼
#!/usr/bin/env perl

#  Copyright (C) 2011 DeNA Co.,Ltd.
#
#  This program is free software; you can redistribute it and/or modify
#  it under the terms of the GNU General Public License as published by
#  the Free Software Foundation; either version 2 of the License, or
#  (at your option) any later version.
#
#  This program is distributed in the hope that it will be useful,
#  but WITHOUT ANY WARRANTY; without even the implied warranty of
#  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
#  GNU General Public License for more details.
#
#  You should have received a copy of the GNU General Public License
#   along with this program; if not, write to the Free Software
#  Foundation, Inc.,
#  51 Franklin Street, Fifth Floor, Boston, MA  02110-1301  USA

## Note: This is a sample script and is not complete. Modify the script based on your environment.

use strict;
use warnings FATAL => 'all';

use Getopt::Long;
use MHA::DBHelper;
use MHA::NodeUtil;
use Time::HiRes qw( sleep gettimeofday tv_interval );
use Data::Dumper;

my $_tstart;
my $_running_interval = 0.1;
my (
  $command,          $orig_master_host, $orig_master_ip,
  $orig_master_port, $orig_master_user, 
  $new_master_host,  $new_master_ip,    $new_master_port,
  $new_master_user,  
);


my $vip = '192.168.0.88/24';  # Virtual IP 
my $key = "1"; 
my $ssh_start_vip = "/sbin/ifconfig eth1:$key $vip";
my $ssh_stop_vip = "/sbin/ifconfig eth1:$key down";
my $ssh_user = "root";
my $new_master_password='123456';
my $orig_master_password='123456';
GetOptions(
  'command=s'              => \$command,
  #'ssh_user=s'             => \$ssh_user,  
  'orig_master_host=s'     => \$orig_master_host,
  'orig_master_ip=s'       => \$orig_master_ip,
  'orig_master_port=i'     => \$orig_master_port,
  'orig_master_user=s'     => \$orig_master_user,
  #'orig_master_password=s' => \$orig_master_password,
  'new_master_host=s'      => \$new_master_host,
  'new_master_ip=s'        => \$new_master_ip,
  'new_master_port=i'      => \$new_master_port,
  'new_master_user=s'      => \$new_master_user,
  #'new_master_password=s'  => \$new_master_password,
);

exit &main();

sub current_time_us {
  my ( $sec, $microsec ) = gettimeofday();
  my $curdate = localtime($sec);
  return $curdate . " " . sprintf( "%06d", $microsec );
}

sub sleep_until {
  my $elapsed = tv_interval($_tstart);
  if ( $_running_interval > $elapsed ) {
    sleep( $_running_interval - $elapsed );
  }
}

sub get_threads_util {
  my $dbh                    = shift;
  my $my_connection_id       = shift;
  my $running_time_threshold = shift;
  my $type                   = shift;
  $running_time_threshold = 0 unless ($running_time_threshold);
  $type                   = 0 unless ($type);
  my @threads;

  my $sth = $dbh->prepare("SHOW PROCESSLIST");
  $sth->execute();

  while ( my $ref = $sth->fetchrow_hashref() ) {
    my $id         = $ref->{Id};
    my $user       = $ref->{User};
    my $host       = $ref->{Host};
    my $command    = $ref->{Command};
    my $state      = $ref->{State};
    my $query_time = $ref->{Time};
    my $info       = $ref->{Info};
    $info =~ s/^\s*(.*?)\s*$/$1/ if defined($info);
    next if ( $my_connection_id == $id );
    next if ( defined($query_time) && $query_time < $running_time_threshold );
    next if ( defined($command)    && $command eq "Binlog Dump" );
    next if ( defined($user)       && $user eq "system user" );
    next
      if ( defined($command)
      && $command eq "Sleep"
      && defined($query_time)
      && $query_time >= 1 );

    if ( $type >= 1 ) {
      next if ( defined($command) && $command eq "Sleep" );
      next if ( defined($command) && $command eq "Connect" );
    }

    if ( $type >= 2 ) {
      next if ( defined($info) && $info =~ m/^select/i );
      next if ( defined($info) && $info =~ m/^show/i );
    }

    push @threads, $ref;
  }
  return @threads;
}

sub main {
  if ( $command eq "stop" ) {
    ## Gracefully killing connections on the current master
    # 1. Set read_only= 1 on the new master
    # 2. DROP USER so that no app user can establish new connections
    # 3. Set read_only= 1 on the current master
    # 4. Kill current queries
    # * Any database access failure will result in script die.
    my $exit_code = 1;
    eval {
      ## Setting read_only=1 on the new master (to avoid accident)
      my $new_master_handler = new MHA::DBHelper();

      # args: hostname, port, user, password, raise_error(die_on_error)_or_not
      $new_master_handler->connect( $new_master_ip, $new_master_port,
        $new_master_user, $new_master_password, 1 );
      print current_time_us() . " Set read_only on the new master.. ";
      $new_master_handler->enable_read_only();
      if ( $new_master_handler->is_read_only() ) {
        print "ok.\n";
      }
      else {
        die "Failed!\n";
      }
      $new_master_handler->disconnect();

      # Connecting to the orig master, die if any database error happens
      my $orig_master_handler = new MHA::DBHelper();
      $orig_master_handler->connect( $orig_master_ip, $orig_master_port,
        $orig_master_user, $orig_master_password, 1 );

      ## Drop application user so that nobody can connect. Disabling per-session binlog beforehand
      #$orig_master_handler->disable_log_bin_local();
      #print current_time_us() . " Drpping app user on the orig master..\n";
      #FIXME_xxx_drop_app_user($orig_master_handler);

      ## Waiting for N * 100 milliseconds so that current connections can exit
      my $time_until_read_only = 15;
      $_tstart = [gettimeofday];
      my @threads = get_threads_util( $orig_master_handler->{dbh},
        $orig_master_handler->{connection_id} );
      while ( $time_until_read_only > 0 && $#threads >= 0 ) {
        if ( $time_until_read_only % 5 == 0 ) {
          printf
"%s Waiting all running %d threads are disconnected.. (max %d milliseconds)\n",
            current_time_us(), $#threads + 1, $time_until_read_only * 100;
          if ( $#threads < 5 ) {
            print Data::Dumper->new( [$_] )->Indent(0)->Terse(1)->Dump . "\n"
              foreach (@threads);
          }
        }
        sleep_until();
        $_tstart = [gettimeofday];
        $time_until_read_only--;
        @threads = get_threads_util( $orig_master_handler->{dbh},
          $orig_master_handler->{connection_id} );
      }

      ## Setting read_only=1 on the current master so that nobody(except SUPER) can write
      print current_time_us() . " Set read_only=1 on the orig master.. ";
      $orig_master_handler->enable_read_only();
      if ( $orig_master_handler->is_read_only() ) {
        print "ok.\n";
      }
      else {
        die "Failed!\n";
      }

      ## Waiting for M * 100 milliseconds so that current update queries can complete
      my $time_until_kill_threads = 5;
      @threads = get_threads_util( $orig_master_handler->{dbh},
        $orig_master_handler->{connection_id} );
      while ( $time_until_kill_threads > 0 && $#threads >= 0 ) {
        if ( $time_until_kill_threads % 5 == 0 ) {
          printf
"%s Waiting all running %d queries are disconnected.. (max %d milliseconds)\n",
            current_time_us(), $#threads + 1, $time_until_kill_threads * 100;
          if ( $#threads < 5 ) {
            print Data::Dumper->new( [$_] )->Indent(0)->Terse(1)->Dump . "\n"
              foreach (@threads);
          }
        }
        sleep_until();
        $_tstart = [gettimeofday];
        $time_until_kill_threads--;
        @threads = get_threads_util( $orig_master_handler->{dbh},
          $orig_master_handler->{connection_id} );
      }



                print "Disabling the VIP on old master: $orig_master_host \n";
                &stop_vip();     


      ## Terminating all threads
      print current_time_us() . " Killing all application threads..\n";
      $orig_master_handler->kill_threads(@threads) if ( $#threads >= 0 );
      print current_time_us() . " done.\n";
      #$orig_master_handler->enable_log_bin_local();
      $orig_master_handler->disconnect();

      ## After finishing the script, MHA executes FLUSH TABLES WITH READ LOCK
      $exit_code = 0;
    };
    if ($@) {
      warn "Got Error: $@\n";
      exit $exit_code;
    }
    exit $exit_code;
  }
  elsif ( $command eq "start" ) {
    ## Activating master ip on the new master
    # 1. Create app user with write privileges
    # 2. Moving backup script if needed
    # 3. Register new master's ip to the catalog database

# We don't return error even though activating updatable accounts/ip failed so that we don't interrupt slaves' recovery.
# If exit code is 0 or 10, MHA does not abort
    my $exit_code = 10;
    eval {
      my $new_master_handler = new MHA::DBHelper();

      # args: hostname, port, user, password, raise_error_or_not
      $new_master_handler->connect( $new_master_ip, $new_master_port,
        $new_master_user, $new_master_password, 1 );

      ## Set read_only=0 on the new master
      #$new_master_handler->disable_log_bin_local();
      print current_time_us() . " Set read_only=0 on the new master.\n";
      $new_master_handler->disable_read_only();

      ## Creating an app user on the new master
      #print current_time_us() . " Creating app user on the new master..\n";
      #FIXME_xxx_create_app_user($new_master_handler);
      #$new_master_handler->enable_log_bin_local();
      $new_master_handler->disconnect();

      ## Update master ip on the catalog database, etc
                print "Enabling the VIP - $vip on the new master - $new_master_host \n";
                &start_vip();
                $exit_code = 0;
    };
    if ($@) {
      warn "Got Error: $@\n";
      exit $exit_code;
    }
    exit $exit_code;
  }
  elsif ( $command eq "status" ) {

    # do nothing
    exit 0;
  }
  else {
    &usage();
    exit 1;
  }
}

# A simple system call that enable the VIP on the new master 
sub start_vip() {
    `ssh $ssh_user\@$new_master_host \" $ssh_start_vip \"`;
}
# A simple system call that disable the VIP on the old_master
sub stop_vip() {
    `ssh $ssh_user\@$orig_master_host \" $ssh_stop_vip \"`;
}

sub usage {
  print
"Usage: master_ip_online_change --command=start|stop|status --orig_master_host=host --orig_master_ip=ip --orig_master_port=port --new_master_host=host --new_master_ip=ip --new_master_port=port\n";
  die;
}
複製代碼

四.修復宕機的Master 

通常情況下自動切換以後,原master可能已經廢棄掉,待原master主機修復後,如果數據完整的情況下,可能想把原來master重新作爲新主庫的slave,這時我們可以藉助當時自動切換時刻的MHA日誌來完成對原master的修復。下面是提取相關日誌的命令:

[root@192.168.0.20 app1]# grep -i "All other slaves should start" manager.log 
Mon Apr 21 22:28:33 2014 - [info]  All other slaves should start replication from here. Statement should be: CHANGE MASTER TO MASTER_HOST='192.168.0.60', MASTER_PORT=3306, MASTER_LOG_FILE='mysql-bin.000022', MASTER_LOG_POS=506716, MASTER_USER='repl', MASTER_PASSWORD='xxx';
[root@192.168.0.20 app1]# 

獲取上述信息以後,就可以直接在修復後的master上執行change master to相關操作,重新作爲從庫了。

最後補充一下郵件發送腳本send_report ,這個腳本在詢問一位朋友後可以使用,如下:

複製代碼
#!/usr/bin/perl

#  Copyright (C) 2011 DeNA Co.,Ltd.
#
#  This program is free software; you can redistribute it and/or modify
#  it under the terms of the GNU General Public License as published by
#  the Free Software Foundation; either version 2 of the License, or
#  (at your option) any later version.
#
#  This program is distributed in the hope that it will be useful,
#  but WITHOUT ANY WARRANTY; without even the implied warranty of
#  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
#  GNU General Public License for more details.
#
#  You should have received a copy of the GNU General Public License
#   along with this program; if not, write to the Free Software
#  Foundation, Inc.,
#  51 Franklin Street, Fifth Floor, Boston, MA  02110-1301  USA

## Note: This is a sample script and is not complete. Modify the script based on your environment.

use strict;
use warnings FATAL => 'all';
use Mail::Sender;
use Getopt::Long;

#new_master_host and new_slave_hosts are set only when recovering master succeeded
my ( $dead_master_host, $new_master_host, $new_slave_hosts, $subject, $body );
my $smtp='smtp.163.com';
my $mail_from='xxxx';
my $mail_user='xxxxx';
my $mail_pass='xxxxx';
my $mail_to=['xxxx','xxxx'];
GetOptions(
  'orig_master_host=s' => \$dead_master_host,
  'new_master_host=s'  => \$new_master_host,
  'new_slave_hosts=s'  => \$new_slave_hosts,
  'subject=s'          => \$subject,
  'body=s'             => \$body,
);

mailToContacts($smtp,$mail_from,$mail_user,$mail_pass,$mail_to,$subject,$body);

sub mailToContacts {
    my ( $smtp, $mail_from, $user, $passwd, $mail_to, $subject, $msg ) = @_;
    open my $DEBUG, "> /tmp/monitormail.log"
        or die "Can't open the debug      file:$!\n";
    my $sender = new Mail::Sender {
        ctype       => 'text/plain; charset=utf-8',
        encoding    => 'utf-8',
        smtp        => $smtp,
        from        => $mail_from,
        auth        => 'LOGIN',
        TLS_allowed => '0',
        authid      => $user,
        authpwd     => $passwd,
        to          => $mail_to,
        subject     => $subject,
        debug       => $DEBUG
    };

    $sender->MailMsg(
        {   msg   => $msg,
            debug => $DEBUG
        }
    ) or print $Mail::Sender::Error;
    return 1;
}



# Do whatever you want here

exit 0;
複製代碼

最後切換以後發送告警的郵件示例,注意,這個是我後續的測試,和上面環境出現的ip不一致不要在意。

 

總結:

目前高可用方案可以一定程度上實現數據庫的高可用,比如前面文章介紹的MMMheartbeat+drbdCluster等。還有percona的Galera Cluster等。這些高可用軟件各有優劣。在進行高可用方案選擇時,主要是看業務還有對數據一致性方面的要求。

發表評論
所有評論
還沒有人評論,想成為第一個評論的人麼? 請在上方評論欄輸入並且點擊發布.
相關文章