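Preface
As the data store, MySQL is arguably the most important link in the whole architecture. Whatever the architecture or design, most systems still depend on a relational database, and if the database fails the whole system becomes unavailable. That is why high availability for MySQL matters so much. This article first explains two common MySQL high-availability architectures, MMM and MHA, at a conceptual level, and then builds a highly available MHA architecture step by step from scratch.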
Prerequisite reading
This article builds on the previous one, so please be sure to read that article first.
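High availability
There are two common high-availability architectures for MySQL master-slave replication: MMM and MHA. Making master-slave replication highly available requires the following capabilities:
- Monitor the master node of the replication cluster
- When the master node goes down, migrate the VIP (Virtual IP Address) to the new master node
- Reconfigure the other slave nodes in the cluster to replicate from the new master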
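The first of these capabilities, monitoring, can be sketched in shell. This is only a minimal illustration and not how MMM or MHA implement it: the function name master_is_dead and the PROBE_INTERVAL variable are made up for this example, and the probe is whatever command you pass in.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of MMM or MHA): declare the master dead only
# after several consecutive failed probes.
# Usage: master_is_dead ATTEMPTS PROBE_COMMAND [ARGS...]
# The probe must exit 0 while the master is healthy, for example:
#   mysqladmin -h 192.168.1.101 ping
master_is_dead() {
    local attempts=$1
    shift
    local i=1
    while [ "$i" -le "$attempts" ]; do
        if "$@" >/dev/null 2>&1; then
            return 1    # the master answered, so it is alive
        fi
        sleep "${PROBE_INTERVAL:-1}"
        i=$((i + 1))
    done
    return 0            # every probe failed
}
```

A caller could run master_is_dead 3 mysqladmin -h 192.168.1.101 ping and, only when it returns success, go on to move the VIP and repoint the slaves. Requiring several failed probes in a row avoids failing over on a single network hiccup.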
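MMM architecture
MMM (Master-Master replication manager for MySQL) is a set of scripts that supports dual-master failover and day-to-day management of dual-master setups. It is mainly used to monitor and manage Master-Master (dual-master) replication. Although it is called dual-master replication, only one master is active at any given moment. The other acts as a standby for the active master, which keeps the standby warmed up and speeds up a master-to-master switch. MMM provides automatic failover, and it can also load-balance reads across multiple slaves.
The overall MMM architecture is shown in the figure below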
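From the MMM architecture diagram we can see that:
- The architecture needs two master nodes that act as primary and standby for each other. At any given moment only one master serves requests
- There can be multiple slave nodes for read operations
- The master is assigned a write VIP, which can only move between the active and standby masters; each slave node is assigned a read VIP, which can move to any slave node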
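When the master goes down, the MMM management tool repoints all slave nodes to the standby master and migrates the write VIP to the standby server. The slave nodes then replicate from the new master. The whole procedure is simple and crude, so data consistency cannot be guaranteed.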
When a slave node goes down, the MMM management tool migrates its read VIP to another slave node, so a single slave node can hold more than one VIP.
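Drawbacks of the MMM architecture
- Failover is crude and can easily lose transactions (semi-synchronous replication can mitigate this)
- GTID-based replication is not supported (unless you modify the perl scripts yourself)
- The community is inactive and no new version has been released for a long time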
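MHA architecture
MHA (Master High Availability) is an open-source high-availability program for MySQL. When the MHA management tool detects that the master node has failed, it promotes the slave node that holds the most recent data from the master to be the new master, and makes the other slave nodes replicate from it. MHA also supports online master switching, that is, switching the master/slave roles on demand.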
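The MHA architecture is shown in the figure below
As the diagram shows, MHA only monitors the health of the master. When the master goes down, the MHA management tool picks, from all of the master's slaves, the node whose data is closest to the master's and promotes it to be the new master.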
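MHA failover
With MySQL master-slave replication managed by MHA, a master failure triggers the following failover procedure:
- Remove the VIP from the failed master and select the slave with the most recent data
- Try to save the binary logs from the failed master (this may succeed if only the MySQL instance has crashed)
- Apply the differential relay log events to the other slaves. The relay log of the candidate master may differ from the relay logs of the other slave nodes, so the candidate master's relay log events are applied to the other slave nodes
- Apply the binary logs saved from the master (if the second step succeeded)
- Promote the candidate master to be the new master
- Configure the other slaves to replicate from the new master, and migrate the write VIP to the new master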
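The slave selection in the first step can be illustrated with a small shell sketch. It assumes position-based replication and that you have already collected, for each slave, the Master_Log_File and Read_Master_Log_Pos values from SHOW SLAVE STATUS. The function name pick_latest_slave and the input format are made up for this example, and MHA's own selection also honours candidate_master and no_master, which this sketch ignores.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read lines of "host master_log_file read_master_log_pos"
# on stdin and print the host that has read furthest into the master's binlog.
pick_latest_slave() {
    # Binlog file names are zero-padded, so a plain string sort orders them;
    # positions within the same file are compared numerically.
    LC_ALL=C sort -k2,2 -k3,3n | tail -n 1 | awk '{print $1}'
}

# Example:
#   printf '%s\n' \
#     '192.168.1.102 mysql-bin.000003 2435' \
#     '192.168.1.103 mysql-bin.000003 1200' | pick_latest_slave
#   prints 192.168.1.102
```

Sorting first by file name and then by position mirrors how binlog coordinates are ordered. With GTID-based replication the comparison would be made on GTID sets instead.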
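Advantages of MHA
- Supports both GTID-based replication and binlog-position-based replication
- Can choose the most suitable new master from multiple slave nodes
- Tries to save as many unreplicated log events as possible from the old master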
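Limitations of MHA
- The unreplicated logs of the original master cannot always be retrieved
- You have to write the write-VIP migration script yourself
- Only the master is monitored; there is no high availability for the slaves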
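When to use MHA
- GTID-based replication is in use
- A one-master, multiple-slave replication topology is in use
- You want to lose as little data as possible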
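Setting up MHA
Setting up MHA is not complicated, but it involves quite a few steps, so read through the whole section first and then start the hands-on work.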
Configuring passwordless SSH between nodes
First, run the following on the master node (192.168.1.101) and simply press Enter at every prompt
ssh-keygen
The output looks like this
Generating public/private rsa key pair.
Enter file in which to save the key (/root/.ssh/id_rsa):
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /root/.ssh/id_rsa.
Your public key has been saved in /root/.ssh/id_rsa.pub.
The key fingerprint is:
SHA256:HFpSaM7IVW+TjQVUM0m1JBNgrnhH85O3wuur58sev1E [email protected]
The key's randomart image is:
+---[RSA 2048]----+
| oo.==X+o |
| +. + =.* . |
| . *. o X . . |
| o o* = + . |
| o S . + . E|
| . . . o o |
| + o |
| ..= . |
| .*Ooo. |
+----[SHA256]-----+
Copy the generated public key /root/.ssh/id_rsa.pub to all three nodes (including the node itself)
ssh-copy-id -i /root/.ssh/id_rsa.pub root@192.168.1.101
ssh-copy-id -i /root/.ssh/id_rsa.pub root@192.168.1.102
ssh-copy-id -i /root/.ssh/id_rsa.pub root@192.168.1.103
Once this is done, connecting from 192.168.1.101 to 102 and 103 with the ssh command no longer requires a password
ssh 192.168.1.102
Repeat the steps above on 192.168.1.102 and 192.168.1.103 as well
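Because the same key distribution has to be repeated on every node, it can be wrapped in a loop. The following is a minimal sketch, assuming the three node addresses used in this article and a key pair already generated with ssh-keygen. The function names and the DRY_RUN variable are made up for this example.

```shell
#!/usr/bin/env bash
# Node list from this article; adjust to your environment.
NODES="192.168.1.101 192.168.1.102 192.168.1.103"

# Hypothetical helper: print (DRY_RUN=1, the default) or run ssh-copy-id for every node.
copy_key_to_all_nodes() {
    local node
    for node in $NODES; do
        if [ "${DRY_RUN:-1}" = "1" ]; then
            echo "ssh-copy-id -i /root/.ssh/id_rsa.pub root@${node}"
        else
            ssh-copy-id -i /root/.ssh/id_rsa.pub "root@${node}"
        fi
    done
}

# Hypothetical helper: BatchMode makes ssh fail instead of prompting for a
# password, so a zero exit status means key-based login works.
check_passwordless() {
    ssh -o BatchMode=yes -o ConnectTimeout=5 "root@$1" true
}
```

Run copy_key_to_all_nodes first to review the commands, then DRY_RUN=0 copy_key_to_all_nodes to execute them. Afterwards check_passwordless 192.168.1.102 should return without asking for a password.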
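Installing the yum extension repository
- Download
wget https://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm
- Install
rpm -ivh epel-release-latest-7.noarch.rpm
- Modify
Only one setting needs to change. Open the repository file with vim /etc/yum.repos.d/epel.repo and edit gpgcheck under the [epel] section
[epel]
...
## Only the gpgcheck property under the [epel] section needs to change
gpgcheck=0
...
Repeat the steps above on every node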
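Since this edit has to be made on every node, it may be easier to script than to do in vim. The sketch below is one possible way, assuming the repo file has the usual INI layout with section headers such as [epel] at the start of a line. The function name disable_epel_gpgcheck is made up for this example.

```shell
#!/usr/bin/env bash
# Hypothetical helper: set gpgcheck=0 only inside the [epel] section of a repo file.
disable_epel_gpgcheck() {
    local file=${1:-/etc/yum.repos.d/epel.repo}
    local tmp
    tmp=$(mktemp) || return 1
    # The address range runs from the "[epel]" header to the next section
    # header, so gpgcheck lines in other sections are left alone.
    sed '/^\[epel\]/,/^\[/ s/^gpgcheck=.*/gpgcheck=0/' "$file" > "$tmp" && cat "$tmp" > "$file"
    local rc=$?
    rm -f "$tmp"
    return $rc
}
```

Calling disable_epel_gpgcheck with no argument edits /etc/yum.repos.d/epel.repo in place. Writing through a temporary file and copying the content back keeps the original file's permissions.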
Installing dependencies
Run the following command
yum -y install perl-DBD-MySQL ncftp perl-DBI
Repeat the steps above on every node
Installing the MHA tools
Download: https://download.csdn.net/download/Baisitao_/12505957
## Install mha-node
rpm -ivh mha4mysql-node-0.57-0.el7.noarch.rpm
Repeat the steps above on every node
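Installing the manager node
Strictly speaking, the monitoring tool should be installed on a dedicated node. To save a machine, it is installed on 192.168.1.103 here.
yum -y install perl-Config-Tiny.noarch perl-Time-HiRes.x86_64 perl-Parallel-ForkManager perl-Log-Dispatch
Once these dependencies are installed, mha-manager itself can be installed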
rpm -ivh mha4mysql-manager-0.57-0.el7.noarch.rpm
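Creating the mha directories
On the monitoring node (192.168.1.103), create the mha configuration directory
## Configuration directory
mkdir -p /etc/mha
On every node, create the mha working directory
## Working directory: when the master goes down, the master's bin log is copied into this directory on the slaves
mkdir -p /root/mha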
Creating the mha account
On the master node (192.168.1.101), create the account that mha needs and grant it privileges
## Create the user
create user dba_mha@'192.168.1.%' identified by 'your password';
## Grant privileges
grant all privileges on *.* to dba_mha@'192.168.1.%';
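Editing the configuration file
On the monitoring node (192.168.1.103), create and edit the configuration file
vim /etc/mha/mysql-mha.conf
Use the following content, adjusting it to your own environment (password, ip, directories and so on)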
[server default]
user=dba_mha
## Remember to change this to your own password
password=your password
manager_workdir=/root/mha
manager_log=/root/mha/manager.log
remote_workdir=/root/mha
ssh_user=root
repl_password=your password
ping_interval=1
master_binlog_dir=/home/mysql/sql_log
ssh_port=22
master_ip_failover_script=/usr/bin/master_ip_failover
secondary_check_script=/usr/bin/masterha_secondary_check -s 192.168.1.101 -s 192.168.1.102 -s 192.168.1.103
[server1]
hostname=192.168.1.101
candidate_master=1
[server2]
hostname=192.168.1.102
candidate_master=1
[server3]
hostname=192.168.1.103
## This node is also the monitoring node, so it is excluded from the master candidates
no_master=1
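As the configuration file shows, the master_ip_failover_script parameter names the script, /usr/bin/master_ip_failover, that moves the write VIP when the master fails. That script has to be provided as well, so create and edit it
vim /usr/bin/master_ip_failover
Use the following content, adjusting it to your own environment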
#!/usr/bin/env perl
use strict;
use warnings FATAL => 'all';
use Getopt::Long;
my (
$command, $orig_master_host, $orig_master_ip,$ssh_user,
$orig_master_port, $new_master_host, $new_master_ip,$new_master_port,
$orig_master_ssh_port,$new_master_ssh_port,$new_master_user,$new_master_password
);
my $vip = '192.168.1.88/24';
my $key = '1';
my $ssh_start_vip = "/sbin/ifconfig ens33:$key $vip";
my $ssh_stop_vip = "/sbin/ifconfig ens33:$key down";
my $ssh_Bcast_arp= "/sbin/arping -I ens33 -c 3 -A 192.168.1.88";
GetOptions(
'command=s' => \$command,
'ssh_user=s' => \$ssh_user,
'orig_master_host=s' => \$orig_master_host,
'orig_master_ip=s' => \$orig_master_ip,
'orig_master_port=i' => \$orig_master_port,
'orig_master_ssh_port=i' => \$orig_master_ssh_port,
'new_master_host=s' => \$new_master_host,
'new_master_ip=s' => \$new_master_ip,
'new_master_port=i' => \$new_master_port,
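# "=i" / "=s" declare that these options take a value; without them Getopt::Long
# reports "Option new_master_user does not take an argument" during failover
'new_master_ssh_port=i' => \$new_master_ssh_port,
'new_master_user=s' => \$new_master_user,
'new_master_password=s' => \$new_master_password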
);
exit &main();
sub main {
$ssh_user = defined $ssh_user ? $ssh_user : 'root';
print "\n\nIN SCRIPT TEST====$ssh_user|$ssh_stop_vip==$ssh_user|$ssh_start_vip===\n\n";
if ( $command eq "stop" || $command eq "stopssh" ) {
my $exit_code = 1;
eval {
print "Disabling the VIP on old master: $orig_master_host \n";
&stop_vip();
$exit_code = 0;
};
if ($@) {
warn "Got Error: $@\n";
exit $exit_code;
}
exit $exit_code;
}
elsif ( $command eq "start" ) {
my $exit_code = 10;
eval {
print "Enabling the VIP - $vip on the new master - $new_master_host \n";
&start_vip();
&start_arp();
$exit_code = 0;
};
if ($@) {
warn $@;
exit $exit_code;
}
exit $exit_code;
}
elsif ( $command eq "status" ) {
print "Checking the Status of the script.. OK \n";
exit 0;
}
else {
&usage();
exit 1;
}
}
sub start_vip() {
`ssh $ssh_user\@$new_master_host \" $ssh_start_vip \"`;
}
sub stop_vip() {
`ssh $ssh_user\@$orig_master_host \" $ssh_stop_vip \"`;
}
sub start_arp() {
`ssh $ssh_user\@$new_master_host \" $ssh_Bcast_arp \"`;
}
sub usage {
print
"Usage: master_ip_failover --command=start|stop|stopssh|status --ssh_user=user --orig_master_host=host --orig_master_ip=ip --orig_master_port=port --new_master_host=host --new_master_ip=ip --new_master_port=port\n";
}
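The part of the script that needs attention is the block of variable definitions near the top (starting at line 14 in the original post).
vip is the write virtual IP, not the IP of the master node itself. ens33 is the name of the network interface, which you can look up with ifconfig.
This script moves the write VIP automatically when the master fails.
After editing the script, make it executable
chmod +x /usr/bin/master_ip_failover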
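Checking the configuration
There is a lot of configuration and it is hard to be sure every item is right, so validate it first. Run the following on the monitoring node (192.168.1.103)
- Check the SSH configuration
masterha_check_ssh --conf=/etc/mha/mysql-mha.conf
Output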
Tue Jun 9 22:11:11 2020 - [warning] Global configuration file /etc/masterha_default.cnf not found. Skipping.
Tue Jun 9 22:11:11 2020 - [info] Reading application default configuration from /etc/mha/mysql-mha.conf..
Tue Jun 9 22:11:11 2020 - [info] Reading server configuration from /etc/mha/mysql-mha.conf..
Tue Jun 9 22:11:11 2020 - [info] Starting SSH connection tests..
Tue Jun 9 22:11:16 2020 - [debug]
Tue Jun 9 22:11:12 2020 - [debug] Connecting via SSH from root@192.168.1.103(192.168.1.103:22) to root@192.168.1.101(192.168.1.101:22)..
Tue Jun 9 22:11:14 2020 - [debug] ok.
Tue Jun 9 22:11:14 2020 - [debug] Connecting via SSH from root@192.168.1.103(192.168.1.103:22) to root@192.168.1.102(192.168.1.102:22)..
Tue Jun 9 22:11:15 2020 - [debug] ok.
Tue Jun 9 22:11:19 2020 - [debug]
Tue Jun 9 22:11:11 2020 - [debug] Connecting via SSH from root@192.168.1.101(192.168.1.101:22) to root@192.168.1.102(192.168.1.102:22)..
Tue Jun 9 22:11:17 2020 - [debug] ok.
Tue Jun 9 22:11:17 2020 - [debug] Connecting via SSH from root@192.168.1.101(192.168.1.101:22) to root@192.168.1.103(192.168.1.103:22)..
Tue Jun 9 22:11:18 2020 - [debug] ok.
Tue Jun 9 22:11:25 2020 - [debug]
Tue Jun 9 22:11:12 2020 - [debug] Connecting via SSH from root@192.168.1.102(192.168.1.102:22) to root@192.168.1.101(192.168.1.101:22)..
Tue Jun 9 22:11:13 2020 - [debug] ok.
Tue Jun 9 22:11:13 2020 - [debug] Connecting via SSH from root@192.168.1.102(192.168.1.102:22) to root@192.168.1.103(192.168.1.103:22)..
Tue Jun 9 22:11:24 2020 - [debug] ok.
Tue Jun 9 22:11:25 2020 - [info] All SSH connection tests passed successfully.
The log shows that the SSH configuration is correct
- Check the master-slave replication configuration
masterha_check_repl --conf=/etc/mha/mysql-mha.conf
Output
Tue Jun 9 22:22:43 2020 - [warning] Global configuration file /etc/masterha_default.cnf not found. Skipping.
Tue Jun 9 22:22:43 2020 - [info] Reading application default configuration from /etc/mha/mysql-mha.conf..
Tue Jun 9 22:22:43 2020 - [info] Reading server configuration from /etc/mha/mysql-mha.conf..
Tue Jun 9 22:22:43 2020 - [info] MHA::MasterMonitor version 0.57.
Tue Jun 9 22:22:45 2020 - [info] GTID failover mode = 1
Tue Jun 9 22:22:45 2020 - [info] Dead Servers:
Tue Jun 9 22:22:45 2020 - [info] Alive Servers:
Tue Jun 9 22:22:45 2020 - [info] 192.168.1.101(192.168.1.101:3306)
Tue Jun 9 22:22:45 2020 - [info] 192.168.1.102(192.168.1.102:3306)
Tue Jun 9 22:22:45 2020 - [info] 192.168.1.103(192.168.1.103:3306)
Tue Jun 9 22:22:45 2020 - [info] Alive Slaves:
Tue Jun 9 22:22:45 2020 - [info] 192.168.1.102(192.168.1.102:3306) Version=5.7.30-log (oldest major version between slaves) log-bin:enabled
Tue Jun 9 22:22:45 2020 - [info] GTID ON
Tue Jun 9 22:22:45 2020 - [info] Replicating from 192.168.1.101(192.168.1.101:3306)
Tue Jun 9 22:22:45 2020 - [info] Primary candidate for the new Master (candidate_master is set)
Tue Jun 9 22:22:45 2020 - [info] 192.168.1.103(192.168.1.103:3306) Version=5.7.30-log (oldest major version between slaves) log-bin:enabled
Tue Jun 9 22:22:45 2020 - [info] GTID ON
Tue Jun 9 22:22:45 2020 - [info] Replicating from 192.168.1.101(192.168.1.101:3306)
Tue Jun 9 22:22:45 2020 - [info] Not candidate for the new Master (no_master is set)
Tue Jun 9 22:22:45 2020 - [info] Current Alive Master: 192.168.1.101(192.168.1.101:3306)
Tue Jun 9 22:22:45 2020 - [info] Checking slave configurations..
Tue Jun 9 22:22:45 2020 - [info] read_only=1 is not set on slave 192.168.1.102(192.168.1.102:3306).
Tue Jun 9 22:22:45 2020 - [info] read_only=1 is not set on slave 192.168.1.103(192.168.1.103:3306).
Tue Jun 9 22:22:45 2020 - [info] Checking replication filtering settings..
Tue Jun 9 22:22:45 2020 - [info] binlog_do_db= , binlog_ignore_db=
Tue Jun 9 22:22:45 2020 - [info] Replication filtering check ok.
Tue Jun 9 22:22:45 2020 - [info] GTID (with auto-pos) is supported. Skipping all SSH and Node package checking.
Tue Jun 9 22:22:45 2020 - [info] Checking SSH publickey authentication settings on the current master..
Tue Jun 9 22:22:50 2020 - [warning] HealthCheck: Got timeout on checking SSH connection to 192.168.1.101! at /usr/share/perl5/vendor_perl/MHA/HealthCheck.pm line 342.
Tue Jun 9 22:22:50 2020 - [info]
192.168.1.101(192.168.1.101:3306) (current master)
+--192.168.1.102(192.168.1.102:3306)
+--192.168.1.103(192.168.1.103:3306)
Tue Jun 9 22:22:50 2020 - [info] Checking replication health on 192.168.1.102..
Tue Jun 9 22:22:50 2020 - [info] ok.
Tue Jun 9 22:22:50 2020 - [info] Checking replication health on 192.168.1.103..
Tue Jun 9 22:22:50 2020 - [info] ok.
Tue Jun 9 22:22:50 2020 - [info] Checking master_ip_failover_script status:
Tue Jun 9 22:22:50 2020 - [info] /usr/bin/master_ip_failover --command=status --ssh_user=root --orig_master_host=192.168.1.101 --orig_master_ip=192.168.1.101 --orig_master_port=3306
IN SCRIPT TEST====root|/sbin/ifconfig ens33:1 down==root|/sbin/ifconfig ens33:1 192.168.1.88/24===
Checking the Status of the script.. OK
Tue Jun 9 22:22:50 2020 - [info] OK.
Tue Jun 9 22:22:50 2020 - [warning] shutdown_script is not defined.
Tue Jun 9 22:22:50 2020 - [info] Got exit code 0 (Not master dead).
MySQL Replication Health is OK.
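The output shows that the master-slave replication configuration is also correct.
The other available check tools can be listed with the ll /usr/bin/ |grep master command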
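Configuring the VIP on the master for the first time
The MHA tool only moves the VIP during a failover, so before MHA is started for the first time a write VIP has to be assigned to the master node (192.168.1.101) by hand. Run the following command on the master node (192.168.1.101), adjusting the parameters to your environment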
/sbin/ifconfig ens33:1 192.168.1.88/24
ens33 is the name of the network interface and 192.168.1.88 is the write VIP. Both were already specified in the master_ip_failover script.
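On systems where ifconfig (net-tools) is not installed, the iproute2 ip command can assign the same alias address. The sketch below only builds the command strings. The function names are made up for this example, and the VIP, interface and alias key are the ones used in this article.

```shell
#!/usr/bin/env bash
# Hypothetical helpers that print iproute2 commands roughly equivalent to the
# ifconfig calls used in this article. They only print; run the printed
# command as root to apply it.
vip_add_cmd() {   # vip_add_cmd VIP/PREFIX INTERFACE KEY
    echo "ip addr add $1 dev $2 label $2:$3"
}
vip_del_cmd() {   # vip_del_cmd VIP/PREFIX INTERFACE
    echo "ip addr del $1 dev $2"
}
```

For example, vip_add_cmd 192.168.1.88/24 ens33 1 prints ip addr add 192.168.1.88/24 dev ens33 label ens33:1. The label keeps the address visible as ens33:1 in ifconfig output, which matches what the failover script expects to bring down.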
Before the write VIP is configured, ifconfig prints the following
ens33: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet 192.168.1.101 netmask 255.255.255.0 broadcast 192.168.1.255
inet6 fe80::bce6:1d30:472c:d811 prefixlen 64 scopeid 0x20<link>
inet6 2409:8a4c:a13:3f30:9d96:8b33:ca89:c62c prefixlen 64 scopeid 0x0<global>
ether 00:0c:29:28:70:7c txqueuelen 1000 (Ethernet)
RX packets 979338 bytes 460658144 (439.3 MiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 693198 bytes 278374776 (265.4 MiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
lo: flags=73<UP,LOOPBACK,RUNNING> mtu 65536
inet 127.0.0.1 netmask 255.0.0.0
inet6 ::1 prefixlen 128 scopeid 0x10<host>
loop txqueuelen 1000 (Local Loopback)
RX packets 208973 bytes 18422224 (17.5 MiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 208973 bytes 18422224 (17.5 MiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
After the write VIP is configured, ifconfig prints the following
ens33: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet 192.168.1.101 netmask 255.255.255.0 broadcast 192.168.1.255
inet6 fe80::bce6:1d30:472c:d811 prefixlen 64 scopeid 0x20<link>
inet6 2409:8a4c:a13:3f30:9d96:8b33:ca89:c62c prefixlen 64 scopeid 0x0<global>
ether 00:0c:29:28:70:7c txqueuelen 1000 (Ethernet)
RX packets 1040146 bytes 477864466 (455.7 MiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 733075 bytes 299370855 (285.5 MiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
ens33:1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet 192.168.1.88 netmask 255.255.255.0 broadcast 192.168.1.255
ether 00:0c:29:28:70:7c txqueuelen 1000 (Ethernet)
lo: flags=73<UP,LOOPBACK,RUNNING> mtu 65536
inet 127.0.0.1 netmask 255.0.0.0
inet6 ::1 prefixlen 128 scopeid 0x10<host>
loop txqueuelen 1000 (Local Loopback)
RX packets 222701 bytes 19630288 (18.7 MiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 222701 bytes 19630288 (18.7 MiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
Starting MHA
On the monitoring node (192.168.1.103), run the following command (it runs in the foreground by default)
masterha_manager --conf=/etc/mha/mysql-mha.conf
It prints the following log after starting
Tue Jun 9 22:38:05 2020 - [warning] Global configuration file /etc/masterha_default.cnf not found. Skipping.
Tue Jun 9 22:38:05 2020 - [info] Reading application default configuration from /etc/mha/mysql-mha.conf..
Tue Jun 9 22:38:05 2020 - [info] Reading server configuration from /etc/mha/mysql-mha.conf..
This shows that MHA has started successfully.
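Because masterha_manager stays in the foreground, it is often started in the background instead. The following is a minimal sketch built on the masterha_manager, masterha_check_status and masterha_stop commands that ship with the manager package. The wrapper function names, the RUN variable and the /root/mha/manager.out path are assumptions made for this example.

```shell
#!/usr/bin/env bash
MHA_CONF=${MHA_CONF:-/etc/mha/mysql-mha.conf}

# Set RUN=echo to print the commands instead of executing them.
mha_start() {
    if [ -n "${RUN:-}" ]; then
        $RUN nohup masterha_manager --conf="$MHA_CONF"
        return
    fi
    # Detach from the terminal so the manager survives logout.
    nohup masterha_manager --conf="$MHA_CONF" < /dev/null >> /root/mha/manager.out 2>&1 &
}
mha_status() { ${RUN:-} masterha_check_status --conf="$MHA_CONF"; }
mha_stop()   { ${RUN:-} masterha_stop --conf="$MHA_CONF"; }
```

After mha_start, calling mha_status reports whether the manager is running and which master it is monitoring. Keeping the configuration path in one variable avoids mistyping it across the three commands.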
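In addition, the /root/mha directory contains two related files, manager.log and mysql-mha.master_status.health, which record the MHA log and the health status of the master node respectively.
At this point the MHA architecture is fully set up.
Because the master node's VIP is 192.168.1.88, write operations only need to connect to this VIP. If you cannot connect, enable remote access in MySQL.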
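Failover log
Once MHA is set up, the cluster is highly available in theory: as soon as the master goes down, a slave is promoted to be the new master. Theory is one thing, though, so let's try it in practice.
The log below was printed by the MHA monitoring tool after the master went down (only the MySQL service was stopped).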
Thu Jun 11 20:59:53 2020 - [warning] Got error on MySQL select ping: 2013 (Lost connection to MySQL server during query)
Thu Jun 11 20:59:53 2020 - [info] Executing secondary network check script: /usr/bin/masterha_secondary_check -s 192.168.1.101 -s 192.168.1.102 -s 192.168.1.103 --user=root --master_host=192.168.1.101 --master_ip=192.168.1.101 --master_port=3306 --master_user=dba_mha --master_password=Ppnn13y,dkst2yc. --ping_type=SELECT
Thu Jun 11 20:59:53 2020 - [info] Executing SSH check script: exit 0
Thu Jun 11 20:59:54 2020 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '192.168.1.101' (111))
Thu Jun 11 20:59:54 2020 - [warning] Connection failed 2 time(s)..
Thu Jun 11 20:59:54 2020 - [info] HealthCheck: SSH to 192.168.1.101 is reachable.
Monitoring server 192.168.1.101 is reachable, Master is not reachable from 192.168.1.101. OK.
Thu Jun 11 20:59:55 2020 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '192.168.1.101' (111))
Thu Jun 11 20:59:55 2020 - [warning] Connection failed 3 time(s)..
Monitoring server 192.168.1.102 is reachable, Master is not reachable from 192.168.1.102. OK.
Thu Jun 11 20:59:56 2020 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '192.168.1.101' (111))
Thu Jun 11 20:59:56 2020 - [warning] Connection failed 4 time(s)..
Monitoring server 192.168.1.103 is reachable, Master is not reachable from 192.168.1.103. OK.
Thu Jun 11 20:59:56 2020 - [info] Master is not reachable from all other monitoring servers. Failover should start.
Thu Jun 11 20:59:56 2020 - [warning] Master is not reachable from health checker!
Thu Jun 11 20:59:56 2020 - [warning] Master 192.168.1.101(192.168.1.101:3306) is not reachable!
Thu Jun 11 20:59:56 2020 - [warning] SSH is reachable.
Thu Jun 11 20:59:56 2020 - [info] Connecting to a master server failed. Reading configuration file /etc/masterha_default.cnf and /etc/mha/mysql-mha.conf again, and trying to connect to all servers to check server status..
Thu Jun 11 20:59:56 2020 - [warning] Global configuration file /etc/masterha_default.cnf not found. Skipping.
Thu Jun 11 20:59:56 2020 - [info] Reading application default configuration from /etc/mha/mysql-mha.conf..
Thu Jun 11 20:59:56 2020 - [info] Reading server configuration from /etc/mha/mysql-mha.conf..
Thu Jun 11 20:59:57 2020 - [info] GTID failover mode = 1
Thu Jun 11 20:59:57 2020 - [info] Dead Servers:
Thu Jun 11 20:59:57 2020 - [info] 192.168.1.101(192.168.1.101:3306)
Thu Jun 11 20:59:57 2020 - [info] Alive Servers:
Thu Jun 11 20:59:57 2020 - [info] 192.168.1.102(192.168.1.102:3306)
Thu Jun 11 20:59:57 2020 - [info] 192.168.1.103(192.168.1.103:3306)
Thu Jun 11 20:59:57 2020 - [info] Alive Slaves:
Thu Jun 11 20:59:57 2020 - [info] 192.168.1.102(192.168.1.102:3306) Version=5.7.30-log (oldest major version between slaves) log-bin:enabled
Thu Jun 11 20:59:57 2020 - [info] GTID ON
Thu Jun 11 20:59:57 2020 - [info] Replicating from 192.168.1.101(192.168.1.101:3306)
Thu Jun 11 20:59:57 2020 - [info] Primary candidate for the new Master (candidate_master is set)
Thu Jun 11 20:59:57 2020 - [info] 192.168.1.103(192.168.1.103:3306) Version=5.7.30-log (oldest major version between slaves) log-bin:enabled
Thu Jun 11 20:59:57 2020 - [info] GTID ON
Thu Jun 11 20:59:57 2020 - [info] Replicating from 192.168.1.101(192.168.1.101:3306)
Thu Jun 11 20:59:57 2020 - [info] Not candidate for the new Master (no_master is set)
Thu Jun 11 20:59:57 2020 - [info] Checking slave configurations..
Thu Jun 11 20:59:57 2020 - [info] read_only=1 is not set on slave 192.168.1.102(192.168.1.102:3306).
Thu Jun 11 20:59:57 2020 - [info] read_only=1 is not set on slave 192.168.1.103(192.168.1.103:3306).
Thu Jun 11 20:59:57 2020 - [info] Checking replication filtering settings..
Thu Jun 11 20:59:57 2020 - [info] Replication filtering check ok.
Thu Jun 11 20:59:57 2020 - [info] Master is down!
Thu Jun 11 20:59:57 2020 - [info] Terminating monitoring script.
Thu Jun 11 20:59:57 2020 - [info] Got exit code 20 (Master dead).
Thu Jun 11 20:59:57 2020 - [info] MHA::MasterFailover version 0.57.
Thu Jun 11 20:59:57 2020 - [info] Starting master failover.
Thu Jun 11 20:59:57 2020 - [info]
Thu Jun 11 20:59:57 2020 - [info] * Phase 1: Configuration Check Phase..
Thu Jun 11 20:59:57 2020 - [info]
Thu Jun 11 20:59:59 2020 - [info] GTID failover mode = 1
Thu Jun 11 20:59:59 2020 - [info] Dead Servers:
Thu Jun 11 20:59:59 2020 - [info] 192.168.1.101(192.168.1.101:3306)
Thu Jun 11 20:59:59 2020 - [info] Checking master reachability via MySQL(double check)...
Thu Jun 11 20:59:59 2020 - [info] ok.
Thu Jun 11 20:59:59 2020 - [info] Alive Servers:
Thu Jun 11 20:59:59 2020 - [info] 192.168.1.102(192.168.1.102:3306)
Thu Jun 11 20:59:59 2020 - [info] 192.168.1.103(192.168.1.103:3306)
Thu Jun 11 20:59:59 2020 - [info] Alive Slaves:
Thu Jun 11 20:59:59 2020 - [info] 192.168.1.102(192.168.1.102:3306) Version=5.7.30-log (oldest major version between slaves) log-bin:enabled
Thu Jun 11 20:59:59 2020 - [info] GTID ON
Thu Jun 11 20:59:59 2020 - [info] Replicating from 192.168.1.101(192.168.1.101:3306)
Thu Jun 11 20:59:59 2020 - [info] Primary candidate for the new Master (candidate_master is set)
Thu Jun 11 20:59:59 2020 - [info] 192.168.1.103(192.168.1.103:3306) Version=5.7.30-log (oldest major version between slaves) log-bin:enabled
Thu Jun 11 20:59:59 2020 - [info] GTID ON
Thu Jun 11 20:59:59 2020 - [info] Replicating from 192.168.1.101(192.168.1.101:3306)
Thu Jun 11 20:59:59 2020 - [info] Not candidate for the new Master (no_master is set)
Thu Jun 11 20:59:59 2020 - [info] Starting GTID based failover.
Thu Jun 11 20:59:59 2020 - [info]
Thu Jun 11 20:59:59 2020 - [info] ** Phase 1: Configuration Check Phase completed.
Thu Jun 11 20:59:59 2020 - [info]
Thu Jun 11 20:59:59 2020 - [info] * Phase 2: Dead Master Shutdown Phase..
Thu Jun 11 20:59:59 2020 - [info]
Thu Jun 11 20:59:59 2020 - [info] Forcing shutdown so that applications never connect to the current master..
Thu Jun 11 20:59:59 2020 - [info] Executing master IP deactivation script:
Thu Jun 11 20:59:59 2020 - [info] /usr/bin/master_ip_failover --orig_master_host=192.168.1.101 --orig_master_ip=192.168.1.101 --orig_master_port=3306 --command=stopssh --ssh_user=root
IN SCRIPT TEST====root|/sbin/ifconfig ens33:1 down==root|/sbin/ifconfig ens33:1 192.168.1.88/24===
Disabling the VIP on old master: 192.168.1.101
Thu Jun 11 20:59:59 2020 - [info] done.
Thu Jun 11 20:59:59 2020 - [warning] shutdown_script is not set. Skipping explicit shutting down of the dead master.
Thu Jun 11 20:59:59 2020 - [info] * Phase 2: Dead Master Shutdown Phase completed.
Thu Jun 11 20:59:59 2020 - [info]
Thu Jun 11 20:59:59 2020 - [info] * Phase 3: Master Recovery Phase..
Thu Jun 11 20:59:59 2020 - [info]
Thu Jun 11 20:59:59 2020 - [info] * Phase 3.1: Getting Latest Slaves Phase..
Thu Jun 11 20:59:59 2020 - [info]
Thu Jun 11 20:59:59 2020 - [info] The latest binary log file/position on all slaves is mysql-bin.000003:2435
Thu Jun 11 20:59:59 2020 - [info] Retrieved Gtid Set: 81502f9e-a592-11ea-b912-000c2928707c:12-16
Thu Jun 11 20:59:59 2020 - [info] Latest slaves (Slaves that received relay log files to the latest):
Thu Jun 11 20:59:59 2020 - [info] 192.168.1.102(192.168.1.102:3306) Version=5.7.30-log (oldest major version between slaves) log-bin:enabled
Thu Jun 11 20:59:59 2020 - [info] GTID ON
Thu Jun 11 20:59:59 2020 - [info] Replicating from 192.168.1.101(192.168.1.101:3306)
Thu Jun 11 20:59:59 2020 - [info] Primary candidate for the new Master (candidate_master is set)
Thu Jun 11 20:59:59 2020 - [info] 192.168.1.103(192.168.1.103:3306) Version=5.7.30-log (oldest major version between slaves) log-bin:enabled
Thu Jun 11 20:59:59 2020 - [info] GTID ON
Thu Jun 11 20:59:59 2020 - [info] Replicating from 192.168.1.101(192.168.1.101:3306)
Thu Jun 11 20:59:59 2020 - [info] Not candidate for the new Master (no_master is set)
Thu Jun 11 20:59:59 2020 - [info] The oldest binary log file/position on all slaves is mysql-bin.000003:2435
Thu Jun 11 20:59:59 2020 - [info] Retrieved Gtid Set: 81502f9e-a592-11ea-b912-000c2928707c:12-16
Thu Jun 11 20:59:59 2020 - [info] Oldest slaves:
Thu Jun 11 20:59:59 2020 - [info] 192.168.1.102(192.168.1.102:3306) Version=5.7.30-log (oldest major version between slaves) log-bin:enabled
Thu Jun 11 20:59:59 2020 - [info] GTID ON
Thu Jun 11 20:59:59 2020 - [info] Replicating from 192.168.1.101(192.168.1.101:3306)
Thu Jun 11 20:59:59 2020 - [info] Primary candidate for the new Master (candidate_master is set)
Thu Jun 11 20:59:59 2020 - [info] 192.168.1.103(192.168.1.103:3306) Version=5.7.30-log (oldest major version between slaves) log-bin:enabled
Thu Jun 11 20:59:59 2020 - [info] GTID ON
Thu Jun 11 20:59:59 2020 - [info] Replicating from 192.168.1.101(192.168.1.101:3306)
Thu Jun 11 20:59:59 2020 - [info] Not candidate for the new Master (no_master is set)
Thu Jun 11 20:59:59 2020 - [info]
Thu Jun 11 20:59:59 2020 - [info] * Phase 3.3: Determining New Master Phase..
Thu Jun 11 20:59:59 2020 - [info]
Thu Jun 11 20:59:59 2020 - [info] Searching new master from slaves..
Thu Jun 11 20:59:59 2020 - [info] Candidate masters from the configuration file:
Thu Jun 11 20:59:59 2020 - [info] 192.168.1.102(192.168.1.102:3306) Version=5.7.30-log (oldest major version between slaves) log-bin:enabled
Thu Jun 11 20:59:59 2020 - [info] GTID ON
Thu Jun 11 20:59:59 2020 - [info] Replicating from 192.168.1.101(192.168.1.101:3306)
Thu Jun 11 20:59:59 2020 - [info] Primary candidate for the new Master (candidate_master is set)
Thu Jun 11 20:59:59 2020 - [info] Non-candidate masters:
Thu Jun 11 20:59:59 2020 - [info] 192.168.1.103(192.168.1.103:3306) Version=5.7.30-log (oldest major version between slaves) log-bin:enabled
Thu Jun 11 20:59:59 2020 - [info] GTID ON
Thu Jun 11 20:59:59 2020 - [info] Replicating from 192.168.1.101(192.168.1.101:3306)
Thu Jun 11 20:59:59 2020 - [info] Not candidate for the new Master (no_master is set)
Thu Jun 11 20:59:59 2020 - [info] Searching from candidate_master slaves which have received the latest relay log events..
Thu Jun 11 20:59:59 2020 - [info] New master is 192.168.1.102(192.168.1.102:3306)
Thu Jun 11 20:59:59 2020 - [info] Starting master failover..
Thu Jun 11 20:59:59 2020 - [info]
From:
192.168.1.101(192.168.1.101:3306) (current master)
+--192.168.1.102(192.168.1.102:3306)
+--192.168.1.103(192.168.1.103:3306)
To:
192.168.1.102(192.168.1.102:3306) (new master)
+--192.168.1.103(192.168.1.103:3306)
Thu Jun 11 20:59:59 2020 - [info]
Thu Jun 11 20:59:59 2020 - [info] * Phase 3.3: New Master Recovery Phase..
Thu Jun 11 20:59:59 2020 - [info]
Thu Jun 11 20:59:59 2020 - [info] Waiting all logs to be applied..
Thu Jun 11 20:59:59 2020 - [info] done.
Thu Jun 11 20:59:59 2020 - [info] Getting new master's binlog name and position..
Thu Jun 11 20:59:59 2020 - [info] mysql-bin.000002:463
Thu Jun 11 20:59:59 2020 - [info] All other slaves should start replication from here. Statement should be: CHANGE MASTER TO MASTER_HOST='192.168.1.102', MASTER_PORT=3306, MASTER_AUTO_POSITION=1, MASTER_USER='repl', MASTER_PASSWORD='xxx';
Thu Jun 11 20:59:59 2020 - [info] Master Recovery succeeded. File:Pos:Exec_Gtid_Set: mysql-bin.000002, 463, 1dbd5375-a4d9-11ea-9eef-000c29cf4cca:1,
81502f9e-a592-11ea-b912-000c2928707c:1-16
Thu Jun 11 20:59:59 2020 - [info] Executing master IP activate script:
Thu Jun 11 20:59:59 2020 - [info] /usr/bin/master_ip_failover --command=start --ssh_user=root --orig_master_host=192.168.1.101 --orig_master_ip=192.168.1.101 --orig_master_port=3306 --new_master_host=192.168.1.102 --new_master_ip=192.168.1.102 --new_master_port=3306 --new_master_user='dba_mha' --new_master_password=xxx
Option new_master_user does not take an argument
Option new_master_password does not take an argument
IN SCRIPT TEST====root|/sbin/ifconfig ens33:1 down==root|/sbin/ifconfig ens33:1 192.168.1.88/24===
Enabling the VIP - 192.168.1.88/24 on the new master - 192.168.1.102
Thu Jun 11 21:00:02 2020 - [info] OK.
Thu Jun 11 21:00:02 2020 - [info] ** Finished master recovery successfully.
Thu Jun 11 21:00:02 2020 - [info] * Phase 3: Master Recovery Phase completed.
Thu Jun 11 21:00:02 2020 - [info]
Thu Jun 11 21:00:02 2020 - [info] * Phase 4: Slaves Recovery Phase..
Thu Jun 11 21:00:02 2020 - [info]
Thu Jun 11 21:00:02 2020 - [info]
Thu Jun 11 21:00:02 2020 - [info] * Phase 4.1: Starting Slaves in parallel..
Thu Jun 11 21:00:02 2020 - [info]
Thu Jun 11 21:00:02 2020 - [info] -- Slave recovery on host 192.168.1.103(192.168.1.103:3306) started, pid: 28647. Check tmp log /root/mha/192.168.1.103_3306_20200611205957.log if it takes time..
Thu Jun 11 21:00:04 2020 - [info]
Thu Jun 11 21:00:04 2020 - [info] Log messages from 192.168.1.103 ...
Thu Jun 11 21:00:04 2020 - [info]
Thu Jun 11 21:00:02 2020 - [info] Resetting slave 192.168.1.103(192.168.1.103:3306) and starting replication from the new master 192.168.1.102(192.168.1.102:3306)..
Thu Jun 11 21:00:02 2020 - [info] Executed CHANGE MASTER.
Thu Jun 11 21:00:03 2020 - [info] Slave started.
Thu Jun 11 21:00:03 2020 - [info] gtid_wait(1dbd5375-a4d9-11ea-9eef-000c29cf4cca:1,
81502f9e-a592-11ea-b912-000c2928707c:1-16) completed on 192.168.1.103(192.168.1.103:3306). Executed 0 events.
Thu Jun 11 21:00:04 2020 - [info] End of log messages from 192.168.1.103.
Thu Jun 11 21:00:04 2020 - [info] -- Slave on host 192.168.1.103(192.168.1.103:3306) started.
Thu Jun 11 21:00:04 2020 - [info] All new slave servers recovered successfully.
Thu Jun 11 21:00:04 2020 - [info]
Thu Jun 11 21:00:04 2020 - [info] * Phase 5: New master cleanup phase..
Thu Jun 11 21:00:04 2020 - [info]
Thu Jun 11 21:00:04 2020 - [info] Resetting slave info on the new master..
Thu Jun 11 21:00:04 2020 - [info] 192.168.1.102: Resetting slave info succeeded.
Thu Jun 11 21:00:04 2020 - [info] Master failover to 192.168.1.102(192.168.1.102:3306) completed successfully.
Thu Jun 11 21:00:04 2020 - [info]
----- Failover Report -----
mysql-mha: MySQL Master failover 192.168.1.101(192.168.1.101:3306) to 192.168.1.102(192.168.1.102:3306) succeeded
Master 192.168.1.101(192.168.1.101:3306) is down!
Check MHA Manager logs at localhost.localdomain:/root/mha/manager.log for details.
Started automated(non-interactive) failover.
Invalidated master IP address on 192.168.1.101(192.168.1.101:3306)
Selected 192.168.1.102(192.168.1.102:3306) as a new master.
192.168.1.102(192.168.1.102:3306): OK: Applying all logs succeeded.
192.168.1.102(192.168.1.102:3306): OK: Activated master IP address.
192.168.1.103(192.168.1.103:3306): OK: Slave started, replicating from 192.168.1.102(192.168.1.102:3306)
192.168.1.102(192.168.1.102:3306): Resetting slave info succeeded.
Master failover to 192.168.1.102(192.168.1.102:3306) completed successfully.
The log shows that 192.168.1.102 was promoted to be the new master.
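The split-brain problem
If the original master recovers, will it take the master role back, or will there be more than one master?
If the original master came back still acting as a master, the replication cluster would have two masters, which is split brain.
Let's see what actually happens in this setup.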
Restart the MySQL service on 192.168.1.101. Since the new master is now 192.168.1.102, run show slave hosts on the new master to see how many slaves are connected to it
show slave hosts;
+-----------+------+------+-----------+--------------------------------------+
| Server_id | Host | Port | Master_id | Slave_UUID |
+-----------+------+------+-----------+--------------------------------------+
| 103 | | 3306 | 102 | d6532e2a-a592-11ea-99c3-000c297f5b55 |
+-----------+------+------+-----------+--------------------------------------+
1 row in set (0.00 sec)
Only 192.168.1.103 is listed as a slave. In other words, after the original master recovered it neither took the master role back nor became a slave.
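One more thing worth checking at this point is that the write VIP is no longer configured on the recovered node, since clients writing to it there would bypass the new master. The following is a minimal sketch, assuming the iproute2 ip command is available on the nodes. The function name has_address is made up for this example.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read `ip -o -4 addr show` output on stdin and succeed
# if the given IPv4 address is configured on any interface.
has_address() {
    grep -qF "inet $1/"
}

# Example (run from the monitoring node):
#   ssh root@192.168.1.101 ip -o -4 addr show | has_address 192.168.1.88 \
#     && echo "write VIP is still on the old master"
```

In the failover shown above, the stopssh step of master_ip_failover had already taken the VIP down on 192.168.1.101, so the check is expected to fail there and succeed on 192.168.1.102.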
To bring the original master back into the cluster, it has to be reconfigured as a slave of the new master
change master to master_host='192.168.1.102', master_user='repl', master_password='your password', master_auto_position=1;
start slave;
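During failover MHA writes the statement that the other slaves should use into its log (the "Statement should be: CHANGE MASTER TO ..." line shown in the failover log above), so it does not have to be typed from memory. The following is a minimal sketch that pulls that statement out of the manager log. The function name extract_change_master is made up for this example, and the default log path is the manager_log value from this article's configuration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the most recent CHANGE MASTER statement that MHA
# recorded in its manager log.
extract_change_master() {
    sed -n 's/.*Statement should be: \(CHANGE MASTER TO .*;\).*/\1/p' \
        "${1:-/root/mha/manager.log}" | tail -n 1
}
```

MHA masks the replication password as xxx in the log, so replace it with the real password before running the statement on the old master. Also note that masterha_manager exits once a failover has completed, so monitoring has to be started again after the topology is repaired.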
After configuring and starting replication, check the master's slave nodes again
show slave hosts;
+-----------+------+------+-----------+--------------------------------------+
| Server_id | Host | Port | Master_id | Slave_UUID |
+-----------+------+------+-----------+--------------------------------------+
| 101 | | 3306 | 102 | 81502f9e-a592-11ea-b912-000c2928707c |
| 103 | | 3306 | 102 | d6532e2a-a592-11ea-99c3-000c297f5b55 |
+-----------+------+------+-----------+--------------------------------------+
2 rows in set (0.00 sec)
The original master has now rejoined the cluster, as a slave, after recovering from the failure.
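Summary
High availability for MySQL is very important. Building an MHA high-availability architecture by hand helps us understand how MHA works, and keeps us from being helpless when MySQL fails.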