手动搭建MySQL高可用架构MHA,小白也能轻松搭建

前言

MySQL作为数据存储工具,可以说是整个架构体系中最重要的一环都不为过。无论是怎样的架构,怎样的设计,都不能离开关系型数据库。如果数据库故障了,整个系统肯定是不可用的,所以MySQL的高可用非常重要。本篇主要从理论上讲解常见的MySQL高可用架构MMMMHA,以及从零开始,一步步搭建一个高可用的MHA架构。

前置阅读

本篇内容是基于上一篇来扩展的,所以请务必阅读一下这篇前置阅读。

高可用

MySQL主从复制常见的高可用架构有两种,MMMMHA。想要实现MySQL主从复制的高可用,需要实现以下几点功能

  • 对主从复制集群中的master节点进行监控
  • master节点宕机后把VIPVirtual IP Address,即虚拟IP)迁移到新的master节点
  • 重新配置集群中的其他slave节点从新的master同步

MMM架构

MMMMaster-Master replication manager for MySQL)是一套支持双主故障切换和双主日常管理的脚本程序。主要用来监控管理Master-Master(双主)复制,虽然叫做双主复制,但是同一个时刻只有一个master,另一个作为master的备份,以加速在主主切换时刻备选master的预热,一方面实现了故障自动切换的功能,另一方面也可以实现多个slave读负载均衡

MMM的整体架构图如下所示
MySQ高可用
结合MMM的架构图,我们可以知道

  • 整个架构需要两个master节点,两个master节点互为主备。同一时刻,只能有一个master对外提供服务
  • 可以有多个用于读操作的slave节点
  • master分配一个VIP只能在主备之间切换;给每个slave节点分配一个读VIP,可以在任意slave节点上切换

master宕机时,MMM管理工具会把所有的slave节点切换为主备的slave。并且把写VIP迁移到主备服务器上,slave节点从新的master节点上同步数据,整个过程简单粗暴,所以无法保证数据的一致性。

slave节点宕机时,MMM管理工具会把读VIP迁移到其他slave节点,slave节点可以有多个VIP

MMM架构缺点

  • 故障切换简单粗暴,容易丢失事务(可以采用半同步复制改进)
  • 不支持GTID的复制方式(可以自行修改perl脚本)
  • 社区不活跃,很久没更新新版本

MHA架构

MHAMaster High Availability)是一款开源的 MySQL 的高可用程序。MHA管理工具在监控到master节点故障时,会提升拥有master最新数据的slave节点成为新的master节点,并且会让其他的slave节点从新的master节点上同步数据。MHA还提供了master节点的在线切换功能,即按需切换master/slave节点。

MHA的架构图如下所示
MySQL高可用架构
MHA架构图可以知道,MHA只监控master的健康状态,当master宕机时,MHA管理工具会从master所有的slave中选出一个最接近master的节点提升为新的master

MHA故障迁移

MHA管理下的MySQL主从复制,master故障之后,会经过如下过程进行故障转移

  • 移除宕机的masterVIP,挑选具有最新数据的slave
  • 尝试从宕机的master保存二进制日志(如果仅仅是MySQL实例宕机,则有可能成功)
  • 应用差异的中继日志(relay log)到其它slave,因为作为备选master的节点的中继日志,和其它slave节点的中继日志可能有差异,所以要把备选master节点的中继日志应用到其它slave节点
  • 应用从master保存的二进制日志(如果第二步成功)
  • 把备选的master提升为新的master
  • 配置其他的slave从新的master同步,把写VIP迁移到新的master

MHA优势

  • 支持GTID的复制方式和基于日志点的复制方式
  • 可以从多个slave节点中选取最适合的master
  • 会尝试从旧的master中尽可能保存更多的未同步日志

MHA不足

  • 不一定能获取到原master的未同步日志
  • 需要自行开发写VIP转移脚本
  • 只监控master,而没有对slave实现高可用

MHA适用场景

  • 适用基于GTID的复制方式
  • 使用一主多从的复制架构
  • 希望更少的数据丢失场景

搭建MHA

整个MHA的搭建过程虽然不算复杂,但是涉及到的步骤较多,建议先整体阅读一下,再动手实践

配置节点间SSH免密

首先在master节点(192.168.1.101)上执行,一路回车即可

ssh-keygen

执行结果如下

Generating public/private rsa key pair.
Enter file in which to save the key (/root/.ssh/id_rsa):
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /root/.ssh/id_rsa.
Your public key has been saved in /root/.ssh/id_rsa.pub.
The key fingerprint is:
SHA256:HFpSaM7IVW+TjQVUM0m1JBNgrnhH85O3wuur58sev1E [email protected]
The key's randomart image is:
+---[RSA 2048]----+
|       oo.==X+o  |
|      +. + =.* . |
|   . *. o X . .  |
|    o o* = + .   |
|      o S . + . E|
|       . . . o o |
|            + o  |
|           ..= . |
|          .*Ooo. |
+----[SHA256]-----+

把生成的/root/.ssh/id_rsa拷贝到三个节点上(包括自己)

ssh-copy-id -i /root/.ssh/id_rsa [email protected]
ssh-copy-id -i /root/.ssh/id_rsa [email protected]
ssh-copy-id -i /root/.ssh/id_rsa [email protected]

执行完成后,在192.168.1.101使用ssh命令连接到102103上是不需要密码的

ssh 192.168.1.102

上述操作需要在192.168.1.102192.168.1.103上均执行一遍

安装yum扩展包

  • 下载
    wget https://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm
    
  • 安装
    rpm -ivh epel-release-latest-7.noarch.rpm
    
  • 修改
    vim /etc/yum.repos.d/epel.repo
    
    只需要修改一项内容,就是epel节点下的gpgcheck
    [epel]
    ...
    ## 只需要修改epel节点下的gpgcheck属性
    gpgcheck=0
    ...
    

上述操作需要在所有节点上均执行一遍

安装依赖

执行如下命令

yum -y install perl-DBD-MySQL ncftp perl-DBI.x86

上述操作需要在所有节点上均执行一遍

安装MHA管理工具

下载地址:https://download.csdn.net/download/Baisitao_/12505957

## 安装mha-node
rpm -ivh mha4mysql-node-0.57-0.el7.noarch.rpm

上述操作需要在所有节点上均执行一遍

安装管理节点

严格来说,监控工具应该安装在一个单独的节点,此处为了节约一个节点,就安装在192.168.1.103上。

yum -y install perl-Config-Tiny.noarch perl-Time-HiRes.x86_64 perl-Parallel-ForkManager perl-Log-Dispatch

安装完成后就可以开始安装mha-manager

rpm -ivh mha4mysql-manager-0.57-0.el7.noarch.rpm

创建mha目录

在监控节点(192.168.1.103)上,创建mha的配置目录

## 配置目录
mkdir -p /etc/mha

每个节点上创建mha工作目录

## 工作目录,该目录用于master宕机时,slave将master的bin log拷贝到此目录
mkdir -p /root/mha

创建mha账户

master节点(192.168.1.101)上,创建mha需要用到的账户,并授权

## 创建用户
create user dba_mha@'192.168.1.%' identified by 'your password';
## 授权
grant all privileges on *.* to dba_mha@'192.168.1.%';

编辑配置文件

在监控节点(192.168.1.103)上新建并编辑配置文件

vim /etc/mha/mysql-mha.conf

配置如下内容,根据自己的实际情况进行修改(passwordip、目录等)

[server default]
user=dba_mha
## 注意改成自己的密码
password=your password
manager_workdir=/root/mha
manager_log=/root/mha/manager.log
remote_workdir=/root/mha
ssh_user=root
repl_password=your password
ping_interval=1
master_binlog_dir=/home/mysql/sql_log
ssh_port=22
master_ip_failover_script=/usr/bin/master_ip_failover
secondary_check_script=/usr/bin/masterha_secondary_check -s 192.168.1.101 -s 192.168.1.102 -s 192.168.1.103

[server1]
hostname=192.168.1.101
candidate_master=1
[server2]
hostname=192.168.1.102
candidate_master=1
[server3]
hostname=192.168.1.103
## 该节点也是监控节点,所以关闭master候选
no_master=1

从配置文件可以看到,参数master_ip_failover_script配置了master故障时,需要执行写VIP的故障转移脚本/usr/bin/master_ip_failover。所以还需要配置这个脚本,创建并编辑这个脚本

vim /usr/bin/master_ip_failover

配置如下内容,根据自己的实际情况进行修改

#!/usr/bin/env perl

use strict;
use warnings FATAL => 'all';

use Getopt::Long;

my (
    $command, $orig_master_host, $orig_master_ip,$ssh_user,
    $orig_master_port, $new_master_host, $new_master_ip,$new_master_port,
    $orig_master_ssh_port,$new_master_ssh_port,$new_master_user,$new_master_password
);

my $vip = '192.168.1.88/24';
my $key = '1';
my $ssh_start_vip = "/sbin/ifconfig ens33:$key $vip";
my $ssh_stop_vip = "/sbin/ifconfig ens33:$key down";
my $ssh_Bcast_arp= "/sbin/arping -I ens33 -c 3 -A 192.168.1.88";

GetOptions(
    'command=s'          => \$command,
    'ssh_user=s'         => \$ssh_user,
    'orig_master_host=s' => \$orig_master_host,
    'orig_master_ip=s'   => \$orig_master_ip,
    'orig_master_port=i' => \$orig_master_port,
    'orig_master_ssh_port=i' => \$orig_master_ssh_port,
    'new_master_host=s'  => \$new_master_host,
    'new_master_ip=s'    => \$new_master_ip,
    'new_master_port=i'  => \$new_master_port,
    'new_master_ssh_port' => \$new_master_ssh_port,
    'new_master_user' => \$new_master_user,
    'new_master_password' => \$new_master_password

);

exit &main();

sub main {
    $ssh_user = defined $ssh_user ? $ssh_user : 'root';
    print "\n\nIN SCRIPT TEST====$ssh_user|$ssh_stop_vip==$ssh_user|$ssh_start_vip===\n\n";

    if ( $command eq "stop" || $command eq "stopssh" ) {

        my $exit_code = 1;
        eval {
            print "Disabling the VIP on old master: $orig_master_host \n";
            &stop_vip();
            $exit_code = 0;
        };
        if ($@) {
            warn "Got Error: $@\n";
            exit $exit_code;
        }
        exit $exit_code;
    }
    elsif ( $command eq "start" ) {

        my $exit_code = 10;
        eval {
            print "Enabling the VIP - $vip on the new master - $new_master_host \n";
            &start_vip();
	    &start_arp();
            $exit_code = 0;
        };
        if ($@) {
            warn $@;
            exit $exit_code;
        }
        exit $exit_code;
    }
    elsif ( $command eq "status" ) {
        print "Checking the Status of the script.. OK \n";
        exit 0;
    }
    else {
        &usage();
        exit 1;
    }
}

sub start_vip() {
    `ssh $ssh_user\@$new_master_host \" $ssh_start_vip \"`;
}
sub stop_vip() {
    `ssh $ssh_user\@$orig_master_host \" $ssh_stop_vip \"`;
}

sub start_arp() {
    `ssh $ssh_user\@$new_master_host \" $ssh_Bcast_arp \"`;
}
sub usage {
    print
    "Usage: master_ip_failover --command=start|stop|stopssh|status --ssh_user=user --orig_master_host=host --orig_master_ip=ip --orig_master_port=port --new_master_host=host --new_master_ip=ip --new_master_port=port\n";
}

配置文件中,值得注意的地方(第14行开始)如下图所示
master故障切换
vip表示读的虚拟IP,而不是master节点的IPens33网络接口的名称,可以通过ifconfig查看
查看网络接口
这个脚本实现了master故障时,写VIP的自动转移。

脚本编辑完成后,赋予可执行的权限

chmod +x /usr/bin/master_ip_failover

检查配置

由于配置内容比较多,不能保证全部都正确,所以可以先校验一下相关配置,在监控节点(192.168.1.103)上执行

  • 检查SSH配置

    masterha_check_ssh --conf=/etc/mha/mysql-mha.conf
    

    执行结果

    Tue Jun  9 22:11:11 2020 - [warning] Global configuration file /etc/masterha_default.cnf not found. Skipping.
    Tue Jun  9 22:11:11 2020 - [info] Reading application default configuration from /etc/mha/mysql-mha.conf..
    Tue Jun  9 22:11:11 2020 - [info] Reading server configuration from /etc/mha/mysql-mha.conf..
    Tue Jun  9 22:11:11 2020 - [info] Starting SSH connection tests..
    Tue Jun  9 22:11:16 2020 - [debug]
    Tue Jun  9 22:11:12 2020 - [debug]  Connecting via SSH from [email protected](192.168.1.103:22) to [email protected](192.168.1.101:22)..
    Tue Jun  9 22:11:14 2020 - [debug]   ok.
    Tue Jun  9 22:11:14 2020 - [debug]  Connecting via SSH from [email protected](192.168.1.103:22) to [email protected](192.168.1.102:22)..
    Tue Jun  9 22:11:15 2020 - [debug]   ok.
    Tue Jun  9 22:11:19 2020 - [debug]
    Tue Jun  9 22:11:11 2020 - [debug]  Connecting via SSH from [email protected](192.168.1.101:22) to [email protected](192.168.1.102:22)..
    Tue Jun  9 22:11:17 2020 - [debug]   ok.
    Tue Jun  9 22:11:17 2020 - [debug]  Connecting via SSH from [email protected](192.168.1.101:22) to [email protected](192.168.1.103:22)..
    Tue Jun  9 22:11:18 2020 - [debug]   ok.
    Tue Jun  9 22:11:25 2020 - [debug]
    Tue Jun  9 22:11:12 2020 - [debug]  Connecting via SSH from [email protected](192.168.1.102:22) to [email protected](192.168.1.101:22)..
    Tue Jun  9 22:11:13 2020 - [debug]   ok.
    Tue Jun  9 22:11:13 2020 - [debug]  Connecting via SSH from [email protected](192.168.1.102:22) to [email protected](192.168.1.103:22)..
    Tue Jun  9 22:11:24 2020 - [debug]   ok.
    Tue Jun  9 22:11:25 2020 - [info] All SSH connection tests passed successfully.
    

    通过日志可以看到SSH的配置正确

  • 检查主从复制的配置

    masterha_check_repl --conf=/etc/mha/mysql-mha.conf
    

    执行结果

    Tue Jun  9 22:22:43 2020 - [warning] Global configuration file /etc/masterha_default.cnf not found. Skipping.
    Tue Jun  9 22:22:43 2020 - [info] Reading application default configuration from /etc/mha/mysql-mha.conf..
    Tue Jun  9 22:22:43 2020 - [info] Reading server configuration from /etc/mha/mysql-mha.conf..
    Tue Jun  9 22:22:43 2020 - [info] MHA::MasterMonitor version 0.57.
    Tue Jun  9 22:22:45 2020 - [info] GTID failover mode = 1
    Tue Jun  9 22:22:45 2020 - [info] Dead Servers:
    Tue Jun  9 22:22:45 2020 - [info] Alive Servers:
    Tue Jun  9 22:22:45 2020 - [info]   192.168.1.101(192.168.1.101:3306)
    Tue Jun  9 22:22:45 2020 - [info]   192.168.1.102(192.168.1.102:3306)
    Tue Jun  9 22:22:45 2020 - [info]   192.168.1.103(192.168.1.103:3306)
    Tue Jun  9 22:22:45 2020 - [info] Alive Slaves:
    Tue Jun  9 22:22:45 2020 - [info]   192.168.1.102(192.168.1.102:3306)  Version=5.7.30-log (oldest major version between slaves) log-bin:enabled
    Tue Jun  9 22:22:45 2020 - [info]     GTID ON
    Tue Jun  9 22:22:45 2020 - [info]     Replicating from 192.168.1.101(192.168.1.101:3306)
    Tue Jun  9 22:22:45 2020 - [info]     Primary candidate for the new Master (candidate_master is set)
    Tue Jun  9 22:22:45 2020 - [info]   192.168.1.103(192.168.1.103:3306)  Version=5.7.30-log (oldest major version between slaves) log-bin:enabled
    Tue Jun  9 22:22:45 2020 - [info]     GTID ON
    Tue Jun  9 22:22:45 2020 - [info]     Replicating from 192.168.1.101(192.168.1.101:3306)
    Tue Jun  9 22:22:45 2020 - [info]     Not candidate for the new Master (no_master is set)
    Tue Jun  9 22:22:45 2020 - [info] Current Alive Master: 192.168.1.101(192.168.1.101:3306)
    Tue Jun  9 22:22:45 2020 - [info] Checking slave configurations..
    Tue Jun  9 22:22:45 2020 - [info]  read_only=1 is not set on slave 192.168.1.102(192.168.1.102:3306).
    Tue Jun  9 22:22:45 2020 - [info]  read_only=1 is not set on slave 192.168.1.103(192.168.1.103:3306).
    Tue Jun  9 22:22:45 2020 - [info] Checking replication filtering settings..
    Tue Jun  9 22:22:45 2020 - [info]  binlog_do_db= , binlog_ignore_db=
    Tue Jun  9 22:22:45 2020 - [info]  Replication filtering check ok.
    Tue Jun  9 22:22:45 2020 - [info] GTID (with auto-pos) is supported. Skipping all SSH and Node package checking.
    Tue Jun  9 22:22:45 2020 - [info] Checking SSH publickey authentication settings on the current master..
    Tue Jun  9 22:22:50 2020 - [warning] HealthCheck: Got timeout on checking SSH connection to 192.168.1.101! at /usr/share/perl5/vendor_perl/MHA/HealthCheck.pm line 342.
    Tue Jun  9 22:22:50 2020 - [info]
    192.168.1.101(192.168.1.101:3306) (current master)
     +--192.168.1.102(192.168.1.102:3306)
     +--192.168.1.103(192.168.1.103:3306)
    
    Tue Jun  9 22:22:50 2020 - [info] Checking replication health on 192.168.1.102..
    Tue Jun  9 22:22:50 2020 - [info]  ok.
    Tue Jun  9 22:22:50 2020 - [info] Checking replication health on 192.168.1.103..
    Tue Jun  9 22:22:50 2020 - [info]  ok.
    Tue Jun  9 22:22:50 2020 - [info] Checking master_ip_failover_script status:
    Tue Jun  9 22:22:50 2020 - [info]   /usr/bin/master_ip_failover --command=status --ssh_user=root --orig_master_host=192.168.1.101 --orig_master_ip=192.168.1.101 --orig_master_port=3306
    
    
    IN SCRIPT TEST====root|/sbin/ifconfig ens33:1 down==root|/sbin/ifconfig ens33:1 192.168.1.88/24===
    
    Checking the Status of the script.. OK
    Tue Jun  9 22:22:50 2020 - [info]  OK.
    Tue Jun  9 22:22:50 2020 - [warning] shutdown_script is not defined.
    Tue Jun  9 22:22:50 2020 - [info] Got exit code 0 (Not master dead).
    
    MySQL Replication Health is OK.
    

    根据输出可以看到,主从复制配置也正确。

更多的检查方式可以通过ll /usr/bin/ |grep master命令查看

master首次配置VIP

由于MHA工具只会在故障时迁移VIP,所以第一次启动MHA的时候,需要手动给master节点(192.168.1.101)配置一个写VIP,配置方式如下,在master节点(192.168.1.101)上执行如下命令(参数需要根据实际情况修改)

/sbin/ifconfig ens33:1 192.168.1.88/24

ens33是网络接口的名称,192.168.1.88是写VIP,这些配置在master_ip_failover脚本中已经指定过。

配置写VIP之前,使用ifconfig输出如下

ens33: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 192.168.1.101  netmask 255.255.255.0  broadcast 192.168.1.255
        inet6 fe80::bce6:1d30:472c:d811  prefixlen 64  scopeid 0x20<link>
        inet6 2409:8a4c:a13:3f30:9d96:8b33:ca89:c62c  prefixlen 64  scopeid 0x0<global>
        ether 00:0c:29:28:70:7c  txqueuelen 1000  (Ethernet)
        RX packets 979338  bytes 460658144 (439.3 MiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 693198  bytes 278374776 (265.4 MiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

lo: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536
        inet 127.0.0.1  netmask 255.0.0.0
        inet6 ::1  prefixlen 128  scopeid 0x10<host>
        loop  txqueuelen 1000  (Local Loopback)
        RX packets 208973  bytes 18422224 (17.5 MiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 208973  bytes 18422224 (17.5 MiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

配置写VIP之后,ifconfig名称输出如下

ens33: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 192.168.1.101  netmask 255.255.255.0  broadcast 192.168.1.255
        inet6 fe80::bce6:1d30:472c:d811  prefixlen 64  scopeid 0x20<link>
        inet6 2409:8a4c:a13:3f30:9d96:8b33:ca89:c62c  prefixlen 64  scopeid 0x0<global>
        ether 00:0c:29:28:70:7c  txqueuelen 1000  (Ethernet)
        RX packets 1040146  bytes 477864466 (455.7 MiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 733075  bytes 299370855 (285.5 MiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

ens33:1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 192.168.1.88  netmask 255.255.255.0  broadcast 192.168.1.255
        ether 00:0c:29:28:70:7c  txqueuelen 1000  (Ethernet)

lo: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536
        inet 127.0.0.1  netmask 255.0.0.0
        inet6 ::1  prefixlen 128  scopeid 0x10<host>
        loop  txqueuelen 1000  (Local Loopback)
        RX packets 222701  bytes 19630288 (18.7 MiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 222701  bytes 19630288 (18.7 MiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

启动MHA

在监控节点(192.168.1.103)上执行如下命令(默认前台运行)

masterha_manager --conf=/etc/mha/mysql-mha.conf

执行之后输出日志如下

Tue Jun  9 22:38:05 2020 - [warning] Global configuration file /etc/masterha_default.cnf not found. Skipping.
Tue Jun  9 22:38:05 2020 - [info] Reading application default configuration from /etc/mha/mysql-mha.conf..
Tue Jun  9 22:38:05 2020 - [info] Reading server configuration from /etc/mha/mysql-mha.conf..

可以看到MHA已经成功启动。

除此之外,/root/mha目录下还有两个相关的文件manager.logmysql-mha.master_status.health,分别用来记录MHA日志和master节点的健康状态。

至此,MHA架构已经搭建完成。

因为master节点的VIP192.168.1.88,所以写操作只需要连接这个VIP即可。如果连接不上,请开启MySQL允许远程访问

故障切换日志

MHA高可用搭建后,理论上是高可用的,即master宕机后,马上会提升一个slave为新的master。但是理论归理论,我们还是要实践下。

以下日志是master宕机(只停止了MySQL服务)后,MHA监控工具打印的日志。

Thu Jun 11 20:59:53 2020 - [warning] Got error on MySQL select ping: 2013 (Lost connection to MySQL server during query)
Thu Jun 11 20:59:53 2020 - [info] Executing secondary network check script: /usr/bin/masterha_secondary_check -s 192.168.1.101 -s 192.168.1.102 -s 192.168.1.103  --user=root  --master_host=192.168.1.101  --master_ip=192.168.1.101  --master_port=3306 --master_user=dba_mha --master_password=Ppnn13y,dkst2yc. --ping_type=SELECT
Thu Jun 11 20:59:53 2020 - [info] Executing SSH check script: exit 0
Thu Jun 11 20:59:54 2020 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '192.168.1.101' (111))
Thu Jun 11 20:59:54 2020 - [warning] Connection failed 2 time(s)..
Thu Jun 11 20:59:54 2020 - [info] HealthCheck: SSH to 192.168.1.101 is reachable.
Monitoring server 192.168.1.101 is reachable, Master is not reachable from 192.168.1.101. OK.
Thu Jun 11 20:59:55 2020 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '192.168.1.101' (111))
Thu Jun 11 20:59:55 2020 - [warning] Connection failed 3 time(s)..
Monitoring server 192.168.1.102 is reachable, Master is not reachable from 192.168.1.102. OK.
Thu Jun 11 20:59:56 2020 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '192.168.1.101' (111))
Thu Jun 11 20:59:56 2020 - [warning] Connection failed 4 time(s)..
Monitoring server 192.168.1.103 is reachable, Master is not reachable from 192.168.1.103. OK.
Thu Jun 11 20:59:56 2020 - [info] Master is not reachable from all other monitoring servers. Failover should start.
Thu Jun 11 20:59:56 2020 - [warning] Master is not reachable from health checker!
Thu Jun 11 20:59:56 2020 - [warning] Master 192.168.1.101(192.168.1.101:3306) is not reachable!
Thu Jun 11 20:59:56 2020 - [warning] SSH is reachable.
Thu Jun 11 20:59:56 2020 - [info] Connecting to a master server failed. Reading configuration file /etc/masterha_default.cnf and /etc/mha/mysql-mha.conf again, and trying to connect to all servers to check server status..
Thu Jun 11 20:59:56 2020 - [warning] Global configuration file /etc/masterha_default.cnf not found. Skipping.
Thu Jun 11 20:59:56 2020 - [info] Reading application default configuration from /etc/mha/mysql-mha.conf..
Thu Jun 11 20:59:56 2020 - [info] Reading server configuration from /etc/mha/mysql-mha.conf..
Thu Jun 11 20:59:57 2020 - [info] GTID failover mode = 1
Thu Jun 11 20:59:57 2020 - [info] Dead Servers:
Thu Jun 11 20:59:57 2020 - [info]   192.168.1.101(192.168.1.101:3306)
Thu Jun 11 20:59:57 2020 - [info] Alive Servers:
Thu Jun 11 20:59:57 2020 - [info]   192.168.1.102(192.168.1.102:3306)
Thu Jun 11 20:59:57 2020 - [info]   192.168.1.103(192.168.1.103:3306)
Thu Jun 11 20:59:57 2020 - [info] Alive Slaves:
Thu Jun 11 20:59:57 2020 - [info]   192.168.1.102(192.168.1.102:3306)  Version=5.7.30-log (oldest major version between slaves) log-bin:enabled
Thu Jun 11 20:59:57 2020 - [info]     GTID ON
Thu Jun 11 20:59:57 2020 - [info]     Replicating from 192.168.1.101(192.168.1.101:3306)
Thu Jun 11 20:59:57 2020 - [info]     Primary candidate for the new Master (candidate_master is set)
Thu Jun 11 20:59:57 2020 - [info]   192.168.1.103(192.168.1.103:3306)  Version=5.7.30-log (oldest major version between slaves) log-bin:enabled
Thu Jun 11 20:59:57 2020 - [info]     GTID ON
Thu Jun 11 20:59:57 2020 - [info]     Replicating from 192.168.1.101(192.168.1.101:3306)
Thu Jun 11 20:59:57 2020 - [info]     Not candidate for the new Master (no_master is set)
Thu Jun 11 20:59:57 2020 - [info] Checking slave configurations..
Thu Jun 11 20:59:57 2020 - [info]  read_only=1 is not set on slave 192.168.1.102(192.168.1.102:3306).
Thu Jun 11 20:59:57 2020 - [info]  read_only=1 is not set on slave 192.168.1.103(192.168.1.103:3306).
Thu Jun 11 20:59:57 2020 - [info] Checking replication filtering settings..
Thu Jun 11 20:59:57 2020 - [info]  Replication filtering check ok.
Thu Jun 11 20:59:57 2020 - [info] Master is down!
Thu Jun 11 20:59:57 2020 - [info] Terminating monitoring script.
Thu Jun 11 20:59:57 2020 - [info] Got exit code 20 (Master dead).
Thu Jun 11 20:59:57 2020 - [info] MHA::MasterFailover version 0.57.
Thu Jun 11 20:59:57 2020 - [info] Starting master failover.
Thu Jun 11 20:59:57 2020 - [info]
Thu Jun 11 20:59:57 2020 - [info] * Phase 1: Configuration Check Phase..
Thu Jun 11 20:59:57 2020 - [info]
Thu Jun 11 20:59:59 2020 - [info] GTID failover mode = 1
Thu Jun 11 20:59:59 2020 - [info] Dead Servers:
Thu Jun 11 20:59:59 2020 - [info]   192.168.1.101(192.168.1.101:3306)
Thu Jun 11 20:59:59 2020 - [info] Checking master reachability via MySQL(double check)...
Thu Jun 11 20:59:59 2020 - [info]  ok.
Thu Jun 11 20:59:59 2020 - [info] Alive Servers:
Thu Jun 11 20:59:59 2020 - [info]   192.168.1.102(192.168.1.102:3306)
Thu Jun 11 20:59:59 2020 - [info]   192.168.1.103(192.168.1.103:3306)
Thu Jun 11 20:59:59 2020 - [info] Alive Slaves:
Thu Jun 11 20:59:59 2020 - [info]   192.168.1.102(192.168.1.102:3306)  Version=5.7.30-log (oldest major version between slaves) log-bin:enabled
Thu Jun 11 20:59:59 2020 - [info]     GTID ON
Thu Jun 11 20:59:59 2020 - [info]     Replicating from 192.168.1.101(192.168.1.101:3306)
Thu Jun 11 20:59:59 2020 - [info]     Primary candidate for the new Master (candidate_master is set)
Thu Jun 11 20:59:59 2020 - [info]   192.168.1.103(192.168.1.103:3306)  Version=5.7.30-log (oldest major version between slaves) log-bin:enabled
Thu Jun 11 20:59:59 2020 - [info]     GTID ON
Thu Jun 11 20:59:59 2020 - [info]     Replicating from 192.168.1.101(192.168.1.101:3306)
Thu Jun 11 20:59:59 2020 - [info]     Not candidate for the new Master (no_master is set)
Thu Jun 11 20:59:59 2020 - [info] Starting GTID based failover.
Thu Jun 11 20:59:59 2020 - [info]
Thu Jun 11 20:59:59 2020 - [info] ** Phase 1: Configuration Check Phase completed.
Thu Jun 11 20:59:59 2020 - [info]
Thu Jun 11 20:59:59 2020 - [info] * Phase 2: Dead Master Shutdown Phase..
Thu Jun 11 20:59:59 2020 - [info]
Thu Jun 11 20:59:59 2020 - [info] Forcing shutdown so that applications never connect to the current master..
Thu Jun 11 20:59:59 2020 - [info] Executing master IP deactivation script:
Thu Jun 11 20:59:59 2020 - [info]   /usr/bin/master_ip_failover --orig_master_host=192.168.1.101 --orig_master_ip=192.168.1.101 --orig_master_port=3306 --command=stopssh --ssh_user=root


IN SCRIPT TEST====root|/sbin/ifconfig ens33:1 down==root|/sbin/ifconfig ens33:1 192.168.1.88/24===

Disabling the VIP on old master: 192.168.1.101
Thu Jun 11 20:59:59 2020 - [info]  done.
Thu Jun 11 20:59:59 2020 - [warning] shutdown_script is not set. Skipping explicit shutting down of the dead master.
Thu Jun 11 20:59:59 2020 - [info] * Phase 2: Dead Master Shutdown Phase completed.
Thu Jun 11 20:59:59 2020 - [info]
Thu Jun 11 20:59:59 2020 - [info] * Phase 3: Master Recovery Phase..
Thu Jun 11 20:59:59 2020 - [info]
Thu Jun 11 20:59:59 2020 - [info] * Phase 3.1: Getting Latest Slaves Phase..
Thu Jun 11 20:59:59 2020 - [info]
Thu Jun 11 20:59:59 2020 - [info] The latest binary log file/position on all slaves is mysql-bin.000003:2435
Thu Jun 11 20:59:59 2020 - [info] Retrieved Gtid Set: 81502f9e-a592-11ea-b912-000c2928707c:12-16
Thu Jun 11 20:59:59 2020 - [info] Latest slaves (Slaves that received relay log files to the latest):
Thu Jun 11 20:59:59 2020 - [info]   192.168.1.102(192.168.1.102:3306)  Version=5.7.30-log (oldest major version between slaves) log-bin:enabled
Thu Jun 11 20:59:59 2020 - [info]     GTID ON
Thu Jun 11 20:59:59 2020 - [info]     Replicating from 192.168.1.101(192.168.1.101:3306)
Thu Jun 11 20:59:59 2020 - [info]     Primary candidate for the new Master (candidate_master is set)
Thu Jun 11 20:59:59 2020 - [info]   192.168.1.103(192.168.1.103:3306)  Version=5.7.30-log (oldest major version between slaves) log-bin:enabled
Thu Jun 11 20:59:59 2020 - [info]     GTID ON
Thu Jun 11 20:59:59 2020 - [info]     Replicating from 192.168.1.101(192.168.1.101:3306)
Thu Jun 11 20:59:59 2020 - [info]     Not candidate for the new Master (no_master is set)
Thu Jun 11 20:59:59 2020 - [info] The oldest binary log file/position on all slaves is mysql-bin.000003:2435
Thu Jun 11 20:59:59 2020 - [info] Retrieved Gtid Set: 81502f9e-a592-11ea-b912-000c2928707c:12-16
Thu Jun 11 20:59:59 2020 - [info] Oldest slaves:
Thu Jun 11 20:59:59 2020 - [info]   192.168.1.102(192.168.1.102:3306)  Version=5.7.30-log (oldest major version between slaves) log-bin:enabled
Thu Jun 11 20:59:59 2020 - [info]     GTID ON
Thu Jun 11 20:59:59 2020 - [info]     Replicating from 192.168.1.101(192.168.1.101:3306)
Thu Jun 11 20:59:59 2020 - [info]     Primary candidate for the new Master (candidate_master is set)
Thu Jun 11 20:59:59 2020 - [info]   192.168.1.103(192.168.1.103:3306)  Version=5.7.30-log (oldest major version between slaves) log-bin:enabled
Thu Jun 11 20:59:59 2020 - [info]     GTID ON
Thu Jun 11 20:59:59 2020 - [info]     Replicating from 192.168.1.101(192.168.1.101:3306)
Thu Jun 11 20:59:59 2020 - [info]     Not candidate for the new Master (no_master is set)
Thu Jun 11 20:59:59 2020 - [info]
Thu Jun 11 20:59:59 2020 - [info] * Phase 3.3: Determining New Master Phase..
Thu Jun 11 20:59:59 2020 - [info]
Thu Jun 11 20:59:59 2020 - [info] Searching new master from slaves..
Thu Jun 11 20:59:59 2020 - [info]  Candidate masters from the configuration file:
Thu Jun 11 20:59:59 2020 - [info]   192.168.1.102(192.168.1.102:3306)  Version=5.7.30-log (oldest major version between slaves) log-bin:enabled
Thu Jun 11 20:59:59 2020 - [info]     GTID ON
Thu Jun 11 20:59:59 2020 - [info]     Replicating from 192.168.1.101(192.168.1.101:3306)
Thu Jun 11 20:59:59 2020 - [info]     Primary candidate for the new Master (candidate_master is set)
Thu Jun 11 20:59:59 2020 - [info]  Non-candidate masters:
Thu Jun 11 20:59:59 2020 - [info]   192.168.1.103(192.168.1.103:3306)  Version=5.7.30-log (oldest major version between slaves) log-bin:enabled
Thu Jun 11 20:59:59 2020 - [info]     GTID ON
Thu Jun 11 20:59:59 2020 - [info]     Replicating from 192.168.1.101(192.168.1.101:3306)
Thu Jun 11 20:59:59 2020 - [info]     Not candidate for the new Master (no_master is set)
Thu Jun 11 20:59:59 2020 - [info]  Searching from candidate_master slaves which have received the latest relay log events..
Thu Jun 11 20:59:59 2020 - [info] New master is 192.168.1.102(192.168.1.102:3306)
Thu Jun 11 20:59:59 2020 - [info] Starting master failover..
Thu Jun 11 20:59:59 2020 - [info]
From:
192.168.1.101(192.168.1.101:3306) (current master)
 +--192.168.1.102(192.168.1.102:3306)
 +--192.168.1.103(192.168.1.103:3306)

To:
192.168.1.102(192.168.1.102:3306) (new master)
 +--192.168.1.103(192.168.1.103:3306)
Thu Jun 11 20:59:59 2020 - [info]
Thu Jun 11 20:59:59 2020 - [info] * Phase 3.3: New Master Recovery Phase..
Thu Jun 11 20:59:59 2020 - [info]
Thu Jun 11 20:59:59 2020 - [info]  Waiting all logs to be applied..
Thu Jun 11 20:59:59 2020 - [info]   done.
Thu Jun 11 20:59:59 2020 - [info] Getting new master's binlog name and position..
Thu Jun 11 20:59:59 2020 - [info]  mysql-bin.000002:463
Thu Jun 11 20:59:59 2020 - [info]  All other slaves should start replication from here. Statement should be: CHANGE MASTER TO MASTER_HOST='192.168.1.102', MASTER_PORT=3306, MASTER_AUTO_POSITION=1, MASTER_USER='repl', MASTER_PASSWORD='xxx';
Thu Jun 11 20:59:59 2020 - [info] Master Recovery succeeded. File:Pos:Exec_Gtid_Set: mysql-bin.000002, 463, 1dbd5375-a4d9-11ea-9eef-000c29cf4cca:1,
81502f9e-a592-11ea-b912-000c2928707c:1-16
Thu Jun 11 20:59:59 2020 - [info] Executing master IP activate script:
Thu Jun 11 20:59:59 2020 - [info]   /usr/bin/master_ip_failover --command=start --ssh_user=root --orig_master_host=192.168.1.101 --orig_master_ip=192.168.1.101 --orig_master_port=3306 --new_master_host=192.168.1.102 --new_master_ip=192.168.1.102 --new_master_port=3306 --new_master_user='dba_mha'   --new_master_password=xxx
Option new_master_user does not take an argument
Option new_master_password does not take an argument


IN SCRIPT TEST====root|/sbin/ifconfig ens33:1 down==root|/sbin/ifconfig ens33:1 192.168.1.88/24===

Enabling the VIP - 192.168.1.88/24 on the new master - 192.168.1.102
Thu Jun 11 21:00:02 2020 - [info]  OK.
Thu Jun 11 21:00:02 2020 - [info] ** Finished master recovery successfully.
Thu Jun 11 21:00:02 2020 - [info] * Phase 3: Master Recovery Phase completed.
Thu Jun 11 21:00:02 2020 - [info]
Thu Jun 11 21:00:02 2020 - [info] * Phase 4: Slaves Recovery Phase..
Thu Jun 11 21:00:02 2020 - [info]
Thu Jun 11 21:00:02 2020 - [info]
Thu Jun 11 21:00:02 2020 - [info] * Phase 4.1: Starting Slaves in parallel..
Thu Jun 11 21:00:02 2020 - [info]
Thu Jun 11 21:00:02 2020 - [info] -- Slave recovery on host 192.168.1.103(192.168.1.103:3306) started, pid: 28647. Check tmp log /root/mha/192.168.1.103_3306_20200611205957.log if it takes time..
Thu Jun 11 21:00:04 2020 - [info]
Thu Jun 11 21:00:04 2020 - [info] Log messages from 192.168.1.103 ...
Thu Jun 11 21:00:04 2020 - [info]
Thu Jun 11 21:00:02 2020 - [info]  Resetting slave 192.168.1.103(192.168.1.103:3306) and starting replication from the new master 192.168.1.102(192.168.1.102:3306)..
Thu Jun 11 21:00:02 2020 - [info]  Executed CHANGE MASTER.
Thu Jun 11 21:00:03 2020 - [info]  Slave started.
Thu Jun 11 21:00:03 2020 - [info]  gtid_wait(1dbd5375-a4d9-11ea-9eef-000c29cf4cca:1,
81502f9e-a592-11ea-b912-000c2928707c:1-16) completed on 192.168.1.103(192.168.1.103:3306). Executed 0 events.
Thu Jun 11 21:00:04 2020 - [info] End of log messages from 192.168.1.103.
Thu Jun 11 21:00:04 2020 - [info] -- Slave on host 192.168.1.103(192.168.1.103:3306) started.
Thu Jun 11 21:00:04 2020 - [info] All new slave servers recovered successfully.
Thu Jun 11 21:00:04 2020 - [info]
Thu Jun 11 21:00:04 2020 - [info] * Phase 5: New master cleanup phase..
Thu Jun 11 21:00:04 2020 - [info]
Thu Jun 11 21:00:04 2020 - [info] Resetting slave info on the new master..
Thu Jun 11 21:00:04 2020 - [info]  192.168.1.102: Resetting slave info succeeded.
Thu Jun 11 21:00:04 2020 - [info] Master failover to 192.168.1.102(192.168.1.102:3306) completed successfully.
Thu Jun 11 21:00:04 2020 - [info]

----- Failover Report -----

mysql-mha: MySQL Master failover 192.168.1.101(192.168.1.101:3306) to 192.168.1.102(192.168.1.102:3306) succeeded

Master 192.168.1.101(192.168.1.101:3306) is down!

Check MHA Manager logs at localhost.localdomain:/root/mha/manager.log for details.

Started automated(non-interactive) failover.
Invalidated master IP address on 192.168.1.101(192.168.1.101:3306)
Selected 192.168.1.102(192.168.1.102:3306) as a new master.
192.168.1.102(192.168.1.102:3306): OK: Applying all logs succeeded.
192.168.1.102(192.168.1.102:3306): OK: Activated master IP address.
192.168.1.103(192.168.1.103:3306): OK: Slave started, replicating from 192.168.1.102(192.168.1.102:3306)
192.168.1.102(192.168.1.102:3306): Resetting slave info succeeded.
Master failover to 192.168.1.102(192.168.1.102:3306) completed successfully.

从日志可以看到192.168.1.102被提升为了新的master

脑裂问题

如果原来的master恢复了,会不会抢回master呢,还是会出现多个master

如果原来的master恢复后,还是master,那就是一个主从复制集群中出现了两个master,这样就出现了脑裂

且看MySQL如何解决这个问题。

重新启动192.168.1.101MySQL服务。因为新的master已经变成192.168.1.102了,所以在新的master上执行show slave hosts查看master上连接了几个slave

show slave hosts;
+-----------+------+------+-----------+--------------------------------------+
| Server_id | Host | Port | Master_id | Slave_UUID                           |
+-----------+------+------+-----------+--------------------------------------+
|       103 |      | 3306 |       102 | d6532e2a-a592-11ea-99c3-000c297f5b55 |
+-----------+------+------+-----------+--------------------------------------+
1 row in set (0.00 sec)

可以看到slave只剩下192.168.1.103,也就说原来的master恢复后,并没有抢回master,也没有成为slave

如果想让原来的master加入集群,需要重新配置

change master to master_host='192.168.1.102', master_user='repl', master_password='your password', master_auto_position=1;
start slave;

配置、启动之后,再次查看masterslave节点

 show slave hosts;
+-----------+------+------+-----------+--------------------------------------+
| Server_id | Host | Port | Master_id | Slave_UUID                           |
+-----------+------+------+-----------+--------------------------------------+
|       101 |      | 3306 |       102 | 81502f9e-a592-11ea-b912-000c2928707c |
|       103 |      | 3306 |       102 | d6532e2a-a592-11ea-99c3-000c297f5b55 |
+-----------+------+------+-----------+--------------------------------------+
2 rows in set (0.00 sec)

可以看到,原来的master就在故障恢复之后成功的加入了集群。

总结

MySQL的高可用非常重要,手动搭建一个MHA的高可用架构,可以让我们更好的理解MHA的工作原理,也让我们在面对MySQL故障时不至于束手无策。

發表評論
所有評論
還沒有人評論,想成為第一個評論的人麼? 請在上方評論欄輸入並且點擊發布.
相關文章