一、环境准备
下面来测试MHA提供的各种功能,包括自动切换、手动切换、在线切换三种常用模式。下面是MHA manager的启动配置。
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 |
[root@mha ~]# cat /etc/masterha/app1.cnf [server default] manager_log=/var/log/masterha/app1/manager.log manager_workdir=/var/log/masterha/app1 master_binlog_dir=/data/mysql/log/binlog/ password=123456 ping_interval=3 remote_workdir=/data/mysql/mha repl_password=123456 repl_user=mysql_slave secondary_check_script=/usr/local/bin/masterha_secondary_check -s mysql02 -s mysql03 --user=root --port=22 --master_host=mysql01 --master_ip=10.99.73.9 --master_port=3306 --ping_type=SELECT report_script=/usr/local/bin/send_report ssh_user=root ssh_port=22 user=mha [server1] candidate_master=1 hostname=10.99.73.9 port=3306 [server2] candidate_master=1 hostname=10.99.73.10 port=3306 [server3] hostname=10.99.73.11 no_master=1 port=3306 |
注意这里没有设置master_ip_failover_script与master_ip_online_change_script脚本,因为我们这里还没有涉及到VIP,所以可以没有这两个脚本,但仅限于测试。
通过前面简单实验,我们知道mha可以正常切换主从,但是当MySQL正常切换之后,应用中是无法正常自动切换的,这时就需要VIP了。VIP配置可以采用两种方式,一种通过keepalived的方式管理虚拟ip的浮动;另外一种通过脚本方式启动虚拟ip的方式(即不需要keepalived或者heartbeat类似的软件)。
一旦有了VIP,所以也就需要给master_ip_failover_script与master_ip_online_change_script提供脚本了。为了防止脑裂发生,推荐生产环境采用脚本的方式来管理虚拟ip,而不是使用keepalived来完成。但是当我们有了VIP之后,在切换时还有一个问题,我们知道网络模型OSI中二层数据链路是靠MAC地址通信的,所以上层无论三层或二层交换机是缓存有VIP的ARP(MAC和IP对应关系)表,交换机靠这个表来进行数据包的转换转发。所以当我们VIP切换后,上层交换机并不会立马刷新自己的ARP缓存表,这就需要我们人工干预了。
其实就是在我们进行切换时,可以通过使用一个arping命令实现。arping是在局域网中使用ARP请求判断目标主机是否在线的工具。你可以使用IP地址或MAC地址作为它的测试目标(因为APRING程序工作于OSI模型中的第二层,ARP协议的数据包无法通过路由器和网关,所以它只能用来检测局域网中的主机)。其命令语法如下:
1 |
Usage: arping [-fqbDUAV] [-c count] [-w timeout] [-I device] [-s source] destination |
常用参数介绍:
-b
:用于发送以太网广播帧(FFFFFFFFFFFF)。arping一开始使用广播地址,在收到响应后就使用unicast地址。
-q
:quiet output不显示任何信息。
-f
:表示在收到第一个响应报文后就退出。
-w timeout
:设定一个超时时间,单位是秒。如果到了指定时间,arping还没到完全收到响应则退出。
-c count
:表示发送指定数量的ARP请求数据包后就停止。如果指定了deadline选项,则arping会等待相同数量的arp响应包,直到超时为止。
-s source
:设定arping发送的arp数据包中的SPA字段的值。如果为空,则按下面处理,如果是DAD模式(冲突地址探测),则设置为0.0.0.0,如果是Unsolicited ARP模式(Gratutious ARP)则设置为目标地址,否则从路由表得出。
-I interface
:设置ping使用的网络接口。
整体脚本代码如下,根据环境不同可能需要稍微修改,比如ssh连接的端口号和用户名,能提取的都已经变量化了。
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 |
#!/usr/bin/env perl use strict; use warnings FATAL => 'all'; use Getopt::Long; my ( $orig_master_port, $orig_master_host, $orig_master_ip, $orig_master_password, $orig_master_user, $orig_master_is_new_slave, $orig_master_ssh_user, $orig_master_ssh_port, $new_master_user, $new_master_host, $new_master_ip, $new_master_port, $new_master_ssh_user, $new_master_ssh_port, $new_master_password, $ssh_user, $ssh_port, $command, $vip, $interface ,$gateway ); GetOptions( 'command=s' => \$command, 'new_master_host=s' => \$new_master_host, 'new_master_ip=s' => \$new_master_ip, 'new_master_port=i' => \$new_master_port, 'new_master_user=s' => \$new_master_user, 'new_master_password=s' => \$new_master_password, 'new_master_ssh_user=s' => \$new_master_ssh_user, 'new_master_ssh_port=s' => \$new_master_ssh_port, 'orig_master_host=s' => \$orig_master_host, 'orig_master_ip=s' => \$orig_master_ip, 'orig_master_port=i' => \$orig_master_port, 'orig_master_user=s' => \$orig_master_user, 'orig_master_password=s' => \$orig_master_password, 'orig_master_is_new_slave' => \$orig_master_is_new_slave, 'orig_master_ssh_user=s' => \$orig_master_ssh_user, 'orig_master_ssh_port=s' => \$orig_master_ssh_port, 'ssh_user=s' => \$ssh_user, 'ssh_port=i' => \$ssh_port, 'vip=s' => \$vip, 'gateway=s' => \$gateway, 'interface=s' => \$interface, ); $new_master_ssh_user = "root"; $new_master_ssh_port = "22"; $orig_master_ssh_user = "root"; $orig_master_ssh_port = "22"; my $key = '1'; my @vipnomask=(split(/\//,$vip))[0]; my $ssh_start_vip = "sudo /sbin/ifconfig $interface:$key $vip && sudo /sbin/arping -I $interface -c 1 -s @vipnomask $gateway"; my $ssh_stop_vip = "sudo /sbin/ifconfig $interface:$key down"; exit &main(); sub main { print "\n\nIN SCRIPT TEST====$ssh_stop_vip==$ssh_start_vip===\n\n"; if ( $command eq "stop" || $command eq "stopssh" ) { my $exit_code = 1; eval { print "Disabling the VIP on old master: $orig_master_host \n"; &stop_vip(); $exit_code = 0; }; if ($@) { warn "Got Error: $@\n"; exit $exit_code; } exit $exit_code; } elsif ( $command eq "start" ) { my $exit_code = 10; eval { print "Enabling the VIP - $vip on the new master - $new_master_host \n"; &start_vip(); $exit_code = 0; }; if ($@) { warn $@; exit $exit_code; } exit $exit_code; } elsif ( $command eq "status" ) { print "Checking the Status of the script.. OK \n"; exit 0; } else { &usage(); exit 1; } } sub start_vip() { `ssh -p $new_master_ssh_port $new_master_ssh_user\@$new_master_host \" $ssh_start_vip \"`; } sub stop_vip() { `ssh -p $orig_master_ssh_port $orig_master_ssh_user\@$orig_master_host \" $ssh_stop_vip \"`; } sub usage { print "Usage: master_ip_failover --command=start|stop|stopssh|status --orig_master_host=host --orig_master_ip=ip --orig_master_port=port --new_master_host=host --new_master_ip=ip --new_master_port=port\n"; } |
使用方式如下,自动故障切换和在线切换都是使用这个脚本即可,配置如下:
1 2 |
master_ip_failover_script=/usr/local/bin/master_ip_online_change --interface=eth0 --vip=10.99.73.100/32 --gateway=10.99.73.1 master_ip_online_change_script=/usr/local/bin/master_ip_online_change --interface=eth0 --vip=10.99.73.100/32 --gateway=10.99.73.1 |
二、测试自动Failover(主要测试new master是补谁的日志)
必须先启动MHA Manager,否则无法自动切换,当然手动切换不需要开启MHA Manager监控。
1 |
[root@mha ~]# nohup masterha_manager --conf=/etc/masterha/app1.cnf --remove_dead_master_conf --ignore_last_failover & |
检查主从复制状态
1 2 3 4 5 6 7 |
[root@mha ~]# masterha_check_repl --conf=/etc/masterha/app1.cnf ........... 10.99.73.9(10.99.73.9:3306) (current master) +--10.99.73.10(10.99.73.10:3306) +--10.99.73.11(10.99.73.11:3306) ........... MySQL Replication Health is OK. |
自动failover模拟测试的操作步骤如下。
1. 使用sysbench生成测试数据
自行安装 sysbench 工具。
在主库(10.99.73.9)上进行sysbench数据生成,在sbtest库下生成sbtest表,共100W记录。
1 2 3 4 5 6 7 8 9 10 11 12 13 14 |
$ sysbench /usr/share/sysbench/oltp_read_only.lua \ --mysql-host=127.0.0.1 \ --mysql-port=3306 \ --mysql-user=mha \ --mysql-password=123456 \ --mysql-socket=/data/mysql/3306/mysql.sock \ --mysql-db=sbtest \ --db-driver=mysql \ --tables=1 \ --table-size=1000000 \ --report-interval=10 \ --threads=128 \ --time=120 \ prepare |
如果是需要产生大量的binlog,使用sysbench模拟压测,持续时间为3分钟,产生大量的binlog。使用如下语句(这里我们不使用):
1 2 3 4 5 6 7 8 9 10 11 12 13 14 |
$ sysbench /usr/share/sysbench/oltp_read_only.lua \ --mysql-host=127.0.0.1 \ --mysql-port=3306 \ --mysql-user=mha \ --mysql-password=123456 \ --mysql-socket=/data/mysql/3306/mysql.sock \ --mysql-db=sbtest \ --db-driver=mysql \ --tables=1 \ --table-size=1000000 \ --report-interval=10 \ --threads=128 \ --time=120 \ run |
2. 停掉slave io线程(10.99.73.10)
一定要在master执行sysbench结束之前停掉slave io线程,这样我们才可以模拟查看当10.99.73.10成为new master之后是否会同步old master的binlog。
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 |
mysql> stop slave io_thread; Query OK, 0 rows affected (0.10 sec) mysql> select count(*) from sbtest.sbtest; +----------+ | count(*) | +----------+ | 730000 | +----------+ 1 row in set (0.24 sec) mysql> show master status; +------------------+----------+--------------+------------------+-------------------+ | File | Position | Binlog_Do_DB | Binlog_Ignore_DB | Executed_Gtid_Set | +------------------+----------+--------------+------------------+-------------------+ | mysql-bin.000002 | 45473439 | | | | +------------------+----------+--------------+------------------+-------------------+ 1 row in set (0.00 sec) |
另外一台slave我们没有停止io线程,所以还在继续接收执行日志。
3. 停掉slave io线程(10.99.73.11)
你可以一直使用show slave status命令查看当前此slave的状态,只要比10.99.73.10同步的数据多就可以停掉io线程了,但是记住不要把master的数据完全同步完了。
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 |
mysql> stop slave io_thread; Query OK, 0 rows affected (0.03 sec) mysql> select count(*) from sbtest.sbtest; +----------+ | count(*) | +----------+ | 780000 | +----------+ 1 row in set (0.21 sec) mysql> show master status; +------------------+----------+--------------+------------------+-------------------+ | File | Position | Binlog_Do_DB | Binlog_Ignore_DB | Executed_Gtid_Set | +------------------+----------+--------------+------------------+-------------------+ | mysql-bin.000002 | 48587999 | | | | +------------------+----------+--------------+------------------+-------------------+ 1 row in set (0.00 sec) |
4. 杀掉主库mysql进程
模拟主库发生故障,进行自动failover操作。
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 |
mysql> select count(*) from sbtest.sbtest; +----------+ | count(*) | +----------+ | 1000000 | +----------+ 1 row in set (0.20 sec) mysql> show master status; +------------------+-----------+--------------+------------------+-------------------+ | File | Position | Binlog_Do_DB | Binlog_Ignore_DB | Executed_Gtid_Set | +------------------+-----------+--------------+------------------+-------------------+ | mysql-bin.000002 | 123299163 | | | | +------------------+-----------+--------------+------------------+-------------------+ 1 row in set (0.00 sec) |
1 |
$ mysqladmin -S /data/mysql/3306/mysql.sock shutdown |
PS:可以看出master是100万数据,slave(10.99.73.10)只有90万数据,而slave(10.99.73.11)有97万数据,下面看整个MHA的切换过程。
5. 查看MHA切换日志
了解整个切换过程,在10.99.73.7上查看日志。
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 |
[root@mha ~]# cat /var/log/masterha/app1/manager.log Mon Feb 20 15:22:21 2017 - [warning] Got error on MySQL select ping: 2013 (Lost connection to MySQL server during query) Mon Feb 20 15:22:21 2017 - [info] Executing secondary network check script: /usr/local/bin/masterha_secondary_check -s mysql02 -s mysql03 --user=root --master_host=mysql01 --master_ip=10.99.73.9 --master_port=3306 --user=root --master_host=10.99.73.9 --master_ip=10.99.73.9 --master_port=3306 --master_user=mha --master_password=123456 --ping_type=SELECT Mon Feb 20 15:22:21 2017 - [info] Executing SSH check script: save_binary_logs --command=test --start_pos=4 --binlog_dir=/data/mysql/log/binlog --output_file=/data/mysql/mha/save_binary_logs_test --manager_version=0.57 --binlog_prefix=mysql-bin Mon Feb 20 15:22:21 2017 - [info] HealthCheck: SSH to 10.99.73.9 is reachable. Monitoring server mysql02 is reachable, Master is not reachable from mysql02. OK. Monitoring server mysql03 is reachable, Master is not reachable from mysql03. OK. Mon Feb 20 15:22:21 2017 - [info] Master is not reachable from all other monitoring servers. Failover should start. Mon Feb 20 15:22:24 2017 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '10.99.73.9' (111)) Mon Feb 20 15:22:24 2017 - [warning] Connection failed 2 time(s).. Mon Feb 20 15:22:27 2017 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '10.99.73.9' (111)) Mon Feb 20 15:22:27 2017 - [warning] Connection failed 3 time(s).. Mon Feb 20 15:22:30 2017 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '10.99.73.9' (111)) Mon Feb 20 15:22:30 2017 - [warning] Connection failed 4 time(s).. Mon Feb 20 15:22:30 2017 - [warning] Master is not reachable from health checker! Mon Feb 20 15:22:30 2017 - [warning] Master 10.99.73.9(10.99.73.9:3306) is not reachable! Mon Feb 20 15:22:30 2017 - [warning] SSH is reachable. Mon Feb 20 15:22:30 2017 - [info] Connecting to a master server failed. Reading configuration file /etc/masterha_default.cnf and /etc/masterha/app1.cnf again, and trying to connect to all servers to check server status.. Mon Feb 20 15:22:30 2017 - [warning] Global configuration file /etc/masterha_default.cnf not found. Skipping. Mon Feb 20 15:22:30 2017 - [info] Reading application default configuration from /etc/masterha/app1.cnf.. Mon Feb 20 15:22:30 2017 - [info] Reading server configuration from /etc/masterha/app1.cnf.. Mon Feb 20 15:22:30 2017 - [info] GTID failover mode = 0 Mon Feb 20 15:22:30 2017 - [info] Dead Servers: Mon Feb 20 15:22:30 2017 - [info] 10.99.73.9(10.99.73.9:3306) Mon Feb 20 15:22:30 2017 - [info] Alive Servers: Mon Feb 20 15:22:30 2017 - [info] 10.99.73.10(10.99.73.10:3306) Mon Feb 20 15:22:30 2017 - [info] 10.99.73.11(10.99.73.11:3306) Mon Feb 20 15:22:30 2017 - [info] Alive Slaves: Mon Feb 20 15:22:30 2017 - [info] 10.99.73.10(10.99.73.10:3306) Version=5.7.17-log (oldest major version between slaves) log-bin:enabled Mon Feb 20 15:22:30 2017 - [info] Replicating from 10.99.73.9(10.99.73.9:3306) Mon Feb 20 15:22:30 2017 - [info] Primary candidate for the new Master (candidate_master is set) Mon Feb 20 15:22:30 2017 - [info] 10.99.73.11(10.99.73.11:3306) Version=5.7.17-log (oldest major version between slaves) log-bin:enabled Mon Feb 20 15:22:30 2017 - [info] Replicating from 10.99.73.9(10.99.73.9:3306) Mon Feb 20 15:22:30 2017 - [info] Not candidate for the new Master (no_master is set) Mon Feb 20 15:22:30 2017 - [info] Checking slave configurations.. Mon Feb 20 15:22:30 2017 - [info] Checking replication filtering settings.. Mon Feb 20 15:22:30 2017 - [info] Replication filtering check ok. Mon Feb 20 15:22:30 2017 - [info] Master is down! Mon Feb 20 15:22:30 2017 - [info] Terminating monitoring script. Mon Feb 20 15:22:30 2017 - [info] Got exit code 20 (Master dead). Mon Feb 20 15:22:30 2017 - [info] MHA::MasterFailover version 0.57. Mon Feb 20 15:22:30 2017 - [info] Starting master failover. Mon Feb 20 15:22:30 2017 - [info] Mon Feb 20 15:22:30 2017 - [info] * Phase 1: Configuration Check Phase.. Mon Feb 20 15:22:30 2017 - [info] Mon Feb 20 15:22:30 2017 - [info] GTID failover mode = 0 Mon Feb 20 15:22:30 2017 - [info] Dead Servers: Mon Feb 20 15:22:30 2017 - [info] 10.99.73.9(10.99.73.9:3306) Mon Feb 20 15:22:30 2017 - [info] Checking master reachability via MySQL(double check)... Mon Feb 20 15:22:30 2017 - [info] ok. Mon Feb 20 15:22:30 2017 - [info] Alive Servers: Mon Feb 20 15:22:30 2017 - [info] 10.99.73.10(10.99.73.10:3306) Mon Feb 20 15:22:30 2017 - [info] 10.99.73.11(10.99.73.11:3306) Mon Feb 20 15:22:30 2017 - [info] Alive Slaves: Mon Feb 20 15:22:30 2017 - [info] 10.99.73.10(10.99.73.10:3306) Version=5.7.17-log (oldest major version between slaves) log-bin:enabled Mon Feb 20 15:22:30 2017 - [info] Replicating from 10.99.73.9(10.99.73.9:3306) Mon Feb 20 15:22:30 2017 - [info] Primary candidate for the new Master (candidate_master is set) Mon Feb 20 15:22:30 2017 - [info] 10.99.73.11(10.99.73.11:3306) Version=5.7.17-log (oldest major version between slaves) log-bin:enabled Mon Feb 20 15:22:30 2017 - [info] Replicating from 10.99.73.9(10.99.73.9:3306) Mon Feb 20 15:22:30 2017 - [info] Not candidate for the new Master (no_master is set) Mon Feb 20 15:22:30 2017 - [info] Starting Non-GTID based failover. Mon Feb 20 15:22:30 2017 - [info] Mon Feb 20 15:22:30 2017 - [info] ** Phase 1: Configuration Check Phase completed. Mon Feb 20 15:22:30 2017 - [info] Mon Feb 20 15:22:30 2017 - [info] * Phase 2: Dead Master Shutdown Phase.. Mon Feb 20 15:22:30 2017 - [info] Mon Feb 20 15:22:30 2017 - [info] Forcing shutdown so that applications never connect to the current master.. Mon Feb 20 15:22:30 2017 - [warning] master_ip_failover_script is not set. Skipping invalidating dead master IP address. Mon Feb 20 15:22:30 2017 - [warning] shutdown_script is not set. Skipping explicit shutting down of the dead master. Mon Feb 20 15:22:30 2017 - [info] * Phase 2: Dead Master Shutdown Phase completed. Mon Feb 20 15:22:30 2017 - [info] Mon Feb 20 15:22:30 2017 - [info] * Phase 3: Master Recovery Phase.. Mon Feb 20 15:22:30 2017 - [info] Mon Feb 20 15:22:30 2017 - [info] * Phase 3.1: Getting Latest Slaves Phase.. Mon Feb 20 15:22:30 2017 - [info] Mon Feb 20 15:22:30 2017 - [info] The latest binary log file/position on all slaves is mysql-bin.000002:97142702 Mon Feb 20 15:22:30 2017 - [info] Latest slaves (Slaves that received relay log files to the latest): Mon Feb 20 15:22:30 2017 - [info] 10.99.73.11(10.99.73.11:3306) Version=5.7.17-log (oldest major version between slaves) log-bin:enabled Mon Feb 20 15:22:30 2017 - [info] Replicating from 10.99.73.9(10.99.73.9:3306) Mon Feb 20 15:22:30 2017 - [info] Not candidate for the new Master (no_master is set) Mon Feb 20 15:22:30 2017 - [info] The oldest binary log file/position on all slaves is mysql-bin.000002:90008622 Mon Feb 20 15:22:30 2017 - [info] Oldest slaves: Mon Feb 20 15:22:30 2017 - [info] 10.99.73.10(10.99.73.10:3306) Version=5.7.17-log (oldest major version between slaves) log-bin:enabled Mon Feb 20 15:22:30 2017 - [info] Replicating from 10.99.73.9(10.99.73.9:3306) Mon Feb 20 15:22:30 2017 - [info] Primary candidate for the new Master (candidate_master is set) Mon Feb 20 15:22:30 2017 - [info] Mon Feb 20 15:22:30 2017 - [info] * Phase 3.2: Saving Dead Master's Binlog Phase.. Mon Feb 20 15:22:30 2017 - [info] Mon Feb 20 15:22:30 2017 - [info] Fetching dead master's binary logs.. Mon Feb 20 15:22:30 2017 - [info] Executing command on the dead master 10.99.73.9(10.99.73.9:3306): save_binary_logs --command=save --start_file=mysql-bin.000002 --start_pos=97142702 --binlog_dir=/data/mysql/log/binlog --output_file=/data/mysql/mha/saved_master_binlog_from_10.99.73.9_3306_20170220152230.binlog --handle_raw_binlog=1 --disable_log_bin=0 --manager_version=0.57 Creating /data/mysql/mha if not exists.. ok. Concat binary/relay logs from mysql-bin.000002 pos 97142702 to mysql-bin.000002 EOF into /data/mysql/mha/saved_master_binlog_from_10.99.73.9_3306_20170220152230.binlog .. Binlog Checksum enabled Dumping binlog format description event, from position 0 to 154.. ok. Dumping effective binlog data from /data/mysql/log/binlog/mysql-bin.000002 position 97142702 to tail(123299186).. ok. Binlog Checksum enabled Concat succeeded. Mon Feb 20 15:22:31 2017 - [info] scp from root@10.99.73.9:/data/mysql/mha/saved_master_binlog_from_10.99.73.9_3306_20170220152230.binlog to local:/var/log/masterha/app1/saved_master_binlog_from_10.99.73.9_3306_20170220152230.binlog succeeded. Mon Feb 20 15:22:31 2017 - [info] HealthCheck: SSH to 10.99.73.10 is reachable. Mon Feb 20 15:22:32 2017 - [info] HealthCheck: SSH to 10.99.73.11 is reachable. Mon Feb 20 15:22:32 2017 - [info] Mon Feb 20 15:22:32 2017 - [info] * Phase 3.3: Determining New Master Phase.. Mon Feb 20 15:22:32 2017 - [info] Mon Feb 20 15:22:32 2017 - [info] Finding the latest slave that has all relay logs for recovering other slaves.. Mon Feb 20 15:22:32 2017 - [info] Checking whether 10.99.73.11 has relay logs from the oldest position.. Mon Feb 20 15:22:32 2017 - [info] Executing command: apply_diff_relay_logs --command=find --latest_mlf=mysql-bin.000002 --latest_rmlp=97142702 --target_mlf=mysql-bin.000002 --target_rmlp=90008622 --server_id=103309 --workdir=/data/mysql/mha --timestamp=20170220152230 --manager_version=0.57 --relay_dir=/data/mysql/3306/log/relaylog --current_relay_log=relay-log.000003 : Relay log found at /data/mysql/3306/log/relaylog, up to relay-log.000003 Fast relay log position search succeeded. Target relay log file/position found. start_file:relay-log.000003, start_pos:90008835. Target relay log FOUND! Mon Feb 20 15:22:32 2017 - [info] OK. 10.99.73.11 has all relay logs. Mon Feb 20 15:22:32 2017 - [info] Searching new master from slaves.. Mon Feb 20 15:22:32 2017 - [info] Candidate masters from the configuration file: Mon Feb 20 15:22:32 2017 - [info] 10.99.73.10(10.99.73.10:3306) Version=5.7.17-log (oldest major version between slaves) log-bin:enabled Mon Feb 20 15:22:32 2017 - [info] Replicating from 10.99.73.9(10.99.73.9:3306) Mon Feb 20 15:22:32 2017 - [info] Primary candidate for the new Master (candidate_master is set) Mon Feb 20 15:22:32 2017 - [info] Non-candidate masters: Mon Feb 20 15:22:32 2017 - [info] 10.99.73.11(10.99.73.11:3306) Version=5.7.17-log (oldest major version between slaves) log-bin:enabled Mon Feb 20 15:22:32 2017 - [info] Replicating from 10.99.73.9(10.99.73.9:3306) Mon Feb 20 15:22:32 2017 - [info] Not candidate for the new Master (no_master is set) Mon Feb 20 15:22:32 2017 - [info] Searching from candidate_master slaves which have received the latest relay log events.. Mon Feb 20 15:22:32 2017 - [info] Not found. Mon Feb 20 15:22:32 2017 - [info] Searching from all candidate_master slaves.. Mon Feb 20 15:22:32 2017 - [info] New master is 10.99.73.10(10.99.73.10:3306) Mon Feb 20 15:22:32 2017 - [info] Starting master failover.. Mon Feb 20 15:22:32 2017 - [info] From: 10.99.73.9(10.99.73.9:3306) (current master) +--10.99.73.10(10.99.73.10:3306) +--10.99.73.11(10.99.73.11:3306) To: 10.99.73.10(10.99.73.10:3306) (new master) +--10.99.73.11(10.99.73.11:3306) Mon Feb 20 15:22:32 2017 - [info] Mon Feb 20 15:22:32 2017 - [info] * Phase 3.3: New Master Diff Log Generation Phase.. Mon Feb 20 15:22:32 2017 - [info] Mon Feb 20 15:22:32 2017 - [info] Server 10.99.73.10 received relay logs up to: mysql-bin.000002:90008622 Mon Feb 20 15:22:32 2017 - [info] Need to get diffs from the latest slave(10.99.73.11) up to: mysql-bin.000002:97142702 (using the latest slave's relay logs) Mon Feb 20 15:22:32 2017 - [info] Connecting to the latest slave host 10.99.73.11, generating diff relay log files.. Mon Feb 20 15:22:32 2017 - [info] Executing command: apply_diff_relay_logs --command=generate_and_send --scp_user=root --scp_host=10.99.73.10 --latest_mlf=mysql-bin.000002 --latest_rmlp=97142702 --target_mlf=mysql-bin.000002 --target_rmlp=90008622 --server_id=103309 --diff_file_readtolatest=/data/mysql/mha/relay_from_read_to_latest_10.99.73.10_3306_20170220152230.binlog --workdir=/data/mysql/mha --timestamp=20170220152230 --handle_raw_binlog=1 --disable_log_bin=0 --manager_version=0.57 --relay_dir=/data/mysql/3306/log/relaylog --current_relay_log=relay-log.000003 Mon Feb 20 15:22:33 2017 - [info] Relay log found at /data/mysql/3306/log/relaylog, up to relay-log.000003 Fast relay log position search succeeded. Target relay log file/position found. start_file:relay-log.000003, start_pos:90008835. Concat binary/relay logs from relay-log.000003 pos 90008835 to relay-log.000003 EOF into /data/mysql/mha/relay_from_read_to_latest_10.99.73.10_3306_20170220152230.binlog .. Binlog Checksum enabled Binlog Checksum enabled Dumping binlog format description event, from position 0 to 367.. ok. Dumping effective binlog data from /data/mysql/3306/log/relaylog/relay-log.000003 position 90008835 to tail(97142915).. ok. Binlog Checksum enabled Binlog Checksum enabled Concat succeeded. Generating diff relay log succeeded. Saved at /data/mysql/mha/relay_from_read_to_latest_10.99.73.10_3306_20170220152230.binlog . scp mysql03:/data/mysql/mha/relay_from_read_to_latest_10.99.73.10_3306_20170220152230.binlog to root@10.99.73.10(22) succeeded. Mon Feb 20 15:22:33 2017 - [info] Generating diff files succeeded. Mon Feb 20 15:22:33 2017 - [info] Sending binlog.. Mon Feb 20 15:22:34 2017 - [info] scp from local:/var/log/masterha/app1/saved_master_binlog_from_10.99.73.9_3306_20170220152230.binlog to root@10.99.73.10:/data/mysql/mha/saved_master_binlog_from_10.99.73.9_3306_20170220152230.binlog succeeded. Mon Feb 20 15:22:34 2017 - [info] Mon Feb 20 15:22:34 2017 - [info] * Phase 3.4: Master Log Apply Phase.. Mon Feb 20 15:22:34 2017 - [info] Mon Feb 20 15:22:34 2017 - [info] *NOTICE: If any error happens from this phase, manual recovery is needed. Mon Feb 20 15:22:34 2017 - [info] Starting recovery on 10.99.73.10(10.99.73.10:3306).. Mon Feb 20 15:22:34 2017 - [info] Generating diffs succeeded. Mon Feb 20 15:22:34 2017 - [info] Waiting until all relay logs are applied. Mon Feb 20 15:22:34 2017 - [info] done. Mon Feb 20 15:22:34 2017 - [info] Getting slave status.. Mon Feb 20 15:22:34 2017 - [info] This slave(10.99.73.10)'s Exec_Master_Log_Pos equals to Read_Master_Log_Pos(mysql-bin.000002:90008622). No need to recover from Exec_Master_Log_Pos. Mon Feb 20 15:22:34 2017 - [info] Connecting to the target slave host 10.99.73.10, running recover script.. Mon Feb 20 15:22:34 2017 - [info] Executing command: apply_diff_relay_logs --command=apply --slave_user='mha' --slave_host=10.99.73.10 --slave_ip=10.99.73.10 --slave_port=3306 --apply_files=/data/mysql/mha/relay_from_read_to_latest_10.99.73.10_3306_20170220152230.binlog,/data/mysql/mha/saved_master_binlog_from_10.99.73.9_3306_20170220152230.binlog --workdir=/data/mysql/mha --target_version=5.7.17-log --timestamp=20170220152230 --handle_raw_binlog=1 --disable_log_bin=0 --manager_version=0.57 --slave_pass=xxx Mon Feb 20 15:22:37 2017 - [info] Concat all apply files to /data/mysql/mha/total_binlog_for_10.99.73.10_3306.20170220152230.binlog .. Copying the first binlog file /data/mysql/mha/relay_from_read_to_latest_10.99.73.10_3306_20170220152230.binlog to /data/mysql/mha/total_binlog_for_10.99.73.10_3306.20170220152230.binlog.. ok. Dumping binlog head events (rotate events), skipping format description events from /data/mysql/mha/saved_master_binlog_from_10.99.73.9_3306_20170220152230.binlog.. Binlog Checksum enabled dumped up to pos 154. ok. /data/mysql/mha/saved_master_binlog_from_10.99.73.9_3306_20170220152230.binlog has effective binlog events from pos 154. Dumping effective binlog data from /data/mysql/mha/saved_master_binlog_from_10.99.73.9_3306_20170220152230.binlog position 154 to tail(26156638).. ok. Concat succeeded. All apply target binary logs are concatinated at /data/mysql/mha/total_binlog_for_10.99.73.10_3306.20170220152230.binlog . MySQL client version is 5.7.17. Using --binary-mode. Applying differential binary/relay log files /data/mysql/mha/relay_from_read_to_latest_10.99.73.10_3306_20170220152230.binlog,/data/mysql/mha/saved_master_binlog_from_10.99.73.9_3306_20170220152230.binlog on 10.99.73.10:3306. This may take long time... Applying log files succeeded. Mon Feb 20 15:22:37 2017 - [info] All relay logs were successfully applied. Mon Feb 20 15:22:37 2017 - [info] Getting new master's binlog name and position.. Mon Feb 20 15:22:37 2017 - [info] mysql-bin.000002:62292198 Mon Feb 20 15:22:37 2017 - [info] All other slaves should start replication from here. Statement should be: CHANGE MASTER TO MASTER_HOST='10.99.73.10', MASTER_PORT=3306, MASTER_LOG_FILE='mysql-bin.000002', MASTER_LOG_POS=62292198, MASTER_USER='mysql_slave', MASTER_PASSWORD='xxx'; Mon Feb 20 15:22:37 2017 - [warning] master_ip_failover_script is not set. Skipping taking over new master IP address. Mon Feb 20 15:22:37 2017 - [info] Setting read_only=0 on 10.99.73.10(10.99.73.10:3306).. Mon Feb 20 15:22:37 2017 - [info] ok. Mon Feb 20 15:22:37 2017 - [info] ** Finished master recovery successfully. Mon Feb 20 15:22:37 2017 - [info] * Phase 3: Master Recovery Phase completed. Mon Feb 20 15:22:37 2017 - [info] Mon Feb 20 15:22:37 2017 - [info] * Phase 4: Slaves Recovery Phase.. Mon Feb 20 15:22:37 2017 - [info] Mon Feb 20 15:22:37 2017 - [info] * Phase 4.1: Starting Parallel Slave Diff Log Generation Phase.. Mon Feb 20 15:22:37 2017 - [info] Mon Feb 20 15:22:37 2017 - [info] -- Slave diff file generation on host 10.99.73.11(10.99.73.11:3306) started, pid: 9926. Check tmp log /var/log/masterha/app1/10.99.73.11_3306_20170220152230.log if it takes time.. Mon Feb 20 15:22:37 2017 - [info] Mon Feb 20 15:22:37 2017 - [info] Log messages from 10.99.73.11 ... Mon Feb 20 15:22:37 2017 - [info] Mon Feb 20 15:22:37 2017 - [info] This server has all relay logs. No need to generate diff files from the latest slave. Mon Feb 20 15:22:37 2017 - [info] End of log messages from 10.99.73.11. Mon Feb 20 15:22:37 2017 - [info] -- 10.99.73.11(10.99.73.11:3306) has the latest relay log events. Mon Feb 20 15:22:37 2017 - [info] Generating relay diff files from the latest slave succeeded. Mon Feb 20 15:22:37 2017 - [info] Mon Feb 20 15:22:37 2017 - [info] * Phase 4.2: Starting Parallel Slave Log Apply Phase.. Mon Feb 20 15:22:37 2017 - [info] Mon Feb 20 15:22:37 2017 - [info] -- Slave recovery on host 10.99.73.11(10.99.73.11:3306) started, pid: 9928. Check tmp log /var/log/masterha/app1/10.99.73.11_3306_20170220152230.log if it takes time.. Mon Feb 20 15:22:40 2017 - [info] Mon Feb 20 15:22:40 2017 - [info] Log messages from 10.99.73.11 ... Mon Feb 20 15:22:40 2017 - [info] Mon Feb 20 15:22:37 2017 - [info] Sending binlog.. Mon Feb 20 15:22:37 2017 - [info] scp from local:/var/log/masterha/app1/saved_master_binlog_from_10.99.73.9_3306_20170220152230.binlog to root@10.99.73.11:/data/mysql/mha/saved_master_binlog_from_10.99.73.9_3306_20170220152230.binlog succeeded. Mon Feb 20 15:22:37 2017 - [info] Starting recovery on 10.99.73.11(10.99.73.11:3306).. Mon Feb 20 15:22:37 2017 - [info] Generating diffs succeeded. Mon Feb 20 15:22:37 2017 - [info] Waiting until all relay logs are applied. Mon Feb 20 15:22:37 2017 - [info] done. Mon Feb 20 15:22:37 2017 - [info] Getting slave status.. Mon Feb 20 15:22:37 2017 - [info] This slave(10.99.73.11)'s Exec_Master_Log_Pos(mysql-bin.000002:96173537) does not equal to Read_Master_Log_Pos(mysql-bin.000002:97142702). It is likely that relay log was cut during transaction. Need to recover from Exec_Master_Log_Pos. Mon Feb 20 15:22:37 2017 - [info] Saving local relay logs from exec pos to read pos on 10.99.73.11: from relay-log.000003:96173750 to the end of the relay log.. Mon Feb 20 15:22:37 2017 - [info] Executing command : save_binary_logs --command=save --start_file=relay-log.000003 --start_pos=96173750 --output_file=/data/mysql/mha/relay_from_exec_to_read_10.99.73.11_3306_20170220152230.binlog --handle_raw_binlog=1 --disable_log_bin=0 --manager_version=0.57 --binlog_dir=/data/mysql/3306/log/relaylog Mon Feb 20 15:22:38 2017 - [info] Creating /data/mysql/mha if not exists.. ok. Concat binary/relay logs from relay-log.000003 pos 96173750 to relay-log.000003 EOF into /data/mysql/mha/relay_from_exec_to_read_10.99.73.11_3306_20170220152230.binlog .. Binlog Checksum enabled Binlog Checksum enabled Dumping binlog format description event, from position 0 to 367.. ok. Dumping effective binlog data from /data/mysql/3306/log/relaylog/relay-log.000003 position 96173750 to tail(97142915).. ok. Binlog Checksum enabled Binlog Checksum enabled Concat succeeded. Mon Feb 20 15:22:38 2017 - [info] Connecting to the target slave host 10.99.73.11, running recover script.. Mon Feb 20 15:22:38 2017 - [info] Executing command: apply_diff_relay_logs --command=apply --slave_user='mha' --slave_host=10.99.73.11 --slave_ip=10.99.73.11 --slave_port=3306 --apply_files=/data/mysql/mha/relay_from_exec_to_read_10.99.73.11_3306_20170220152230.binlog,/data/mysql/mha/saved_master_binlog_from_10.99.73.9_3306_20170220152230.binlog --workdir=/data/mysql/mha --target_version=5.7.17-log --timestamp=20170220152230 --handle_raw_binlog=1 --disable_log_bin=0 --manager_version=0.57 --slave_pass=xxx Mon Feb 20 15:22:40 2017 - [info] Concat all apply files to /data/mysql/mha/total_binlog_for_10.99.73.11_3306.20170220152230.binlog .. Copying the first binlog file /data/mysql/mha/relay_from_exec_to_read_10.99.73.11_3306_20170220152230.binlog to /data/mysql/mha/total_binlog_for_10.99.73.11_3306.20170220152230.binlog.. ok. Dumping binlog head events (rotate events), skipping format description events from /data/mysql/mha/saved_master_binlog_from_10.99.73.9_3306_20170220152230.binlog.. Binlog Checksum enabled dumped up to pos 154. ok. /data/mysql/mha/saved_master_binlog_from_10.99.73.9_3306_20170220152230.binlog has effective binlog events from pos 154. Dumping effective binlog data from /data/mysql/mha/saved_master_binlog_from_10.99.73.9_3306_20170220152230.binlog position 154 to tail(26156638).. ok. Concat succeeded. All apply target binary logs are concatinated at /data/mysql/mha/total_binlog_for_10.99.73.11_3306.20170220152230.binlog . MySQL client version is 5.7.17. Using --binary-mode. Applying differential binary/relay log files /data/mysql/mha/relay_from_exec_to_read_10.99.73.11_3306_20170220152230.binlog,/data/mysql/mha/saved_master_binlog_from_10.99.73.9_3306_20170220152230.binlog on 10.99.73.11:3306. This may take long time... Applying log files succeeded. Mon Feb 20 15:22:40 2017 - [info] All relay logs were successfully applied. Mon Feb 20 15:22:40 2017 - [info] Resetting slave 10.99.73.11(10.99.73.11:3306) and starting replication from the new master 10.99.73.10(10.99.73.10:3306).. Mon Feb 20 15:22:40 2017 - [info] Executed CHANGE MASTER. Mon Feb 20 15:22:40 2017 - [info] Slave started. Mon Feb 20 15:22:40 2017 - [info] End of log messages from 10.99.73.11. Mon Feb 20 15:22:40 2017 - [info] -- Slave recovery on host 10.99.73.11(10.99.73.11:3306) succeeded. Mon Feb 20 15:22:40 2017 - [info] All new slave servers recovered successfully. Mon Feb 20 15:22:40 2017 - [info] Mon Feb 20 15:22:40 2017 - [info] * Phase 5: New master cleanup phase.. Mon Feb 20 15:22:40 2017 - [info] Mon Feb 20 15:22:40 2017 - [info] Resetting slave info on the new master.. Mon Feb 20 15:22:40 2017 - [info] 10.99.73.10: Resetting slave info succeeded. Mon Feb 20 15:22:40 2017 - [info] Master failover to 10.99.73.10(10.99.73.10:3306) completed successfully. Mon Feb 20 15:22:40 2017 - [info] Deleted server1 entry from /etc/masterha/app1.cnf . Mon Feb 20 15:22:40 2017 - [info] ----- Failover Report ----- app1: MySQL Master failover 10.99.73.9(10.99.73.9:3306) to 10.99.73.10(10.99.73.10:3306) succeeded Master 10.99.73.9(10.99.73.9:3306) is down! Check MHA Manager logs at mha:/var/log/masterha/app1/manager.log for details. Started automated(non-interactive) failover. The latest slave 10.99.73.11(10.99.73.11:3306) has all relay logs for recovery. Selected 10.99.73.10(10.99.73.10:3306) as a new master. 10.99.73.10(10.99.73.10:3306): OK: Applying all logs succeeded. 10.99.73.11(10.99.73.11:3306): This host has the latest relay log events. Generating relay diff files from the latest slave succeeded. 10.99.73.11(10.99.73.11:3306): OK: Applying all logs succeeded. Slave started, replicating from 10.99.73.10(10.99.73.10:3306) 10.99.73.10(10.99.73.10:3306): Resetting slave info succeeded. Master failover to 10.99.73.10(10.99.73.10:3306) completed successfully. Mon Feb 20 15:22:40 2017 - [info] Sending mail.. |
看到最后如果出现”Master failover to 10.99.73.10(10.99.73.10:3306) completed successfully.”,说明备选master现在已经上位了。并且可以看到两个slave中10.99.73.10成为了new master,而10.99.73.11成为了10.99.73.10的slave。这主要是因为我们刻意把10.99.73.11设置为了no_master,就是不参与选举。如果没有设置no_master的话,那么MHA在进行选择时会根据数据最接近于master的slave。
从上面的输出可以看出整个MHA的切换过程,共包括以下的步骤:
1. 配置文件检查阶段,这个阶段会检查整个集群配置文件配置,找出master和slave。
2. 宕机的master处理,这个阶段包括虚拟ip摘除操作,主机关机操作(可选)。
3. 找出含有最新数据的slave。
4. 复制dead master和最新slave相差的relay log,并保存到MHA Manger具体的目录下。
5. 找出并提升一个slave为新的master。
6. 应用从master保存的二进制日志事件(binlog events)到新的master。
7. 使其他的slave连接新的master进行复制。
6. 验证new master(10.99.73.10)
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 |
mysql> select count(*) from sbtest.sbtest; +----------+ | count(*) | +----------+ | 1000000 | +----------+ 1 row in set (0.24 sec) mysql> show master status; +------------------+----------+--------------+------------------+-------------------+ | File | Position | Binlog_Do_DB | Binlog_Ignore_DB | Executed_Gtid_Set | +------------------+----------+--------------+------------------+-------------------+ | mysql-bin.000002 | 62292198 | | | | +------------------+----------+--------------+------------------+-------------------+ 1 row in set (0.00 sec) mysql> show slave hosts; +-----------+-------------+------+-----------+--------------------------------------+ | Server_id | Host | Port | Master_id | Slave_UUID | +-----------+-------------+------+-----------+--------------------------------------+ | 103309 | 10.99.73.11 | 3306 | 2000 | c9c57719-edbf-11e6-8f8c-fa163ece854a | +-----------+-------------+------+-----------+--------------------------------------+ 1 row in set (0.00 sec) |
由此结果可以看出new master跟old master的数据一致,说明mha是根据old master来补齐new master的差异数据的。
7. 切换发送邮件
最后补充一下邮件发送脚本send_report ,如下:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 |
#!/usr/bin/perl use strict; use warnings FATAL => 'all'; use Mail::Sender; use Getopt::Long; #new_master_host and new_slave_hosts are set only when recovering master succeeded my ( $dead_master_host, $new_master_host, $new_slave_hosts, $subject, $body ); my $smtp='smtp.163.com'; my $mail_from='xxxx'; my $mail_user='xxxxx'; my $mail_pass='xxxxx'; my $mail_to=['xxxx','xxxx']; GetOptions( 'orig_master_host=s' => \$dead_master_host, 'new_master_host=s' => \$new_master_host, 'new_slave_hosts=s' => \$new_slave_hosts, 'subject=s' => \$subject, 'body=s' => \$body, ); mailToContacts($smtp,$mail_from,$mail_user,$mail_pass,$mail_to,$subject,$body); sub mailToContacts { my ( $smtp, $mail_from, $user, $passwd, $mail_to, $subject, $msg ) = @_; open my $DEBUG, "> /tmp/monitormail.log" or die "Can't open the debug file:$!\n"; my $sender = new Mail::Sender { ctype => 'text/plain; charset=utf-8', encoding => 'utf-8', smtp => $smtp, from => $mail_from, auth => 'LOGIN', TLS_allowed => '0', authid => $user, authpwd => $passwd, to => $mail_to, subject => $subject, debug => $DEBUG }; $sender->MailMsg( { msg => $msg, debug => $DEBUG } ) or print $Mail::Sender::Error; return 1; } # Do whatever you want here exit 0; |
切换后报警信息如下:
三、手动Failover(MHA Manager必须没有运行)
首先搞一个干净的MySQL复制集群加mha监控(在mha监控端不需要开启mha manager)。手动failover,这种场景意味着在业务上没有启用MHA自动切换功能,当主服务器故障时,人工手动调用MHA来进行故障切换操作。注意:如果MHA manager检测到没有dead的server,将报错,并结束failover,如下报错信息:
1 2 3 4 5 6 |
Thu Feb 16 18:18:07 2017 - [info] * Phase 1: Configuration Check Phase.. Thu Feb 16 18:18:07 2017 - [info] Thu Feb 16 18:18:07 2017 - [info] GTID failover mode = 1 Thu Feb 16 18:18:07 2017 - [info] Dead Servers: Thu Feb 16 18:18:07 2017 - [error][/usr/local/share/perl5/MHA/MasterFailover.pm, ln187] None of server is dead. Stop failover. Thu Feb 16 18:18:07 2017 - [error][/usr/local/share/perl5/MHA/ManagerUtil.pm, ln177] Got ERROR: at /usr/local/bin/masterha_master_switch line 53. |
进行手动切换命令如下,但是在切换之前我们需要模拟一下slave延迟,然后让切换脚本自动补全relay log。
在主库(10.99.73.9)上进行sysbench数据生成,在sbtest库下生成sbtest表,共100W记录。
1 2 3 4 5 6 7 8 9 10 11 12 13 14 |
$ sysbench /usr/share/sysbench/oltp_read_only.lua \ --mysql-host=127.0.0.1 \ --mysql-port=3306 \ --mysql-user=mha \ --mysql-password=123456 \ --mysql-socket=/data/mysql/3306/mysql.sock \ --mysql-db=sbtest \ --db-driver=mysql \ --tables=1 \ --table-size=1000000 \ --report-interval=10 \ --threads=128 \ --time=120 \ prepare |
主库写入一些数据后,就可以关闭10.99.73.10的io_thread,保持比主库数据量小。
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 |
mysql> stop slave io_thread; Query OK, 0 rows affected (0.09 sec) mysql> select count(*) from sbtest.sbtest; +----------+ | count(*) | +----------+ | 640000 | +----------+ 1 row in set (0.21 sec) mysql> show master status; +------------------+----------+--------------+------------------+-------------------+ | File | Position | Binlog_Do_DB | Binlog_Ignore_DB | Executed_Gtid_Set | +------------------+----------+--------------+------------------+-------------------+ | mysql-bin.000001 | 39867231 | | | | +------------------+----------+--------------+------------------+-------------------+ 1 row in set (0.00 sec) |
然后就可以关闭10.99.73.9的io_thread,但是要保证10.99.73.9主机的数据比10.99.73.10数据多,但是比10.99.73.9的数据小。
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 |
mysql> stop slave io_thread; Query OK, 0 rows affected (0.00 sec) mysql> select count(*) from sbtest.sbtest; +----------+ | count(*) | +----------+ | 700000 | +----------+ 1 row in set (0.19 sec) mysql> show master status; +------------------+----------+--------------+------------------+-------------------+ | File | Position | Binlog_Do_DB | Binlog_Ignore_DB | Executed_Gtid_Set | +------------------+----------+--------------+------------------+-------------------+ | mysql-bin.000001 | 43604703 | | | | +------------------+----------+--------------+------------------+-------------------+ 1 row in set (0.00 sec) |
然后等主库压测完事之后查看一下主库状态就关闭mysql进程。
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 |
$ mysql -S /data/mysql/3306/mysql.sock mysql> select count(*) from sbtest.sbtest; +----------+ | count(*) | +----------+ | 1000000 | +----------+ 1 row in set (0.23 sec) mysql> show master status; +------------------+-----------+--------------+------------------+-------------------+ | File | Position | Binlog_Do_DB | Binlog_Ignore_DB | Executed_Gtid_Set | +------------------+-----------+--------------+------------------+-------------------+ | mysql-bin.000001 | 123299163 | | | | +------------------+-----------+--------------+------------------+-------------------+ 1 row in set (0.00 sec) |
1 |
$ mysqladmin -S /data/mysql/3306/mysql.sock shutdown |
然后就可以在mha主机上进行手动切换了,由于要指定特定的slave为候选master,而此slave还落后非常多,可以在每组服务器上都加上check_repl_delay=0表示忽略复制延迟,不然整个恢复过程会加长。如下配置:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 |
....... [server1] candidate_master=1 hostname=10.99.73.9 port=3306 check_repl_delay=0 [server2] candidate_master=1 hostname=10.99.73.10 port=3306 check_repl_delay=0 [server3] hostname=10.99.73.11 no_master=1 port=3306 check_repl_delay=0 |
手动切换:
1 |
$ masterha_master_switch --master_state=dead --conf=/etc/masterha/app1.cnf --dead_master_host=10.99.73.9 --dead_master_port=3306 --interactive=1 --new_master_host=10.99.73.10 |
会输出整个切换过程以及会询问你是否进行切换:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 |
--dead_master_ip=<dead_master_ip> is not set. Using 10.99.73.9. Mon Feb 20 15:52:49 2017 - [warning] Global configuration file /etc/masterha_default.cnf not found. Skipping. Mon Feb 20 15:52:49 2017 - [info] Reading application default configuration from /etc/masterha/app1.cnf.. Mon Feb 20 15:52:49 2017 - [info] Reading server configuration from /etc/masterha/app1.cnf.. Mon Feb 20 15:52:49 2017 - [info] MHA::MasterFailover version 0.57. Mon Feb 20 15:52:49 2017 - [info] Starting master failover. Mon Feb 20 15:52:49 2017 - [info] Mon Feb 20 15:52:49 2017 - [info] * Phase 1: Configuration Check Phase.. Mon Feb 20 15:52:49 2017 - [info] Mon Feb 20 15:52:49 2017 - [info] GTID failover mode = 0 Mon Feb 20 15:52:49 2017 - [info] Dead Servers: Mon Feb 20 15:52:49 2017 - [info] 10.99.73.9(10.99.73.9:3306) Mon Feb 20 15:52:49 2017 - [info] Checking master reachability via MySQL(double check)... Mon Feb 20 15:52:49 2017 - [info] ok. Mon Feb 20 15:52:49 2017 - [info] Alive Servers: Mon Feb 20 15:52:49 2017 - [info] 10.99.73.10(10.99.73.10:3306) Mon Feb 20 15:52:49 2017 - [info] 10.99.73.11(10.99.73.11:3306) Mon Feb 20 15:52:49 2017 - [info] Alive Slaves: Mon Feb 20 15:52:49 2017 - [info] 10.99.73.10(10.99.73.10:3306) Version=5.7.17-log (oldest major version between slaves) log-bin:enabled Mon Feb 20 15:52:49 2017 - [info] Replicating from 10.99.73.9(10.99.73.9:3306) Mon Feb 20 15:52:49 2017 - [info] Primary candidate for the new Master (candidate_master is set) Mon Feb 20 15:52:49 2017 - [info] 10.99.73.11(10.99.73.11:3306) Version=5.7.17-log (oldest major version between slaves) log-bin:enabled Mon Feb 20 15:52:49 2017 - [info] Replicating from 10.99.73.9(10.99.73.9:3306) Mon Feb 20 15:52:49 2017 - [info] Not candidate for the new Master (no_master is set) Master 10.99.73.9(10.99.73.9:3306) is dead. Proceed? (yes/NO): yes Mon Feb 20 15:52:50 2017 - [info] Starting Non-GTID based failover. Mon Feb 20 15:52:50 2017 - [info] Mon Feb 20 15:52:50 2017 - [info] ** Phase 1: Configuration Check Phase completed. Mon Feb 20 15:52:50 2017 - [info] Mon Feb 20 15:52:50 2017 - [info] * Phase 2: Dead Master Shutdown Phase.. Mon Feb 20 15:52:50 2017 - [info] Mon Feb 20 15:52:50 2017 - [info] HealthCheck: SSH to 10.99.73.9 is reachable. Mon Feb 20 15:52:51 2017 - [info] Forcing shutdown so that applications never connect to the current master.. Mon Feb 20 15:52:51 2017 - [warning] master_ip_failover_script is not set. Skipping invalidating dead master IP address. Mon Feb 20 15:52:51 2017 - [warning] shutdown_script is not set. Skipping explicit shutting down of the dead master. Mon Feb 20 15:52:51 2017 - [info] * Phase 2: Dead Master Shutdown Phase completed. Mon Feb 20 15:52:51 2017 - [info] Mon Feb 20 15:52:51 2017 - [info] * Phase 3: Master Recovery Phase.. Mon Feb 20 15:52:51 2017 - [info] Mon Feb 20 15:52:51 2017 - [info] * Phase 3.1: Getting Latest Slaves Phase.. Mon Feb 20 15:52:51 2017 - [info] Mon Feb 20 15:52:51 2017 - [info] The latest binary log file/position on all slaves is mysql-bin.000001:123299163 Mon Feb 20 15:52:51 2017 - [info] Latest slaves (Slaves that received relay log files to the latest): Mon Feb 20 15:52:51 2017 - [info] 10.99.73.10(10.99.73.10:3306) Version=5.7.17-log (oldest major version between slaves) log-bin:enabled Mon Feb 20 15:52:51 2017 - [info] Replicating from 10.99.73.9(10.99.73.9:3306) Mon Feb 20 15:52:51 2017 - [info] Primary candidate for the new Master (candidate_master is set) Mon Feb 20 15:52:51 2017 - [info] The oldest binary log file/position on all slaves is mysql-bin.000001:86309812 Mon Feb 20 15:52:51 2017 - [info] Oldest slaves: Mon Feb 20 15:52:51 2017 - [info] 10.99.73.11(10.99.73.11:3306) Version=5.7.17-log (oldest major version between slaves) log-bin:enabled Mon Feb 20 15:52:51 2017 - [info] Replicating from 10.99.73.9(10.99.73.9:3306) Mon Feb 20 15:52:51 2017 - [info] Not candidate for the new Master (no_master is set) Mon Feb 20 15:52:51 2017 - [info] Mon Feb 20 15:52:51 2017 - [info] * Phase 3.2: Saving Dead Master's Binlog Phase.. Mon Feb 20 15:52:51 2017 - [info] Mon Feb 20 15:52:51 2017 - [info] Fetching dead master's binary logs.. Mon Feb 20 15:52:51 2017 - [info] Executing command on the dead master 10.99.73.9(10.99.73.9:3306): save_binary_logs --command=save --start_file=mysql-bin.000001 --start_pos=123299163 --binlog_dir=/data/mysql/log/binlog --output_file=/data/mysql/mha/saved_master_binlog_from_10.99.73.9_3306_20170220155249.binlog --handle_raw_binlog=1 --disable_log_bin=0 --manager_version=0.57 Creating /data/mysql/mha if not exists.. ok. Concat binary/relay logs from mysql-bin.000001 pos 123299163 to mysql-bin.000001 EOF into /data/mysql/mha/saved_master_binlog_from_10.99.73.9_3306_20170220155249.binlog .. Binlog Checksum enabled Dumping binlog format description event, from position 0 to 154.. ok. Dumping effective binlog data from /data/mysql/log/binlog/mysql-bin.000001 position 123299163 to tail(123299186).. ok. Binlog Checksum enabled Concat succeeded. saved_master_binlog_from_10.99.73.9_3306_20170220155249.binlog 100% 177 0.2KB/s 00:00 Mon Feb 20 15:52:51 2017 - [info] scp from root@10.99.73.9:/data/mysql/mha/saved_master_binlog_from_10.99.73.9_3306_20170220155249.binlog to local:/var/log/masterha/app1/saved_master_binlog_from_10.99.73.9_3306_20170220155249.binlog succeeded. Mon Feb 20 15:52:51 2017 - [info] HealthCheck: SSH to 10.99.73.10 is reachable. Mon Feb 20 15:52:52 2017 - [info] HealthCheck: SSH to 10.99.73.11 is reachable. Mon Feb 20 15:52:52 2017 - [info] Mon Feb 20 15:52:52 2017 - [info] * Phase 3.3: Determining New Master Phase.. Mon Feb 20 15:52:52 2017 - [info] Mon Feb 20 15:52:52 2017 - [info] Finding the latest slave that has all relay logs for recovering other slaves.. Mon Feb 20 15:52:52 2017 - [info] Checking whether 10.99.73.10 has relay logs from the oldest position.. Mon Feb 20 15:52:52 2017 - [info] Executing command: apply_diff_relay_logs --command=find --latest_mlf=mysql-bin.000001 --latest_rmlp=123299163 --target_mlf=mysql-bin.000001 --target_rmlp=86309812 --server_id=2000 --workdir=/data/mysql/mha --timestamp=20170220155249 --manager_version=0.57 --relay_dir=/data/mysql/3306/log/relaylog --current_relay_log=relay-log.000004 : Relay log found at /data/mysql/3306/log/relaylog, up to relay-log.000004 Fast relay log position search succeeded. Target relay log file/position found. start_file:relay-log.000004, start_pos:7398357. Target relay log FOUND! Mon Feb 20 15:52:52 2017 - [info] OK. 10.99.73.10 has all relay logs. Mon Feb 20 15:52:52 2017 - [info] 10.99.73.10 can be new master. Mon Feb 20 15:52:52 2017 - [info] New master is 10.99.73.10(10.99.73.10:3306) Mon Feb 20 15:52:52 2017 - [info] Starting master failover.. Mon Feb 20 15:52:52 2017 - [info] From: 10.99.73.9(10.99.73.9:3306) (current master) +--10.99.73.10(10.99.73.10:3306) +--10.99.73.11(10.99.73.11:3306) To: 10.99.73.10(10.99.73.10:3306) (new master) +--10.99.73.11(10.99.73.11:3306) Starting master switch from 10.99.73.9(10.99.73.9:3306) to 10.99.73.10(10.99.73.10:3306)? (yes/NO): yes Mon Feb 20 15:52:55 2017 - [info] New master decided manually is 10.99.73.10(10.99.73.10:3306) Mon Feb 20 15:52:55 2017 - [info] Mon Feb 20 15:52:55 2017 - [info] * Phase 3.3: New Master Diff Log Generation Phase.. Mon Feb 20 15:52:55 2017 - [info] Mon Feb 20 15:52:55 2017 - [info] This server has all relay logs. No need to generate diff files from the latest slave. Mon Feb 20 15:52:55 2017 - [info] Sending binlog.. saved_master_binlog_from_10.99.73.9_3306_20170220155249.binlog 100% 177 0.2KB/s 00:00 Mon Feb 20 15:52:55 2017 - [info] scp from local:/var/log/masterha/app1/saved_master_binlog_from_10.99.73.9_3306_20170220155249.binlog to root@10.99.73.10:/data/mysql/mha/saved_master_binlog_from_10.99.73.9_3306_20170220155249.binlog succeeded. Mon Feb 20 15:52:55 2017 - [info] Mon Feb 20 15:52:55 2017 - [info] * Phase 3.4: Master Log Apply Phase.. Mon Feb 20 15:52:55 2017 - [info] Mon Feb 20 15:52:55 2017 - [info] *NOTICE: If any error happens from this phase, manual recovery is needed. Mon Feb 20 15:52:55 2017 - [info] Starting recovery on 10.99.73.10(10.99.73.10:3306).. Mon Feb 20 15:52:55 2017 - [info] Generating diffs succeeded. Mon Feb 20 15:52:55 2017 - [info] Waiting until all relay logs are applied. Mon Feb 20 15:52:55 2017 - [info] done. Mon Feb 20 15:52:55 2017 - [info] Getting slave status.. Mon Feb 20 15:52:55 2017 - [info] This slave(10.99.73.10)'s Exec_Master_Log_Pos equals to Read_Master_Log_Pos(mysql-bin.000001:123299163). No need to recover from Exec_Master_Log_Pos. Mon Feb 20 15:52:55 2017 - [info] Connecting to the target slave host 10.99.73.10, running recover script.. Mon Feb 20 15:52:55 2017 - [info] Executing command: apply_diff_relay_logs --command=apply --slave_user='mha' --slave_host=10.99.73.10 --slave_ip=10.99.73.10 --slave_port=3306 --apply_files=/data/mysql/mha/saved_master_binlog_from_10.99.73.9_3306_20170220155249.binlog --workdir=/data/mysql/mha --target_version=5.7.17-log --timestamp=20170220155249 --handle_raw_binlog=1 --disable_log_bin=0 --manager_version=0.57 --slave_pass=xxx Mon Feb 20 15:52:55 2017 - [info] MySQL client version is 5.7.17. Using --binary-mode. Applying differential binary/relay log files /data/mysql/mha/saved_master_binlog_from_10.99.73.9_3306_20170220155249.binlog on 10.99.73.10:3306. This may take long time... Applying log files succeeded. Mon Feb 20 15:52:55 2017 - [info] All relay logs were successfully applied. Mon Feb 20 15:52:55 2017 - [info] Getting new master's binlog name and position.. Mon Feb 20 15:52:55 2017 - [info] mysql-bin.000002:22424986 Mon Feb 20 15:52:55 2017 - [info] All other slaves should start replication from here. Statement should be: CHANGE MASTER TO MASTER_HOST='10.99.73.10', MASTER_PORT=3306, MASTER_LOG_FILE='mysql-bin.000002', MASTER_LOG_POS=22424986, MASTER_USER='mysql_slave', MASTER_PASSWORD='xxx'; Mon Feb 20 15:52:55 2017 - [warning] master_ip_failover_script is not set. Skipping taking over new master IP address. Mon Feb 20 15:52:55 2017 - [info] ** Finished master recovery successfully. Mon Feb 20 15:52:55 2017 - [info] * Phase 3: Master Recovery Phase completed. Mon Feb 20 15:52:55 2017 - [info] Mon Feb 20 15:52:55 2017 - [info] * Phase 4: Slaves Recovery Phase.. Mon Feb 20 15:52:55 2017 - [info] Mon Feb 20 15:52:55 2017 - [info] * Phase 4.1: Starting Parallel Slave Diff Log Generation Phase.. Mon Feb 20 15:52:55 2017 - [info] Mon Feb 20 15:52:55 2017 - [info] -- Slave diff file generation on host 10.99.73.11(10.99.73.11:3306) started, pid: 10463. Check tmp log /var/log/masterha/app1/10.99.73.11_3306_20170220155249.log if it takes time.. Mon Feb 20 15:52:57 2017 - [info] Mon Feb 20 15:52:57 2017 - [info] Log messages from 10.99.73.11 ... Mon Feb 20 15:52:57 2017 - [info] Mon Feb 20 15:52:55 2017 - [info] Server 10.99.73.11 received relay logs up to: mysql-bin.000001:86309812 Mon Feb 20 15:52:55 2017 - [info] Need to get diffs from the latest slave(10.99.73.10) up to: mysql-bin.000001:123299163 (using the latest slave's relay logs) Mon Feb 20 15:52:56 2017 - [info] Connecting to the latest slave host 10.99.73.10, generating diff relay log files.. Mon Feb 20 15:52:56 2017 - [info] Executing command: apply_diff_relay_logs --command=generate_and_send --scp_user=root --scp_host=10.99.73.11 --latest_mlf=mysql-bin.000001 --latest_rmlp=123299163 --target_mlf=mysql-bin.000001 --target_rmlp=86309812 --server_id=2000 --diff_file_readtolatest=/data/mysql/mha/relay_from_read_to_latest_10.99.73.11_3306_20170220155249.binlog --workdir=/data/mysql/mha --timestamp=20170220155249 --handle_raw_binlog=1 --disable_log_bin=0 --manager_version=0.57 --relay_dir=/data/mysql/3306/log/relaylog --current_relay_log=relay-log.000004 Mon Feb 20 15:52:57 2017 - [info] Relay log found at /data/mysql/3306/log/relaylog, up to relay-log.000004 Fast relay log position search succeeded. Target relay log file/position found. start_file:relay-log.000004, start_pos:7398357. Concat binary/relay logs from relay-log.000004 pos 7398357 to relay-log.000004 EOF into /data/mysql/mha/relay_from_read_to_latest_10.99.73.11_3306_20170220155249.binlog .. Binlog Checksum enabled Binlog Checksum enabled Dumping binlog format description event, from position 0 to 320.. ok. Dumping effective binlog data from /data/mysql/3306/log/relaylog/relay-log.000004 position 7398357 to tail(44387708).. ok. Binlog Checksum enabled Binlog Checksum enabled Concat succeeded. Generating diff relay log succeeded. Saved at /data/mysql/mha/relay_from_read_to_latest_10.99.73.11_3306_20170220155249.binlog . scp mysql02:/data/mysql/mha/relay_from_read_to_latest_10.99.73.11_3306_20170220155249.binlog to root@10.99.73.11(22) succeeded. Mon Feb 20 15:52:57 2017 - [info] Generating diff files succeeded. Mon Feb 20 15:52:57 2017 - [info] End of log messages from 10.99.73.11. Mon Feb 20 15:52:57 2017 - [info] -- Slave diff log generation on host 10.99.73.11(10.99.73.11:3306) succeeded. Mon Feb 20 15:52:57 2017 - [info] Generating relay diff files from the latest slave succeeded. Mon Feb 20 15:52:57 2017 - [info] Mon Feb 20 15:52:57 2017 - [info] * Phase 4.2: Starting Parallel Slave Log Apply Phase.. Mon Feb 20 15:52:57 2017 - [info] Mon Feb 20 15:52:57 2017 - [info] -- Slave recovery on host 10.99.73.11(10.99.73.11:3306) started, pid: 10469. Check tmp log /var/log/masterha/app1/10.99.73.11_3306_20170220155249.log if it takes time.. saved_master_binlog_from_10.99.73.9_3306_20170220155249.binlog 100% 177 0.2KB/s 00:00 Mon Feb 20 15:53:01 2017 - [info] Mon Feb 20 15:53:01 2017 - [info] Log messages from 10.99.73.11 ... Mon Feb 20 15:53:01 2017 - [info] Mon Feb 20 15:52:57 2017 - [info] Sending binlog.. Mon Feb 20 15:52:57 2017 - [info] scp from local:/var/log/masterha/app1/saved_master_binlog_from_10.99.73.9_3306_20170220155249.binlog to root@10.99.73.11:/data/mysql/mha/saved_master_binlog_from_10.99.73.9_3306_20170220155249.binlog succeeded. Mon Feb 20 15:52:57 2017 - [info] Starting recovery on 10.99.73.11(10.99.73.11:3306).. Mon Feb 20 15:52:57 2017 - [info] Generating diffs succeeded. Mon Feb 20 15:52:57 2017 - [info] Waiting until all relay logs are applied. Mon Feb 20 15:52:57 2017 - [info] done. Mon Feb 20 15:52:57 2017 - [info] Getting slave status.. Mon Feb 20 15:52:57 2017 - [info] This slave(10.99.73.11)'s Exec_Master_Log_Pos(mysql-bin.000001:86309673) does not equal to Read_Master_Log_Pos(mysql-bin.000001:86309812). It is likely that relay log was cut during transaction. Need to recover from Exec_Master_Log_Pos. Mon Feb 20 15:52:57 2017 - [info] Saving local relay logs from exec pos to read pos on 10.99.73.11: from relay-log.000002:86309886 to the end of the relay log.. Mon Feb 20 15:52:57 2017 - [info] Executing command : save_binary_logs --command=save --start_file=relay-log.000002 --start_pos=86309886 --output_file=/data/mysql/mha/relay_from_exec_to_read_10.99.73.11_3306_20170220155249.binlog --handle_raw_binlog=1 --disable_log_bin=0 --manager_version=0.57 --binlog_dir=/data/mysql/3306/log/relaylog Mon Feb 20 15:52:57 2017 - [info] Creating /data/mysql/mha if not exists.. ok. Concat binary/relay logs from relay-log.000002 pos 86309886 to relay-log.000002 EOF into /data/mysql/mha/relay_from_exec_to_read_10.99.73.11_3306_20170220155249.binlog .. Binlog Checksum enabled Binlog Checksum enabled Dumping binlog format description event, from position 0 to 367.. ok. Dumping effective binlog data from /data/mysql/3306/log/relaylog/relay-log.000002 position 86309886 to tail(86310025).. ok. Binlog Checksum enabled Binlog Checksum enabled Concat succeeded. Mon Feb 20 15:52:57 2017 - [info] Connecting to the target slave host 10.99.73.11, running recover script.. Mon Feb 20 15:52:57 2017 - [info] Executing command: apply_diff_relay_logs --command=apply --slave_user='mha' --slave_host=10.99.73.11 --slave_ip=10.99.73.11 --slave_port=3306 --apply_files=/data/mysql/mha/relay_from_exec_to_read_10.99.73.11_3306_20170220155249.binlog,/data/mysql/mha/relay_from_read_to_latest_10.99.73.11_3306_20170220155249.binlog, /data/mysql/mha/saved_master_binlog_from_10.99.73.9_3306_20170220155249.binlog --workdir=/data/mysql/mha --target_version=5.7.17-log --timestamp=20170220155249 --handle_raw_binlog=1 --disable_log_bin=0 --manager_version=0.57 --slave_pass=xxx Mon Feb 20 15:53:01 2017 - [info] Concat all apply files to /data/mysql/mha/total_binlog_for_10.99.73.11_3306.20170220155249.binlog .. Copying the first binlog file /data/mysql/mha/relay_from_exec_to_read_10.99.73.11_3306_20170220155249.binlog to /data/mysql/mha/total_binlog_for_10.99.73.11_3306.20170220155249.binlog.. ok. Dumping binlog head events (rotate events), skipping format description events from /data/mysql/mha/relay_from_read_to_latest_10.99.73.11_3306_20170220155249.binlog.. Binlog Checksum enabled Binlog Checksum enabled dumped up to pos 320. ok. /data/mysql/mha/relay_from_read_to_latest_10.99.73.11_3306_20170220155249.binlog has effective binlog events from pos 320. Dumping effective binlog data from /data/mysql/mha/relay_from_read_to_latest_10.99.73.11_3306_20170220155249.binlog position 320 to tail(36989671).. ok. Dumping binlog head events (rotate events), skipping format description events from /data/mysql/mha/saved_master_binlog_from_10.99.73.9_3306_20170220155249.binlog.. Binlog Checksum enabled dumped up to pos 154. ok. /data/mysql/mha/saved_master_binlog_from_10.99.73.9_3306_20170220155249.binlog has effective binlog events from pos 154. Dumping effective binlog data from /data/mysql/mha/saved_master_binlog_from_10.99.73.9_3306_20170220155249.binlog position 154 to tail(177).. ok. Concat succeeded. All apply target binary logs are concatinated at /data/mysql/mha/total_binlog_for_10.99.73.11_3306.20170220155249.binlog . MySQL client version is 5.7.17. Using --binary-mode. Applying differential binary/relay log files /data/mysql/mha/relay_from_exec_to_read_10.99.73.11_3306_20170220155249.binlog,/data/mysql/mha/relay_from_read_to_latest_10.99.73.11_3306_20170220155249.binlog,/data/mysql/mha/saved_master_binlog_from_10.99.73.9_3306_20170220155249.binlog on 10.99.73.11:3306. This may take long time... Applying log files succeeded. Mon Feb 20 15:53:01 2017 - [info] All relay logs were successfully applied. Mon Feb 20 15:53:01 2017 - [info] Resetting slave 10.99.73.11(10.99.73.11:3306) and starting replication from the new master 10.99.73.10(10.99.73.10:3306).. Mon Feb 20 15:53:01 2017 - [info] Executed CHANGE MASTER. Mon Feb 20 15:53:01 2017 - [info] Slave started. Mon Feb 20 15:53:01 2017 - [info] End of log messages from 10.99.73.11. Mon Feb 20 15:53:01 2017 - [info] -- Slave recovery on host 10.99.73.11(10.99.73.11:3306) succeeded. Mon Feb 20 15:53:01 2017 - [info] All new slave servers recovered successfully. Mon Feb 20 15:53:01 2017 - [info] Mon Feb 20 15:53:01 2017 - [info] * Phase 5: New master cleanup phase.. Mon Feb 20 15:53:01 2017 - [info] Mon Feb 20 15:53:01 2017 - [info] Resetting slave info on the new master.. Mon Feb 20 15:53:01 2017 - [info] 10.99.73.10: Resetting slave info succeeded. Mon Feb 20 15:53:01 2017 - [info] Master failover to 10.99.73.10(10.99.73.10:3306) completed successfully. Mon Feb 20 15:53:01 2017 - [info] ----- Failover Report ----- app1: MySQL Master failover 10.99.73.9(10.99.73.9:3306) to 10.99.73.10(10.99.73.10:3306) succeeded Master 10.99.73.9(10.99.73.9:3306) is down! Check MHA Manager logs at mha for details. Started manual(interactive) failover. The latest slave 10.99.73.10(10.99.73.10:3306) has all relay logs for recovery. Selected 10.99.73.10(10.99.73.10:3306) as a new master. 10.99.73.10(10.99.73.10:3306): OK: Applying all logs succeeded. 10.99.73.11(10.99.73.11:3306): Generating differential relay logs up to 10.99.73.10(10.99.73.10:3306)succeeded. Generating relay diff files from the latest slave succeeded. 10.99.73.11(10.99.73.11:3306): OK: Applying all logs succeeded. Slave started, replicating from 10.99.73.10(10.99.73.10:3306) 10.99.73.10(10.99.73.10:3306): Resetting slave info succeeded. Master failover to 10.99.73.10(10.99.73.10:3306) completed successfully. Mon Feb 20 15:53:01 2017 - [info] Sending mail.. |
上述模拟了master宕机的情况下手动把10.99.73.10提升为主库的操作过程。
查看一下10.99.73.10主机状态。
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 |
mysql> select count(*) from sbtest.sbtest; +----------+ | count(*) | +----------+ | 1000000 | +----------+ 1 row in set (0.28 sec) mysql> show master status; +------------------+----------+--------------+------------------+-------------------+ | File | Position | Binlog_Do_DB | Binlog_Ignore_DB | Executed_Gtid_Set | +------------------+----------+--------------+------------------+-------------------+ | mysql-bin.000002 | 22424986 | | | | +------------------+----------+--------------+------------------+-------------------+ 1 row in set (0.00 sec) mysql> show slave hosts; +-----------+-------------+------+-----------+--------------------------------------+ | Server_id | Host | Port | Master_id | Slave_UUID | +-----------+-------------+------+-----------+--------------------------------------+ | 103309 | 10.99.73.11 | 3306 | 2000 | c9c57719-edbf-11e6-8f8c-fa163ece854a | +-----------+-------------+------+-----------+--------------------------------------+ 1 row in set (0.00 sec) |
可以看到10.99.73.10切换为主后,自动补全了old master所有的差异日志。
masterha_master_switch命令参数介绍
--master_state=dead
参数取值:dead or alive。如果设置为alive,就是在线master切换了,这样的场景,master必须是活着的。
--dead_master_host=(hostname)
Dead master的主机信息,包括--dead_master_ip
,--dead_master_port
。
--new_master_host=(hostname)
New master的主机信息,这个参数是可选项,如果你想特意指定某台机器作为new master,就设置这个参数。如果--new_master_host
没有设置,那么选举master的规则参考automated master failover( candidate_master parameter )。
--interactive=(0|1)
交互式failover,设置为1(默认)。非交互式failover,设置为0。
--skip_change_master(0.56)
只会完成日志补偿,不会进行change master和start slave。如果你想double check slave是否成功的恢复完成,那么设置该参数对你特别有用。
--skip_disable_read_only
如果设置了这个参数,那么新master将还会是只读状态。如果你想手动开启新master的写权限,那么这个参数特别有用。
--last_failover_minute=(minutes)
同masterha_manager里面的参数一样 。
--ignore_last_failover
同masterha_manager里面的参数一样 。
--wait_on_failover_error=(seconds)
同masterha_manager里面的参数一样。这个参数,只对自动或者非交互式的failover有用,如果--interactive=0
没有设置,那么这个参数将不起作用。
--remove_dead_master_conf
同masterha_manager里面的参数一样。
--wait_until_gtid_in_sync=(0|1)
这是基于GTID的参数,如果设置为1(默认): MHA会等待,直到所有slave追上新master的gtid。如果设置为0: 那么MHA不会等待slave追上新master。
--ignore_binlog_server_error
MHA忽略任何binlog server的错误。
四、在线进行切换
在许多情况下,需要将现有的主服务器迁移到另外一台服务器上。比如主服务器硬件故障,RAID控制卡需要重建,将主服务器移到性能更好的服务器上等等。维护主服务器引起性能下降, 导致停机时间至少无法写入数据。 另外, 阻塞或杀掉当前运行的会话会导致主主之间数据不一致的问题发生。 MHA 提供快速切换和优雅的阻塞写入,这个切换过程只需要0.5-2s的时间,这段时间内数据是无法写入的。在很多情况下,0.5-2的阻塞写入是可以接受的。因此切换主服务器不需要计划分配维护时间窗口。
MHA在线切换的大概过程:
1. 检查配置文件,主要检测复制设置和确定当前主服务器。
2. 选出新的主服务器。
3. 阻塞当前master的写入操作,主要包括执行 FLUSH TABLES,设置 read_only,kill 所有业务线程,关闭 VIP,最后执行 FLUSH TABLES WITH READ LOCK;一切都是为了保证新的写入无法进入,正在操作的事务回滚。
4. 等待所有从服务器复制完成,然后获取从服务器 POSITION 点(SHOW MASTER STATUS)。
5. 允许写入操作到新的主服务器,主要包括设置 read_only=OFF,RESET SLAVE ALL,设置 VIP。
6. 重新设置老的 master 为 slave,主要包括设置 relay_log_purge=0,change master,unlock tables,start slave 这几个部分。
注意,在线切换的时候应用架构需要考虑以下两个问题:
1. 自动识别master和slave的问题(master的机器可能会切换),如果采用了vip的方式,基本可以解决这个问题。
2. 负载均衡的问题,可以定义大概的读写比例,每台机器可承担的负载比例,当有机器离开集群时,需要考虑这个问题。
为了保证数据完全一致性,在最快的时间内完成切换,MHA的在线切换必须满足以下条件才会切换成功,否则会切换失败。
1. 所有slave的IO线程都在运行。
2. 所有slave的SQL线程都在运行。
3. 所有的show slave status的输出中Seconds_Behind_Master参数小于或者等于running_updates_limit秒,如果在切换过程中不指定running_updates_limit,那么默认情况下running_updates_limit为1秒。
4. 在master端,通过show processlist输出,没有一个更新花费的时间大于running_updates_limit秒。
在线切换步骤如下:
首先,停掉MHA监控:
1 |
[root@mha ~]# masterha_stop --conf=/etc/masterha/app1.cnf |
然后可以在主库(10.99.73.9)上进行sysbench数据生成,在sbtest库下生成sbtest表,共100W记录。
1 2 3 4 5 6 7 8 9 10 11 12 13 14 |
$ sysbench /usr/share/sysbench/oltp_read_only.lua \ --mysql-host=127.0.0.1 \ --mysql-port=3306 \ --mysql-user=mha \ --mysql-password=123456 \ --mysql-socket=/data/mysql/3306/mysql.sock \ --mysql-db=sbtest \ --db-driver=mysql \ --tables=1 \ --table-size=1000000 \ --report-interval=10 \ --threads=128 \ --time=120 \ prepare |
有条件的可以在压测时连接数据指定VIP地址,把socket去掉,添加--mysql-host=10.99.73.100
,这就样可以模拟应用在线切换(一边压测一边切换)。
等主库一压测完后。就可以进行在线切换操作(模拟在线切换主库操作,原主库10.99.73.9变为slave,10.99.73.10提升为新的主库)。
1 |
[root@mha ~]# masterha_master_switch --conf=/etc/masterha/app1.cnf --master_state=alive --new_master_host=10.99.73.10 --new_master_port=3306 --orig_master_is_new_slave --running_updates_limit=1 |
其中参数的意思:
--orig_master_is_new_slave
切换时加上此参数是将原master变为slave节点,如果不加此参数,原来的master将不启动。
--running_updates_limit
故障切换时,候选master如果有延迟的话,mha切换不能成功,加上此参数表示延迟在此时间范围内都可切换(单位为s),但是切换的时间长短是由recover时relay日志的大小决定。
--new_master_host & --new_master_port
选择把哪个Slave提升为New Master,如果只有一个Slave节点,此参数可不加,会自动把Slave节点提升New Master节点。
最后查看日志,了解切换过程,最后输出信息如下:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 |
Mon Feb 20 16:15:41 2017 - [info] MHA::MasterRotate version 0.57. Mon Feb 20 16:15:41 2017 - [info] Starting online master switch.. Mon Feb 20 16:15:41 2017 - [info] Mon Feb 20 16:15:41 2017 - [info] * Phase 1: Configuration Check Phase.. Mon Feb 20 16:15:41 2017 - [info] Mon Feb 20 16:15:41 2017 - [warning] Global configuration file /etc/masterha_default.cnf not found. Skipping. Mon Feb 20 16:15:41 2017 - [info] Reading application default configuration from /etc/masterha/app1.cnf.. Mon Feb 20 16:15:41 2017 - [info] Reading server configuration from /etc/masterha/app1.cnf.. Mon Feb 20 16:15:41 2017 - [info] GTID failover mode = 0 Mon Feb 20 16:15:41 2017 - [info] Current Alive Master: 10.99.73.9(10.99.73.9:3306) Mon Feb 20 16:15:41 2017 - [info] Alive Slaves: Mon Feb 20 16:15:41 2017 - [info] 10.99.73.10(10.99.73.10:3306) Version=5.7.17-log (oldest major version between slaves) log-bin:enabled Mon Feb 20 16:15:41 2017 - [info] Replicating from 10.99.73.9(10.99.73.9:3306) Mon Feb 20 16:15:41 2017 - [info] Primary candidate for the new Master (candidate_master is set) Mon Feb 20 16:15:41 2017 - [info] 10.99.73.11(10.99.73.11:3306) Version=5.7.17-log (oldest major version between slaves) log-bin:enabled Mon Feb 20 16:15:41 2017 - [info] Replicating from 10.99.73.9(10.99.73.9:3306) Mon Feb 20 16:15:41 2017 - [info] Not candidate for the new Master (no_master is set) It is better to execute FLUSH NO_WRITE_TO_BINLOG TABLES on the master before switching. Is it ok to execute on 10.99.73.9(10.99.73.9:3306)? (YES/no): yes Mon Feb 20 16:15:59 2017 - [info] Executing FLUSH NO_WRITE_TO_BINLOG TABLES. This may take long time.. Mon Feb 20 16:15:59 2017 - [info] ok. Mon Feb 20 16:15:59 2017 - [info] Checking MHA is not monitoring or doing failover.. Mon Feb 20 16:15:59 2017 - [info] Checking replication health on 10.99.73.10.. Mon Feb 20 16:15:59 2017 - [info] ok. Mon Feb 20 16:15:59 2017 - [info] Checking replication health on 10.99.73.11.. Mon Feb 20 16:15:59 2017 - [info] ok. Mon Feb 20 16:15:59 2017 - [info] 10.99.73.10 can be new master. Mon Feb 20 16:15:59 2017 - [info] From: 10.99.73.9(10.99.73.9:3306) (current master) +--10.99.73.10(10.99.73.10:3306) +--10.99.73.11(10.99.73.11:3306) To: 10.99.73.10(10.99.73.10:3306) (new master) +--10.99.73.11(10.99.73.11:3306) +--10.99.73.9(10.99.73.9:3306) Starting master switch from 10.99.73.9(10.99.73.9:3306) to 10.99.73.10(10.99.73.10:3306)? (yes/NO): yes Mon Feb 20 16:16:03 2017 - [info] Checking whether 10.99.73.10(10.99.73.10:3306) is ok for the new master.. Mon Feb 20 16:16:03 2017 - [info] ok. Mon Feb 20 16:16:03 2017 - [info] 10.99.73.9(10.99.73.9:3306): SHOW SLAVE STATUS returned empty result. To check replication filtering rules, temporarily executing CHANGE MASTER to a dummy host. Mon Feb 20 16:16:04 2017 - [info] 10.99.73.9(10.99.73.9:3306): Resetting slave pointing to the dummy host. Mon Feb 20 16:16:04 2017 - [info] ** Phase 1: Configuration Check Phase completed. Mon Feb 20 16:16:04 2017 - [info] Mon Feb 20 16:16:04 2017 - [info] * Phase 2: Rejecting updates Phase.. Mon Feb 20 16:16:04 2017 - [info] Mon Feb 20 16:16:04 2017 - [info] Executing master ip online change script to disable write on the current master: Mon Feb 20 16:16:04 2017 - [info] /usr/local/bin/master_ip_online_change --command=stop --orig_master_host=10.99.73.9 --orig_master_ip=10.99.73.9 --orig_master_port=3306 --orig_master_user='mha' --orig_master_password='123456' --new_master_host=10.99.73.10 --new_master_ip=10.99.73.10 --new_master_port=3306 --new_master_user='mha' --new_master_password='123456' --orig_master_ssh_user=root --new_master_ssh_user=root --orig_master_is_new_slave Unknown option: orig_master_password Unknown option: new_master_password Unknown option: orig_master_ssh_user Unknown option: new_master_ssh_user Unknown option: orig_master_is_new_slave Mon Feb 20 16:16:04 2017 113836 Set read_only on the new master.. ok. Mon Feb 20 16:16:04 2017 116266 Set read_only=1 on the orig master.. ok. Disabling the VIP on old master: 10.99.73.9 SIOCSIFFLAGS: Cannot assign requested address Mon Feb 20 16:16:04 2017 240935 Killing all application threads.. Mon Feb 20 16:16:04 2017 240993 done. Mon Feb 20 16:16:04 2017 - [info] ok. Mon Feb 20 16:16:04 2017 - [info] Locking all tables on the orig master to reject updates from everybody (including root): Mon Feb 20 16:16:04 2017 - [info] Executing FLUSH TABLES WITH READ LOCK.. Mon Feb 20 16:16:04 2017 - [info] ok. Mon Feb 20 16:16:04 2017 - [info] Orig master binlog:pos is mysql-bin.000001:154. Mon Feb 20 16:16:04 2017 - [info] Waiting to execute all relay logs on 10.99.73.10(10.99.73.10:3306).. Mon Feb 20 16:16:04 2017 - [info] master_pos_wait(mysql-bin.000001:154) completed on 10.99.73.10(10.99.73.10:3306). Executed 0 events. Mon Feb 20 16:16:04 2017 - [info] done. Mon Feb 20 16:16:04 2017 - [info] Getting new master's binlog name and position.. Mon Feb 20 16:16:04 2017 - [info] mysql-bin.000001:154 Mon Feb 20 16:16:04 2017 - [info] All other slaves should start replication from here. Statement should be: CHANGE MASTER TO MASTER_HOST='10.99.73.10', MASTER_PORT=3306, MASTER_LOG_FILE='mysql-bin.000001', MASTER_LOG_POS=154, MASTER_USER='mysql_slave', MASTER_PASSWORD='xxx'; Mon Feb 20 16:16:04 2017 - [info] Executing master ip online change script to allow write on the new master: Mon Feb 20 16:16:04 2017 - [info] /usr/local/bin/master_ip_online_change --command=start --orig_master_host=10.99.73.9 --orig_master_ip=10.99.73.9 --orig_master_port=3306 --orig_master_user='mha' --orig_master_password='123456' --new_master_host=10.99.73.10 --new_master_ip=10.99.73.10 --new_master_port=3306 --new_master_user='mha' --new_master_password='123456' --orig_master_ssh_user=root --new_master_ssh_user=root --orig_master_is_new_slave Unknown option: orig_master_password Unknown option: new_master_password Unknown option: orig_master_ssh_user Unknown option: new_master_ssh_user Unknown option: orig_master_is_new_slave Mon Feb 20 16:16:04 2017 358995 Set read_only=0 on the new master. Enabling the VIP - 10.99.73.100/24 on the new master - 10.99.73.10 Mon Feb 20 16:16:04 2017 - [info] ok. Mon Feb 20 16:16:04 2017 - [info] Mon Feb 20 16:16:04 2017 - [info] * Switching slaves in parallel.. Mon Feb 20 16:16:04 2017 - [info] Mon Feb 20 16:16:04 2017 - [info] -- Slave switch on host 10.99.73.11(10.99.73.11:3306) started, pid: 10883 Mon Feb 20 16:16:04 2017 - [info] Mon Feb 20 16:16:04 2017 - [info] Log messages from 10.99.73.11 ... Mon Feb 20 16:16:04 2017 - [info] Mon Feb 20 16:16:04 2017 - [info] Waiting to execute all relay logs on 10.99.73.11(10.99.73.11:3306).. Mon Feb 20 16:16:04 2017 - [info] master_pos_wait(mysql-bin.000001:154) completed on 10.99.73.11(10.99.73.11:3306). Executed 0 events. Mon Feb 20 16:16:04 2017 - [info] done. Mon Feb 20 16:16:04 2017 - [info] Resetting slave 10.99.73.11(10.99.73.11:3306) and starting replication from the new master 10.99.73.10(10.99.73.10:3306).. Mon Feb 20 16:16:04 2017 - [info] Executed CHANGE MASTER. Mon Feb 20 16:16:04 2017 - [info] Slave started. Mon Feb 20 16:16:04 2017 - [info] End of log messages from 10.99.73.11 ... Mon Feb 20 16:16:04 2017 - [info] Mon Feb 20 16:16:04 2017 - [info] -- Slave switch on host 10.99.73.11(10.99.73.11:3306) succeeded. Mon Feb 20 16:16:04 2017 - [info] Unlocking all tables on the orig master: Mon Feb 20 16:16:04 2017 - [info] Executing UNLOCK TABLES.. Mon Feb 20 16:16:04 2017 - [info] ok. Mon Feb 20 16:16:04 2017 - [info] Starting orig master as a new slave.. Mon Feb 20 16:16:04 2017 - [info] Resetting slave 10.99.73.9(10.99.73.9:3306) and starting replication from the new master 10.99.73.10(10.99.73.10:3306).. Mon Feb 20 16:16:04 2017 - [info] Executed CHANGE MASTER. Mon Feb 20 16:16:04 2017 - [info] Slave started. Mon Feb 20 16:16:04 2017 - [info] All new slave servers switched successfully. Mon Feb 20 16:16:04 2017 - [info] Mon Feb 20 16:16:04 2017 - [info] * Phase 5: New master cleanup phase.. Mon Feb 20 16:16:04 2017 - [info] Mon Feb 20 16:16:04 2017 - [info] 10.99.73.10: Resetting slave info succeeded. Mon Feb 20 16:16:04 2017 - [info] Switching master to 10.99.73.10(10.99.73.10:3306) completed successfully. |
整个切换过程基本与我们上面说的步骤一致,这里面用到了一个master_pos_wait函数,语法:select master_pos_wait(file, pos[, timeout])。一般在主库执行,作为主从切换的判断依据,判断一下返回值>=0,则认为主从同步完成。这里的file和pos对应主库show master status得到的值,代表执行位置。函数逻辑是等待当前从库达到这个位置后返回, 返回期间执行的事务个数。参数timeout可选,若缺省则无限等待,timeout<=0时与缺省的逻辑相同。若为正数,则等待这么多秒,超时函数返回-1。其他返回值,若当前slave为启动或在等待期间被终止,返回NULL;若指定的值已经在之前达到,返回0。
查看10.99.73.10主机切换为master了。
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 |
mysql> show slave hosts; +-----------+-------------+------+-----------+--------------------------------------+ | Server_id | Host | Port | Master_id | Slave_UUID | +-----------+-------------+------+-----------+--------------------------------------+ | 103309 | 10.99.73.11 | 3306 | 2000 | c9c57719-edbf-11e6-8f8c-fa163ece854a | | 10 | | 3306 | 2000 | 0f7c2151-edc0-11e6-ab9e-fa163e2ebbf0 | +-----------+-------------+------+-----------+--------------------------------------+ 2 rows in set (0.00 sec) mysql> select count(*) from sbtest.sbtest; +----------+ | count(*) | +----------+ | 1000000 | +----------+ 1 row in set (0.20 sec) |
MHA online切换主从状态必须正常,切换后old maser会变成slave,且为read_only模式。
五、二次检测测试
MHA官方提供了一个二次检测脚本,防止误切换。也就是说当MHA主机探测不到Master时,它会调用secondary_check_script脚本进行二次检测Master,如果都探测不到,才进行故障切换。
我的测试方法如下,利用iptables。先拒绝掉MHA访问Master的3306端口,然后看会不会发生切换。
1 2 3 4 |
$ iptables -A INPUT -s 10.99.73.7/32 -p tcp --dport 3306 -j REJECT $ iptables -A INPUT -s 10.99.73.7/32 -p tcp --dport 22 -j REJECT $ iptables -A INPUT -s 10.99.73.10/32 -p tcp --dport 22 -j REJECT $ iptables -A INPUT -s 10.99.73.11/32 -p tcp --dport 22 -j REJECT |
验证结果并没有发送切换,因为此时MHA会调secondary_check_script脚本进行二次检测Master,发现Master处于存活状态。
接着屏蔽二次检测指定的主机,如下:
1 2 |
$ iptables -A INPUT -s 10.99.73.10/32 -p tcp --dport 3306 -j REJECT $ iptables -A INPUT -s 10.99.73.11/32 -p tcp --dport 3306 -j REJECT |
验证结果,此时真正触发了MHA故障切换,因为满足条件,MHA确定Master已经故障。
六、修复宕机的Master
通常情况下自动切换以后,原master可能已经废弃掉,待原master主机修复后,如果数据完整的情况下,可能想把原来master重新作为新主库的slave,这时我们可以借助当时自动切换时刻的MHA日志来完成对原master的修复。下面是提取相关日志的命令:
1 2 3 |
[root@mysql02 app1]# grep -i "All other slaves should start" manager.log Mon Apr 21 22:28:33 2016 - [info] All other slaves should start replication from here. Statement should be: CHANGE MASTER TO MASTER_HOST='192.168.0.60', MASTER_PORT=3306, MASTER_LOG_FILE='mysql-bin.000022', MASTER_LOG_POS=506716, MASTER_USER='repl', MASTER_PASSWORD='xxx'; |
获取上述信息以后,就可以直接在修复后的master上执行change master to相关操作,重新作为从库了。