DRBD Failure Testing and Split-Brain Handling
2014-06-27 by dongnan
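Environment
See "Implementing MySQL High Availability with DRBD and Heartbeat".
Failure test
Simulate a failure of the primary node pn1 and check whether the backup node pn2 takes over successfully: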
sync && init 6
Test results
The vip, drbd, and mysql resources were all taken over successfully by the backup node pn2.
vip
ifconfig eth0:0
eth0:0    Link encap:Ethernet  HWaddr 00:50:56:9C:00:0D
inet addr:172.27.233.48  Bcast:172.27.233.255  Mask:255.255.255.0
UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
drbd
mount | tail -n1
/dev/drbd0 on /mysqldata type ext4 (rw)
mysql
mysqladmin ping
mysqld is alive
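The takeover checks above can be scripted. The following is a minimal sketch, not part of the original setup: `is_mounted` is a hypothetical helper that reads `mount` output on stdin and succeeds only if the DRBD device is mounted on the expected mount point. The device and mount point are the ones used in this article.

```shell
#!/bin/sh
# Hypothetical helper (not from the original article): reads `mount` output
# on stdin and succeeds only if device $1 is mounted on mount point $2.
is_mounted() {
    dev=$1
    mnt=$2
    awk -v dev="$dev" -v mnt="$mnt" \
        '$1 == dev && $2 == "on" && $3 == mnt { found = 1 }
         END { exit found ? 0 : 1 }'
}

# Example use on the node that took over:
#   mount | is_mounted /dev/drbd0 /mysqldata && echo "drbd: mounted"
#   mysqladmin ping >/dev/null 2>&1 && echo "mysql: alive"
```

Reading from stdin keeps the helper easy to try against saved `mount` output before running it on a live node.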
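Test procedure
- Test 1: shut down the primary node PN1. vip/drbd/mysqld switched over to PN2 automatically; after PN1 rebooted, its drbd role became secondary.
- Test 2: shut down the backup node PN2. vip/drbd/mysqld switched over to PN1 automatically; after PN2 rebooted, its drbd role became secondary.
- Test 3: stop the mysqld service on the primary node PN1. No failover was triggered.
- Test 4: stop the drbd service on the primary node PN1. No failover was triggered.
- Test 5: stop the heartbeat service on the primary node PN1. Resources switched over to the backup node PN2 automatically.
- Test 6: pull the power from the backup node PN2. Resources switched back to the primary node PN1 automatically.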
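DRBD split brain
When does DRBD split brain occur?
Split brain occurs when the two drbd nodes lose contact with each other and both take the Primary role, so that each one modifies its own copy of the data independently. DRBD detects the condition when the nodes reconnect.
What can lead to split brain?
- The heartbeat link fails, so heartbeat assumes the peer is dead and promotes the local DRBD to Primary. Once the heartbeat link recovers, both DRBD nodes are Primary and DRBD is in split brain.
- An operator mistakenly sets both nodes to the Primary role.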
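Node maintenance
Routine maintenance
Stop the heartbeat service; if the node is Primary, it releases its resources automatically. Start the heartbeat service again once maintenance is finished.
Maintenance of both nodes
- Shut down the Secondary node first, then the Primary node.
- After maintenance, the nodes can be started in any order; the host defined in haresources ends up as Primary.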
Test
DRBD status
# PN1 primary node
cat /proc/drbd
version: 8.4.0 (api:1/proto:86-100)
GIT-hash: 28753f559ab51b549d16bcf487fe625d5919c49c build by root@pn1, 2013-12-06 14:48:27
 0: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r-----
    ns:360 nr:80 dw:440 dr:22957 al:6 bm:4 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0
# PN2 backup node
cat /proc/drbd
version: 8.4.0 (api:1/proto:86-100)
GIT-hash: 28753f559ab51b549d16bcf487fe625d5919c49c build by root@pn2, 2013-12-06 15:08:20
 0: cs:Connected ro:Secondary/Primary ds:UpToDate/UpToDate C r-----
    ns:80 nr:408 dw:408 dr:80 al:0 bm:4 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0
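Simulating DRBD split brain
Unplug PN1's network cable to simulate a heartbeat link failure, then plug it back in:
Split-brain state
# PN1 primary node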
cat /proc/drbd
version: 8.4.0 (api:1/proto:86-100)
GIT-hash: 28753f559ab51b549d16bcf487fe625d5919c49c build by root@pn1, 2013-12-06 14:48:27
 0: cs:WFConnection ro:Primary/Unknown ds:UpToDate/DUnknown C r-----
    ns:0 nr:108 dw:348 dr:4305 al:11 bm:3 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:192
# PN2 backup node
cat /proc/drbd
version: 8.4.0 (api:1/proto:86-100)
GIT-hash: 28753f559ab51b549d16bcf487fe625d5919c49c build by root@pn2, 2013-12-06 15:08:20
 0: cs:StandAlone ro:Primary/Unknown ds:UpToDate/DUnknown   r-----
    ns:0 nr:0 dw:256 dr:4065 al:6 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:20512
Note that both the primary and the backup node are now in the Primary role.
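The status lines above can also be read by a script. This is a rough sketch under the assumption that the output has the DRBD 8.4 `/proc/drbd` layout shown in this article; `drbd_field` and `local_role` are hypothetical helper names, not DRBD tools.

```shell
#!/bin/sh
# Hypothetical helper: reads /proc/drbd-style text on stdin and prints the
# value of one field (cs, ro or ds) from the first resource status line.
drbd_field() {
    sed -n "s/.*[[:space:]]$1:\([^[:space:]]*\).*/\1/p" | head -n 1
}

# Hypothetical helper: prints only the local half of the ro: field,
# e.g. "Primary" for ro:Primary/Unknown.
local_role() {
    drbd_field ro | cut -d/ -f1
}

# Example use on a node:
#   drbd_field cs < /proc/drbd     # connection state
#   local_role < /proc/drbd        # role of this node
```

A node that is merely disconnected from a powered-off peer also shows `Primary/Unknown`, so the local role alone does not prove split brain; the role has to be Primary on both nodes, as in the output above.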
DRBD logs
PN1 primary node
Dec 18 17:09:43 pn1 kernel: block drbd0: Split-Brain detected but unresolved, dropping connection!
Dec 18 17:09:43 pn1 kernel: block drbd0: helper command: /sbin/drbdadm split-brain minor-0
Dec 18 17:09:43 pn1 kernel: block drbd0: helper command: /sbin/drbdadm split-brain minor-0 exit code 0 (0x0)
Dec 18 17:09:43 pn1 kernel: d-con mysql: conn( WFReportParams -> Disconnecting )
Dec 18 17:09:43 pn1 kernel: d-con mysql: error receiving ReportState, e: -5 l: 0!
Dec 18 17:09:43 pn1 kernel: d-con mysql: meta connection shut down by peer.
Dec 18 17:09:43 pn1 kernel: d-con mysql: asender terminated
Dec 18 17:09:43 pn1 kernel: d-con mysql: Terminating asender thread
Dec 18 17:09:43 pn1 kernel: d-con mysql: Connection closed
Dec 18 17:09:43 pn1 kernel: d-con mysql: conn( Disconnecting -> StandAlone )
Dec 18 17:09:43 pn1 kernel: d-con mysql: receiver terminated
Dec 18 17:09:43 pn1 kernel: d-con mysql: Terminating receiver thread
PN2 backup node
Dec 18 17:09:43 pn2 kernel: block drbd0: Split-Brain detected but unresolved, dropping connection!
Dec 18 17:09:43 pn2 kernel: block drbd0: helper command: /sbin/drbdadm split-brain minor-0
Dec 18 17:09:43 pn2 kernel: block drbd0: helper command: /sbin/drbdadm split-brain minor-0 exit code 0 (0x0)
Dec 18 17:09:43 pn2 kernel: d-con mysql: conn( WFReportParams -> Disconnecting )
Dec 18 17:09:43 pn2 kernel: d-con mysql: error receiving ReportState, e: -5 l: 0!
Dec 18 17:09:43 pn2 kernel: d-con mysql: asender terminated
Dec 18 17:09:43 pn2 kernel: d-con mysql: Terminating asender thread
Dec 18 17:09:43 pn2 kernel: d-con mysql: Connection closed
Dec 18 17:09:43 pn2 kernel: d-con mysql: conn( Disconnecting -> StandAlone )
Dec 18 17:09:43 pn2 kernel: d-con mysql: receiver terminated
Dec 18 17:09:43 pn2 kernel: d-con mysql: Terminating receiver thread
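Log lines like the ones above can be picked out automatically. The sketch below assumes the traditional syslog line format shown in this article (month, day, time, host, `kernel:`, then the message); `split_brain_events` is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: reads kernel log lines on stdin and prints the
# timestamp, host and device of every unresolved split-brain report.
# Field positions assume the syslog format shown in this article.
split_brain_events() {
    awk '/Split-Brain detected but unresolved/ {
        dev = $7
        sub(/:$/, "", dev)      # "drbd0:" -> "drbd0"
        print $1, $2, $3, $4, dev
    }'
}

# Example use:
#   split_brain_events < /var/log/messages
```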
Resolution
Backup node
Force the backup node PN2 down to the secondary role and discard its local changes:
/etc/init.d/heartbeat stop
drbdadm secondary mysql
drbdadm connect --discard-my-data mysql
Node status
cat /proc/drbd
version: 8.4.0 (api:1/proto:86-100)
GIT-hash: 28753f559ab51b549d16bcf487fe625d5919c49c build by root@pn2, 2013-12-06 15:08:20
 0: cs:WFConnection ro:Secondary/Unknown ds:UpToDate/DUnknown C r-----
    ns:0 nr:0 dw:740 dr:4405 al:10 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:316
Primary node
Reconnect the primary node PN1:
drbdadm connect mysql
cat /proc/drbd
version: 8.4.0 (api:1/proto:86-100)
GIT-hash: 28753f559ab51b549d16bcf487fe625d5919c49c build by root@pn1, 2013-12-06 14:48:27
 0: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r-----
    ns:20576 nr:0 dw:500 dr:24973 al:12 bm:12 lo:2 pe:0 ua:0 ap:0 ep:1 wo:b oos:0
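The recovery steps above can be wrapped in a small script. This is only a sketch: it uses the same commands as this article (DRBD 8.4 syntax, resource `mysql`), while `RUN`, `RES` and the two function names are hypothetical. By default it only prints the commands so they can be reviewed first.

```shell
#!/bin/sh
# Sketch of the manual split-brain recovery steps from this article.
# RUN is a hypothetical switch: left unset it defaults to "echo" and the
# commands are only printed; run with RUN= (empty) to execute them.
RUN=${RUN-echo}
RES=${RES:-mysql}    # DRBD resource name used in this article

# Run on the node whose changes are to be discarded (PN2 in this article).
discard_local_changes() {
    $RUN /etc/init.d/heartbeat stop &&
    $RUN drbdadm secondary "$RES" &&
    $RUN drbdadm connect --discard-my-data "$RES"
}

# Run on the node whose data is kept (PN1 in this article).
reconnect_survivor() {
    $RUN drbdadm connect "$RES"
}
```

Everything written on the discarding node since the split is lost, so which node plays which part has to be decided by a person, not by the script.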
Verification
PN1 primary node
drbdadm verify mysql
PN1 primary node log
tail /var/log/messages
Dec 18 18:05:38 pn1 kernel: block drbd0: conn( Connected -> VerifyS )
Dec 18 18:05:38 pn1 kernel: block drbd0: Starting Online Verify from sector 0
Dec 18 18:10:00 pn1 kernel: block drbd0: Online verify  done (total 262 sec; paused 0 sec; 40020 K/sec)
Dec 18 18:10:00 pn1 kernel: block drbd0: conn( VerifyS -> Connected )
Dec 18 18:10:00 pn1 kernel: block drbd0: bitmap WRITE of 0 pages took 0 jiffies
Dec 18 18:10:00 pn1 kernel: block drbd0: 0 KB (0 bits) marked out-of-sync by on disk bit-map.
PN2 backup node log
tail /var/log/messages
Dec 18 18:05:37 pn2 kernel: block drbd0: conn( Connected -> VerifyT )
Dec 18 18:05:37 pn2 kernel: block drbd0: Online Verify start sector: 0
Dec 18 18:10:00 pn2 kernel: block drbd0: Online verify  done (total 262 sec; paused 0 sec; 40020 K/sec)
Dec 18 18:10:00 pn2 kernel: block drbd0: conn( VerifyT -> Connected )
Dec 18 18:10:00 pn2 kernel: block drbd0: bitmap WRITE of 0 pages took 0 jiffies
Dec 18 18:10:00 pn2 kernel: block drbd0: 0 KB (0 bits) marked out-of-sync by on disk bit-map.
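The out-of-sync count in the verify log can be extracted for scripted checks. A minimal sketch, assuming the "marked out-of-sync" message has the wording shown above; `verify_oos_kb` is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: reads kernel log lines on stdin and prints the KB
# count from the last "marked out-of-sync" line. Exits non-zero if no such
# line is present.
verify_oos_kb() {
    awk '/marked out-of-sync/ {
        for (i = 2; i <= NF; i++)
            if ($i == "KB") kb = $(i - 1)
    }
    END {
        if (kb == "") exit 1
        print kb
    }'
}

# Example use:
#   [ "$(verify_oos_kb < /var/log/messages)" = "0" ] && echo "in sync"
```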
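Summary
- Stopping the drbd or mysqld service does not trigger a VIP failover;
- A failover happens only when heartbeat stops receiving the peer's heartbeat packets and concludes that the peer is dead;
- heartbeat releases resources in this order: stop the mysqld service, umount the device, demote drbd to the secondary role, take down the vip;
- Stopping the heartbeat service makes the node release its resources automatically, which is useful for routine maintenance.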