Troubleshooting a remote node crash and reboot when stopping Clusterware

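This is the second case I have supported since joining my new company. I forgot to write up the first one. No more cutting corners: from now on I will record every case.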

This case comes from a 12.1 RAC environment. The symptom: when GI is stopped on one node, the OS on the remote node reboots.

Failure timeline:

2024-03-24 13:20:54-13:26:17 GI stopped on node 1
2024-03-24 13:26:12-13:29:58 node 2 OS rebooted
2024-03-24 13:38:22-13:39:23 GI stopped on node 2
2024-03-24 13:36:17-13:43:49 node 1 OS rebooted

The cluster alert log and the cssd log contained nothing useful. Fortunately, kdump produced a vmcore file at the time of the crash.

vmcore analysis:

crash> bt
PID: 117709  TASK: ffff9b1e51db1070  CPU: 42  COMMAND: "oks_rbld"
#0 [ffff9b500177f5b0] machine_kexec at ffffffff9b265754
#1 [ffff9b500177f610] __crash_kexec at ffffffff9b3209a2
#2 [ffff9b500177f6e0] crash_kexec at ffffffff9b320a90
#3 [ffff9b500177f6f8] oops_end at ffffffff9b983778
#4 [ffff9b500177f720] no_context at ffffffff9b274ad4
#5 [ffff9b500177f770] __bad_area_nosemaphore at ffffffff9b274da2
#6 [ffff9b500177f7c0] bad_area_nosemaphore at ffffffff9b274ec4
#7 [ffff9b500177f7d0] __do_page_fault at ffffffff9b986730
#8 [ffff9b500177f840] do_page_fault at ffffffff9b986955
#9 [ffff9b500177f870] page_fault at ffffffff9b982768
[exception RIP: asmStrategyB+23]
RIP: ffffffffc0e2324d  RSP: ffff9b500177f928  RFLAGS: 00010206
RAX: 00000000000038cc  RBX: 0000000000000000  RCX: 0000000000000000
RDX: ffff9b2434b02800  RSI: ffff9b2a6d3b3880  RDI: 0000000000000000
RBP: ffff9b500177f940   R8: 000000000001f160   R9: ffff9b2434b02800
R10: ffff9b1e7fc07500  R11: 0000000000000001  R12: ffff9b2a6d3b3880
R13: ffff9b2434b02800  R14: ffff9b2a6d3b3880  R15: ffff9b4e93218000
ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
#10 [ffff9b500177f948] asmIoctl_int at ffffffffc0e2348c [oracleadvm]
#11 [ffff9b500177f9d8] asmIoctl at ffffffffc0e235fc [oracleadvm]
#12 [ffff9b500177f9e8] blkdev_ioctl at ffffffff9b5612fa
#13 [ffff9b500177fa48] ioctl_by_bdev at ffffffff9b4887b3
#14 [ffff9b500177fa68] LinuxRWDisk at ffffffffc13863e4 [oracleacfs]
#15 [ffff9b500177fb78] OfsRecoveryGetLocalDirBlock at ffffffffc13575f3 [oracleacfs]
#16 [ffff9b500177fbb8] OfsGetRecoveryLockId at ffffffffc135a22c [oracleacfs]
#17 [ffff9b500177fc18] OfsCompleteRecoverySetup at ffffffffc135a795 [oracleacfs]
#18 [ffff9b500177fcb0] kcss_rbld_do_step at ffffffffc0d60aef [oracleoks]
#19 [ffff9b500177fd08] odlm_process_request at ffffffffc0d508c6 [oracleoks]
#20 [ffff9b500177fd60] odlm_comm_bcast at ffffffffc0d5654a [oracleoks]
#21 [ffff9b500177fde8] kcss_rbld_th at ffffffffc0d5fa00 [oracleoks]
#22 [ffff9b500177fe98] KsKthreadRun at ffffffffc0d2b16c [oracleoks]
#23 [ffff9b500177fec8] kthread at ffffffff9b2c50d1
crash> dis -l asmStrategyB+23 20
0xffffffffc0e2324d <asmStrategyB+23>:   mov    0x20(%rdi),%rdi
0xffffffffc0e23251 <asmStrategyB+27>:   callq  0xffffffffc0e1f7cd <bio_data_dir>
0xffffffffc0e23256 <asmStrategyB+32>:   cmp    $0x1,%eax
0xffffffffc0e23259 <asmStrategyB+35>:   jbe    0xffffffffc0e23278 <asmStrategyB+66>
0xffffffffc0e2325b <asmStrategyB+37>:   mov    $0x86a,%ecx
0xffffffffc0e23260 <asmStrategyB+42>:   mov    $0xffffffffc0e4804b,%rdx
0xffffffffc0e23267 <asmStrategyB+49>:   mov    $0x1,%esi
0xffffffffc0e2326c <asmStrategyB+54>:   mov    $0xffffffffc0e48ea4,%rdi
0xffffffffc0e23273 <asmStrategyB+61>:   callq  0xffffffffc0d2de90 <KsDoAssertion>
0xffffffffc0e23278 <asmStrategyB+66>:   test   %r13,%r13
0xffffffffc0e2327b <asmStrategyB+69>:   je     0xffffffffc0e23292 <asmStrategyB+92>
0xffffffffc0e2327d <asmStrategyB+71>:   mov    $0x1,%eax
0xffffffffc0e23282 <asmStrategyB+76>:   lock xadd %eax,0x2d166(%rip)        # 0xffffffffc0e503f0
0xffffffffc0e2328a <asmStrategyB+84>:   mov    %r13,%rdx
0xffffffffc0e2328d <asmStrategyB+87>:   jmpq   0xffffffffc0e23318 <asmStrategyB+226>
0xffffffffc0e23292 <asmStrategyB+92>:   cmpq   $0x1,0x2d42e(%rip)        # 0xffffffffc0e506c8
0xffffffffc0e2329a <asmStrategyB+100>:  jne    0xffffffffc0e23309 <asmStrategyB+211>
0xffffffffc0e2329c <asmStrategyB+102>:  callq  0xffffffffc0e231f0 <asmStackChkCall>
0xffffffffc0e232a1 <asmStrategyB+107>:  mov    %eax,%eax
0xffffffffc0e232a3 <asmStrategyB+109>:  cmp    0x2d426(%rip),%rax        # 0xffffffffc0e506d0
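
The backtrace and disassembly above can be read together: the faulting instruction is a load through RDI with a displacement of 0x20, and the register dump shows RDI as zero. The following is a minimal Python sketch of that arithmetic, not part of the original analysis. The helper names parse_registers and effective_address are made up for illustration, and the parser only handles the simple disp(%base) operand form seen here.

```python
import re

# Values copied from the crash output above.
REGISTER_DUMP = """
RIP: ffffffffc0e2324d  RSP: ffff9b500177f928  RFLAGS: 00010206
RAX: 00000000000038cc  RBX: 0000000000000000  RCX: 0000000000000000
RDX: ffff9b2434b02800  RSI: ffff9b2a6d3b3880  RDI: 0000000000000000
"""
FAULTING_INSN = "mov    0x20(%rdi),%rdi"


def parse_registers(dump: str) -> dict[str, int]:
    """Map lower-cased register names to their integer values."""
    pairs = re.findall(r"\b([A-Z0-9_]+): ([0-9a-f]+)\b", dump)
    return {name.lower(): int(value, 16) for name, value in pairs}


def effective_address(insn: str, regs: dict[str, int]) -> int:
    """Compute disp + base for a simple AT&T-syntax operand like 0x20(%rdi)."""
    match = re.search(r"(-?0x[0-9a-f]+)?\(%(\w+)\)", insn)
    if match is None:
        raise ValueError(f"no memory operand in {insn!r}")
    disp = int(match.group(1), 16) if match.group(1) else 0
    return disp + regs[match.group(2)]


regs = parse_registers(REGISTER_DUMP)
addr = effective_address(FAULTING_INSN, regs)
print(f"rdi = {regs['rdi']:#x}, faulting address = {addr:#x}")
```

With RDI equal to zero the load targets address 0x20, which lies in the unmapped first page, so this looks like a NULL pointer dereference and is consistent with the page_fault frames in the backtrace.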

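The panic was raised in the oks_rbld kernel thread: the faulting instruction at asmStrategyB+23 is mov 0x20(%rdi),%rdi while RDI is 0, so a NULL pointer was dereferenced during ACFS recovery, and the resulting kernel panic rebooted the OS. Searching MOS for the keywords oks_rbld and reboot returns several matching documents.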

ALERT: Kernel Panic While Applying JULY 2019 DB Bundle Patch or PSU in 12.1.0.2 Cluster Environment with ACFS (Doc ID 2573961.1)
While patching July 2019 DB PSU/BP on 12.1.0.2 cluster on remote nodes, patched node encounters kernel panic if ACFS driver is loaded in the system. This happens when Clusterware on the remote node is stopped.

The node with the July 2019 Grid Infrastructure PSU will panic and reboot if the ACFS drivers are loaded on the system.
This will happen when Clusterware is starting back up during postpatch while applying the July 2019 GI PSU to another node.
All nodes with the July 2019 GI PSU will panic.

Matching bug:

Bug 30139389 – ACFS Produces Kernel Panic After Installing 12.1 July 2019 GIPSU (Doc ID 30139389.8)

Scope:

Affects:

Product (Component)
Range of versions believed to be affected: Versions BELOW 12.2
Versions confirmed as being affected: 12.1.0.2.190716 (Jul 2019) Grid Infrastructure Patch Set Update (GI PSU)
Platforms affected: Generic (all / most platforms affected)

Fixed:

The fix for 30139389 is first included in:
12.2.0.1 (Base Release)
12.1.0.2.191015 (Oct 2019) Grid Infrastructure Patch Set Update (GI PSU)

It is recommended to apply the merge patch of 30293309 and 30139389.

According to the bug description, the bug is triggered on 12.1 when ACFS is in use and the July 2019 PSU has been applied.

The messages log also has matching output:

Mar 24 13:21:04 lm0ora12 multipathd: asm!vol_acfs-225: remove path (uevent)
Mar 24 13:21:04 lm0ora12 multipathd: asm!vol_acfs-225: path already removed
Mar 24 13:21:04 lm0ora12 kernel: ADVMK-0006: Volume vol_acfs-225 in diskgroup BACKUP disabled.
lm0ora12 kernel: OKSK-00004: Module load succeeded. Build information:   (LOW DEBUG) USM_12.1.0.2.0ACFSPSU_LINUX.X64_190531 2019/05/31 04:38:35
lm0ora12 kernel: ADVMK-0001: Module load succeeded. Build information:  (LOW DEBUG) - USM_12.1.0.2.0ACFSPSU_LINUX.X64_190531 built on 2019/05/31 05:23:44.
lm0ora12 kernel: ACFSK-0037: Module load succeeded. Build information:   (LOW DEBUG) USM_12.1.0.2.0ACFSPSU_LINUX.X64_190531 2019/05/31 07:16:34
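
The three "Module load succeeded" lines above each carry a USM build label ending in 190531, next to a build timestamp of 2019/05/31. As a rough sketch (the function name usm_label_dates and the regular expression are my own, not an Oracle tool), the label suffix can be pulled out per driver like this to confirm that OKS, ADVM and ACFS all come from the same build:

```python
import re

# The three module-load lines copied from the messages log above.
LOG_LINES = [
    "lm0ora12 kernel: OKSK-00004: Module load succeeded. Build information:   (LOW DEBUG) USM_12.1.0.2.0ACFSPSU_LINUX.X64_190531 2019/05/31 04:38:35",
    "lm0ora12 kernel: ADVMK-0001: Module load succeeded. Build information:  (LOW DEBUG) - USM_12.1.0.2.0ACFSPSU_LINUX.X64_190531 built on 2019/05/31 05:23:44.",
    "lm0ora12 kernel: ACFSK-0037: Module load succeeded. Build information:   (LOW DEBUG) USM_12.1.0.2.0ACFSPSU_LINUX.X64_190531 2019/05/31 07:16:34",
]

# Message prefix (OKSK/ADVMK/ACFSK), then the USM label and its 6-digit suffix.
LABEL_RE = re.compile(
    r"(OKSK|ADVMK|ACFSK)-\d+: Module load succeeded\..*?(USM_\S+?)_(\d{6})\b"
)


def usm_label_dates(lines: list[str]) -> dict[str, str]:
    """Return {message prefix: label suffix} for each module-load line."""
    found = {}
    for line in lines:
        match = LABEL_RE.search(line)
        if match:
            found[match.group(1)] = match.group(3)
    return found


labels = usm_label_dates(LOG_LINES)
for driver, suffix in labels.items():
    print(f"{driver}: USM label suffix {suffix}")
```

All three drivers report the same suffix. Reading it as a YYMMDD build date is my interpretation, based on the 2019/05/31 timestamp printed on the same lines.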

Current PSU information for GI:

Patch  29494060     : applied on Fri Dec 04 17:05:48 CST 2020
Unique Patch ID:  22993235
Patch description:  "Database Patch Set Update : 12.1.0.2.190716 (29494060)"
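
The check against the bug note boils down to comparing the PSU date in the patch description with the two versions quoted from Bug 30139389: 12.1.0.2.190716 (confirmed affected) and 12.1.0.2.191015 (first PSU containing the fix). A small sketch of that comparison follows. The function name classify_psu and the three result strings are invented for illustration, and the logic deliberately ignores one-off or merge patches, which would need a separate check.

```python
import re

# Versions quoted from the bug note above.
CONFIRMED_AFFECTED = 190716   # 12.1.0.2.190716 (Jul 2019)
FIRST_FIXED = 191015          # 12.1.0.2.191015 (Oct 2019)


def classify_psu(description: str) -> str:
    """Classify a 12.1.0.2 PSU description against Bug 30139389.

    Simplification: only the PSU date is considered, not one-off patches.
    """
    match = re.search(r"12\.1\.0\.2\.(\d{6})", description)
    if match is None:
        return "unknown"
    psu_date = int(match.group(1))
    if psu_date >= FIRST_FIXED:
        return "fixed"
    if psu_date == CONFIRMED_AFFECTED:
        return "confirmed affected"
    return "not confirmed"


current = 'Database Patch Set Update : 12.1.0.2.190716 (29494060)'
print(classify_psu(current))
```

For the patch description shown above, the PSU date is 190716, which is exactly the version the bug note lists as confirmed affected.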

This matches as well.

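The workaround section offers four ways to avoid this bug:

Unmount all ACFS file systems before stopping GI
Roll back the July 2019 PSU
Apply the October 2019 PSU
Apply the one-off patch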
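
For the first workaround, it helps to know which ACFS file systems are still mounted before GI is stopped. The sketch below is only an illustration under a few assumptions: it takes text in the /proc/mounts format, it assumes ACFS mounts appear there with file system type acfs, and the sample mount point /backup is hypothetical (only the volume name vol_acfs-225 appears in the logs above). It prints the umount commands rather than running them.

```python
def acfs_mount_points(mounts_text: str) -> list[str]:
    """Return mount points whose file system type is 'acfs'.

    Expects /proc/mounts format: device, mount point, fs type, options, ...
    """
    points = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 3 and fields[2] == "acfs":
            points.append(fields[1])
    return points


# Hypothetical sample; on a real node, read the text from /proc/mounts instead.
SAMPLE_MOUNTS = """\
/dev/sda1 / xfs rw,relatime 0 0
/dev/asm/vol_acfs-225 /backup acfs rw,relatime 0 0
tmpfs /dev/shm tmpfs rw,nosuid,nodev 0 0
"""

for mount_point in acfs_mount_points(SAMPLE_MOUNTS):
    # Print only; review the list before unmounting anything by hand.
    print(f"umount {mount_point}")
```

An empty result would suggest no ACFS file system is mounted on that node, so stopping GI there should not hit the recovery path seen in the backtrace.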

Just a brief record of the issue. Over.
