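This is the second case I have supported since joining the new company. I forgot to write up the first one; no more slacking off, from now on every case gets recorded.
This case comes from a 12.1 RAC environment. The symptom: when GI is shut down on one node, the OS on the remote node reboots.
Failure timeline: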
dateline:
2024-03-24 13:20:54-13:26:17 GI stopped on node 1
2024-03-24 13:26:12-13:29:58 node 2 OS reboot
2024-03-24 13:38:22-13:39:23 GI stopped on node 2
2024-03-24 13:36:17-13:43:49 node 1 OS reboot
The cluster alert log and the cssd log contained nothing useful. Fortunately, kdump produced a vmcore file at crash time.
vmcore analysis:
crash> bt
PID: 117709  TASK: ffff9b1e51db1070  CPU: 42  COMMAND: "oks_rbld"
 #0 [ffff9b500177f5b0] machine_kexec at ffffffff9b265754
 #1 [ffff9b500177f610] __crash_kexec at ffffffff9b3209a2
 #2 [ffff9b500177f6e0] crash_kexec at ffffffff9b320a90
 #3 [ffff9b500177f6f8] oops_end at ffffffff9b983778
 #4 [ffff9b500177f720] no_context at ffffffff9b274ad4
 #5 [ffff9b500177f770] __bad_area_nosemaphore at ffffffff9b274da2
 #6 [ffff9b500177f7c0] bad_area_nosemaphore at ffffffff9b274ec4
 #7 [ffff9b500177f7d0] __do_page_fault at ffffffff9b986730
 #8 [ffff9b500177f840] do_page_fault at ffffffff9b986955
 #9 [ffff9b500177f870] page_fault at ffffffff9b982768
    [exception RIP: asmStrategyB+23]
    RIP: ffffffffc0e2324d  RSP: ffff9b500177f928  RFLAGS: 00010206
    RAX: 00000000000038cc  RBX: 0000000000000000  RCX: 0000000000000000
    RDX: ffff9b2434b02800  RSI: ffff9b2a6d3b3880  RDI: 0000000000000000
    RBP: ffff9b500177f940   R8: 000000000001f160   R9: ffff9b2434b02800
    R10: ffff9b1e7fc07500  R11: 0000000000000001  R12: ffff9b2a6d3b3880
    R13: ffff9b2434b02800  R14: ffff9b2a6d3b3880  R15: ffff9b4e93218000
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
#10 [ffff9b500177f948] asmIoctl_int at ffffffffc0e2348c [oracleadvm]
#11 [ffff9b500177f9d8] asmIoctl at ffffffffc0e235fc [oracleadvm]
#12 [ffff9b500177f9e8] blkdev_ioctl at ffffffff9b5612fa
#13 [ffff9b500177fa48] ioctl_by_bdev at ffffffff9b4887b3
#14 [ffff9b500177fa68] LinuxRWDisk at ffffffffc13863e4 [oracleacfs]
#15 [ffff9b500177fb78] OfsRecoveryGetLocalDirBlock at ffffffffc13575f3 [oracleacfs]
#16 [ffff9b500177fbb8] OfsGetRecoveryLockId at ffffffffc135a22c [oracleacfs]
#17 [ffff9b500177fc18] OfsCompleteRecoverySetup at ffffffffc135a795 [oracleacfs]
#18 [ffff9b500177fcb0] kcss_rbld_do_step at ffffffffc0d60aef [oracleoks]
#19 [ffff9b500177fd08] odlm_process_request at ffffffffc0d508c6 [oracleoks]
#20 [ffff9b500177fd60] odlm_comm_bcast at ffffffffc0d5654a [oracleoks]
#21 [ffff9b500177fde8] kcss_rbld_th at ffffffffc0d5fa00 [oracleoks]
#22 [ffff9b500177fe98] KsKthreadRun at ffffffffc0d2b16c [oracleoks]
#23 [ffff9b500177fec8] kthread at ffffffff9b2c50d1
crash> dis -l asmStrategyB+23 20
0xffffffffc0e2324d <asmStrategyB+23>:   mov    0x20(%rdi),%rdi
0xffffffffc0e23251 <asmStrategyB+27>:   callq  0xffffffffc0e1f7cd <bio_data_dir>
0xffffffffc0e23256 <asmStrategyB+32>:   cmp    $0x1,%eax
0xffffffffc0e23259 <asmStrategyB+35>:   jbe    0xffffffffc0e23278 <asmStrategyB+66>
0xffffffffc0e2325b <asmStrategyB+37>:   mov    $0x86a,%ecx
0xffffffffc0e23260 <asmStrategyB+42>:   mov    $0xffffffffc0e4804b,%rdx
0xffffffffc0e23267 <asmStrategyB+49>:   mov    $0x1,%esi
0xffffffffc0e2326c <asmStrategyB+54>:   mov    $0xffffffffc0e48ea4,%rdi
0xffffffffc0e23273 <asmStrategyB+61>:   callq  0xffffffffc0d2de90 <KsDoAssertion>
0xffffffffc0e23278 <asmStrategyB+66>:   test   %r13,%r13
0xffffffffc0e2327b <asmStrategyB+69>:   je     0xffffffffc0e23292 <asmStrategyB+92>
0xffffffffc0e2327d <asmStrategyB+71>:   mov    $0x1,%eax
0xffffffffc0e23282 <asmStrategyB+76>:   lock xadd %eax,0x2d166(%rip)        # 0xffffffffc0e503f0
0xffffffffc0e2328a <asmStrategyB+84>:   mov    %r13,%rdx
0xffffffffc0e2328d <asmStrategyB+87>:   jmpq   0xffffffffc0e23318 <asmStrategyB+226>
0xffffffffc0e23292 <asmStrategyB+92>:   cmpq   $0x1,0x2d42e(%rip)        # 0xffffffffc0e506c8
0xffffffffc0e2329a <asmStrategyB+100>:  jne    0xffffffffc0e23309 <asmStrategyB+211>
0xffffffffc0e2329c <asmStrategyB+102>:  callq  0xffffffffc0e231f0 <asmStackChkCall>
0xffffffffc0e232a1 <asmStrategyB+107>:  mov    %eax,%eax
0xffffffffc0e232a3 <asmStrategyB+109>:  cmp    0x2d426(%rip),%rax        # 0xffffffffc0e506d0
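The kernel panic that rebooted the OS was raised in the oks_rbld kernel thread: it took a page fault at asmStrategyB+23 (called from asmIoctl_int in the oracleadvm driver), where mov 0x20(%rdi),%rdi executes with RDI = 0, i.e. a NULL pointer dereference. Searching MOS for the keywords oks_rbld and reboot turns up several matching documents: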
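To double-check the reading of the faulting instruction, here is a minimal Python sketch that recomputes the address the instruction tried to load from, using the register values and the first disassembled line from the crash session. The helper names (parse_registers, effective_address) are my own, and the parser only handles the simple disp(%reg) operand form seen here.

```python
import re

# Excerpts copied from the crash session in this post.
REGISTERS = """
RIP: ffffffffc0e2324d RSP: ffff9b500177f928 RFLAGS: 00010206
RAX: 00000000000038cc RBX: 0000000000000000 RCX: 0000000000000000
RDX: ffff9b2434b02800 RSI: ffff9b2a6d3b3880 RDI: 0000000000000000
"""
FAULTING_INSN = "mov 0x20(%rdi),%rdi"


def parse_registers(text):
    """Return {lowercase register name: integer value} from a register dump."""
    pairs = re.findall(r"\b([A-Z0-9_]+):\s+([0-9a-f]+)\b", text)
    return {name.lower(): int(value, 16) for name, value in pairs}


def effective_address(insn, regs):
    """Compute disp + base for the first disp(%base) operand (AT&T syntax).

    Hypothetical helper: indexed/scaled operands are not handled.
    """
    m = re.search(r"(-?0x[0-9a-f]+)?\(%(\w+)\)", insn)
    if m is None:
        raise ValueError("no memory operand of the form disp(%reg)")
    disp = int(m.group(1), 16) if m.group(1) else 0
    return regs[m.group(2)] + disp


regs = parse_registers(REGISTERS)
addr = effective_address(FAULTING_INSN, regs)
print(hex(addr), "-> inside page 0" if addr < 4096 else "-> not a low address")
```

An address inside the first page is the usual signature of reading a structure field through a NULL pointer, here a field at offset 0x20.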
ALERT: Kernel Panic While Applying JULY 2019 DB Bundle Patch or PSU in 12.1.0.2 Cluster Environment with ACFS (Doc ID 2573961.1)
While patching July 2019 DB PSU/BP on 12.1.0.2 cluster on remote nodes, patched node encounters kernel panic if ACFS driver is loaded in the system. This happens when Clusterware on the remote node is stopped.
The node with the July 2019 Grid Infrastructure PSU will panic and reboot if the ACFS drivers are loaded on the system. This will happen when Clusterware is starting back up during postpatch while applying the July 2019 GI PSU to another node. All nodes with the July 2019 GI PSU will panic.
Matching bug:
- Bug 30139389 – ACFS Produces Kernel Panic After Installing 12.1 July 2019 GIPSU (Doc ID 30139389.8)
Scope:
Affects:
- Product (Component):
- Range of versions believed to be affected: Versions BELOW 12.2
- Versions confirmed as being affected:
- Platforms affected: Generic (all / most platforms affected)
Fixed:
The fix for 30139389 is first included in
It is recommended to apply the merge patch of 30293309 and 30139389.
According to the bug description, a 12.1 system that uses ACFS and has the July 2019 PSU applied will hit this bug.
The messages log also shows matching output:
Mar 24 13:21:04 lm0ora12 multipathd: asm!vol_acfs-225: remove path (uevent)
Mar 24 13:21:04 lm0ora12 multipathd: asm!vol_acfs-225: path already removed
Mar 24 13:21:04 lm0ora12 kernel: ADVMK-0006: Volume vol_acfs-225 in diskgroup BACKUP disabled.
lm0ora12 kernel: OKSK-00004: Module load succeeded. Build information: (LOW DEBUG) USM_12.1.0.2.0ACFSPSU_LINUX.X64_190531 2019/05/31 04:38:35
lm0ora12 kernel: ADVMK-0001: Module load succeeded. Build information: (LOW DEBUG) - USM_12.1.0.2.0ACFSPSU_LINUX.X64_190531 built on 2019/05/31 05:23:44.
lm0ora12 kernel: ACFSK-0037: Module load succeeded. Build information: (LOW DEBUG) USM_12.1.0.2.0ACFSPSU_LINUX.X64_190531 2019/05/31 07:16:34
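When checking many nodes, the driver build labels can be pulled out of the messages log mechanically. The sketch below is a rough Python example that extracts the USM build label and its trailing date stamp from the three module-load lines above. The regular expression and the function name driver_builds are my own, and reading the six-digit suffix as YYMMDD is an assumption based on how it lines up with the build timestamps in the same lines.

```python
import re
from datetime import datetime

# Module-load lines copied from the messages log in this post.
MESSAGES = """\
lm0ora12 kernel: OKSK-00004: Module load succeeded. Build information: (LOW DEBUG) USM_12.1.0.2.0ACFSPSU_LINUX.X64_190531 2019/05/31 04:38:35
lm0ora12 kernel: ADVMK-0001: Module load succeeded. Build information: (LOW DEBUG) - USM_12.1.0.2.0ACFSPSU_LINUX.X64_190531 built on 2019/05/31 05:23:44.
lm0ora12 kernel: ACFSK-0037: Module load succeeded. Build information: (LOW DEBUG) USM_12.1.0.2.0ACFSPSU_LINUX.X64_190531 2019/05/31 07:16:34
"""

# Component prefix (OKSK/ADVMK/ACFSK), full label, and the label's date suffix.
LOAD_RE = re.compile(
    r"kernel: (\w+)-\d+: Module load succeeded\. .*?(USM_\S+_(\d{6}))\b"
)


def driver_builds(text):
    """Return {component: (build label, label date)} for each module-load line."""
    builds = {}
    for line in text.splitlines():
        m = LOAD_RE.search(line)
        if m:
            component, label, stamp = m.groups()
            # Assumption: the suffix is YYMMDD.
            builds[component] = (label, datetime.strptime(stamp, "%y%m%d").date())
    return builds


for component, (label, built) in driver_builds(MESSAGES).items():
    print(component, label, built.isoformat())
```

All three drivers (OKS, ADVM, ACFS) report the same ACFSPSU label, so they come from one build; whether a given label is affected still has to be confirmed against the MOS notes.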
Current GI PSU information:
Patch 29494060 : applied on Fri Dec 04 17:05:48 CST 2020
Unique Patch ID: 22993235
Patch description: "Database Patch Set Update : 12.1.0.2.190716 (29494060)"
The installed PSU is 12.1.0.2.190716, i.e. the July 2019 PSU, so this matches as well.
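The same check can be scripted when there are many homes to inspect. This is a minimal Python sketch, assuming the patch inventory text looks like the three lines shown above; the regular expression and the names psu_releases and is_july_2019 are hypothetical helpers of mine, not part of any Oracle tooling.

```python
import re
from datetime import date

# Inventory lines copied from this post.
OPATCH_OUTPUT = '''\
Patch 29494060 : applied on Fri Dec 04 17:05:48 CST 2020
Unique Patch ID: 22993235
Patch description: "Database Patch Set Update : 12.1.0.2.190716 (29494060)"
'''

# Base version (four components), YYMMDD release stamp, patch number.
PSU_RE = re.compile(
    r"Patch Set Update\s*:\s*(\d+(?:\.\d+){3})\.(\d{2})(\d{2})(\d{2})\s*\((\d+)\)"
)


def psu_releases(text):
    """Return [(base version, release date, patch number)] for each PSU line."""
    return [
        (base, date(2000 + int(yy), int(mm), int(dd)), int(patch))
        for base, yy, mm, dd, patch in PSU_RE.findall(text)
    ]


def is_july_2019(release):
    """True when the PSU release stamp falls in July 2019."""
    return (release.year, release.month) == (2019, 7)


for base, released, patch in psu_releases(OPATCH_OUTPUT):
    flag = "July 2019 PSU, check ACFS" if is_july_2019(released) else "other PSU"
    print(base, released.isoformat(), patch, flag)
```

A hit only means the home is at the PSU level named in the MOS alert; the ACFS drivers also have to be loaded for the panic to occur.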
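The workaround section lists four ways to avoid this bug:
- Unmount all ACFS file systems before stopping GI
- Roll back the July 2019 PSU
- Apply the October 2019 PSU
- Apply the one-off patch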
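For the first workaround, a small pre-check before stopping GI can confirm that no ACFS file system is still mounted. The Python sketch below parses /proc/mounts-style text and reports mount points with file system type acfs. The sample text is illustrative only: the device name is taken from the messages log above, while the mount point /acfs_backup and the mount options are made up.

```python
# Illustrative /proc/mounts content; mount point and options are hypothetical.
SAMPLE_MOUNTS = """\
/dev/mapper/vg-root / xfs rw,relatime 0 0
/dev/asm/vol_acfs-225 /acfs_backup acfs rw,relatime 0 0
"""


def acfs_mount_points(mounts_text):
    """Return mount points whose file system type is acfs."""
    points = []
    for line in mounts_text.splitlines():
        fields = line.split()
        # /proc/mounts fields: device, mount point, fstype, options, dump, pass
        if len(fields) >= 3 and fields[2] == "acfs":
            points.append(fields[1])
    return points


def safe_to_stop_gi(mounts_text):
    """True when no ACFS file system is mounted (the first workaround)."""
    return not acfs_mount_points(mounts_text)


still_mounted = acfs_mount_points(SAMPLE_MOUNTS)
if still_mounted:
    print("unmount before stopping GI:", ", ".join(still_mounted))
else:
    print("no ACFS mounts, OK to stop GI")
```

On a real node the input would be the contents of /proc/mounts instead of the sample string. The check has to be run on every node in the cluster, since it is the nodes that stay up that panic.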
That is a brief record of the issue. Over.