一、背景
备库为 3 节点 的11.2.0.4的RAC 。本次操作在 12 节点取消 MRP,并执行 finish force ,activate触发 failover。
Activate会触发备库所有 thread 均需关闭。
二、各节点 alert log
11 节点:
|
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 |
Sat Jun 13 20:43:17 2026 Completed: alter database recover managed standby database finish force Sat Jun 13 20:43:17 2026 alter database activate physical standby database ALTER DATABASE ACTIVATE [PHYSICAL] STANDBY DATABASE (prduf12) CLOSE: killing server sessions. Sat Jun 13 20:43:33 2026 <-- SMON: disabling cache recovery 这段时间经过了5分钟 Sat Jun 13 20:48:15 2026 <-- RFS[25]: Assigned to RFS process 30068 RFS[25]: No connections allowed during/after terminal recovery. Sat Jun 13 20:48:34 2026 ARC5: Archiving not possible: error count exceeded Sat Jun 13 20:48:56 2026 RFS[26]: Assigned to RFS process 36233 RFS[26]: No connections allowed during/after terminal recovery. Sat Jun 13 20:49:56 2026 Begin: Standby Redo Logfile archival End: Standby Redo Logfile archival Sat Jun 13 20:49:56 2026 Archiver process freed from errors. No longer stopped Standby terminal recovery start SCN: 2687187656678 RESETLOGS after complete recovery through change 2687187657129 |
12 节点:
|
1 2 3 4 5 6 7 8 9 10 11 12 13 |
Sat Jun 13 20:43:59 2026 SMON: disabling cache recovery Sat Jun 13 20:48:14 2026 ARC6: Archiving not possible: error count exceeded Sat Jun 13 20:48:59 2026 RFS[17]: Assigned to RFS process 5492 RFS[17]: No connections allowed during/after terminal recovery. Sat Jun 13 20:49:15 2026 RFS[18]: Assigned to RFS process 22774 RFS[18]: No connections allowed during/after terminal recovery. Sat Jun 13 20:50:00 2026 Setting recovery target incarnation to 38 ACTIVATE STANDBY: Complete - Database mounted as primary |
13 节点:
|
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 |
Sat Jun 13 20:43:46 2026 SMON: disabling cache recovery Sat Jun 13 20:43:47 2026 ARC2: Waiting for instance close to complete Sat Jun 13 20:44:08 2026 ARC1: Waiting for instance close to complete Sat Jun 13 20:44:31 2026 ARC4: Waiting for instance close to complete Sat Jun 13 20:44:35 2026 ARC5: Waiting for instance close to complete Sat Jun 13 20:45:13 2026 ARC3: Waiting for instance close to complete Sat Jun 13 20:45:45 2026 ARC6: Waiting for instance close to complete Sat Jun 13 20:45:47 2026 ARC2: Wait for instance close timed out ARC2: Archiving not possible: error count exceeded ARC2: Waiting for instance close to complete Sat Jun 13 20:46:00 2026 ARC0: Waiting for instance close to complete Sat Jun 13 20:46:08 2026 ARC1: Wait for instance close timed out ARC1: Archiving not possible: error count exceeded ARC1: Waiting for instance close to complete Sat Jun 13 20:46:31 2026 ARC4: Wait for instance close timed out ARC4: Archiving not possible: error count exceeded ARC4: Waiting for instance close to complete Sat Jun 13 20:46:35 2026 ARC5: Wait for instance close timed out Sat Jun 13 20:47:13 2026 ARC3: Wait for instance close timed out Sat Jun 13 20:47:45 2026 ARC6: Wait for instance close timed out Sat Jun 13 20:47:47 2026 ARC2: Wait for instance close timed out ARCH: Archival stopped, error occurred. Will continue retrying ORACLE Instance prduf13 - Archival Error ORA-16014: log 51 sequence# 86001 not archived, no available destinations ORA-00312: online log 51 thread 3: '+ASMDATA/prduf1/onlinelog/group_51.750.1107126615' ORA-00312: online log 51 thread 3: '+ASMDATA/prduf1/onlinelog/group_51.772.1107126617' Sat Jun 13 20:48:00 2026 ARC0: Wait for instance close timed out Archiver process freed from errors. No longer stopped Sat Jun 13 20:48:08 2026 ARC1: Wait for instance close timed out ARCH: Archival stopped, error occurred. Will continue retrying ORACLE Instance prduf13 - Archival Error ORA-16014: log 31 sequence# 107710 not archived, no available destinations ORA-00312: online log 31 thread 1: '+ASMDATA/prduf1/onlinelog/group_31.624.1107126565' ORA-00312: online log 31 thread 1: '+ASMDATA/prduf1/onlinelog/group_31.836.1107126567' Sat Jun 13 20:48:31 2026 ARC4: Wait for instance close timed out Sat Jun 13 20:50:00 2026 Setting recovery target incarnation to 38 ACTIVATE STANDBY: Complete - Database mounted as primary |
三、分析
切换期间,11、12 节点均在等待 13 节点的 thread 关闭,而 13 节点的 thread 关闭过程耗时接近 5 分钟;13 节点 arch 进程持续出现的等待事件 Wait on stby instance close 也印证了这一点。
ASH 分析显示,在 20:43—20:50 期间,活跃会话基本仅为 13 节点的 CKPT,整体等待链大致如下:
|
1 2 3 4 5 6 7 8 9 10 11 12 13 14 |
[13:7753] CKPT ON CPU (等待 latch: gc element wait_seq#6186) 374s 20:43:00~20:49:55 BLK=ROOT(因为wait_time>0并且wait_seq#持续没有变化, 所以其实CKPT在ON CPU上非等待) ├─ [12:9468] sqlplus@prduf12 WAIT DFS lock handle 400s 20:43:14~20:49:55 BLK=[inst=13, sid=7753] └─ [12:5190] sqlplus@prduf12 WAIT DFS lock handle 340s 20:44:14~20:49:55 BLK=[inst=13, sid=7753] ← activate会话 └─ [11:2480] sqlplus@prduf11 WAIT DFS lock handle 318s 20:44:35~20:49:55 BLK=[inst=12, sid=5190] ├─ [11:11418] sqlplus@prduf11 WAIT DFS lock handle 261s 20:44:30~20:49:55 BLK=[inst=11, sid=2480] │ └─ [11:11273] sqlplus@prduf11 DFS lock handle 203s 20:45:27~20:49:55 BLK=[inst=11, sid=11418] │ └─ [11:12228] sqlplus@prduf11 DFS lock handle 140s 20:47:35~20:49:55 BLK=[inst=11, sid=11273] │ └─ [11:11906] sqlplus@prduf11 DFS lock handle 95s 20:45:43~20:49:55 BLK=[inst=11, sid=12228] │ └─ [11:11097] sqlplus@prduf11 DFS lock handle 21s 20:49:35~20:49:55 BLK=[inst=11, sid=11906] └─ [12:10821] sqlplus@prduf12 WAIT DFS lock handle 280s 20:45:15~20:49:55 BLK=[inst=11, sid=2480] └─ [12:11263] sqlplus@prduf12 DFS lock handle 221s 20:46:14~20:49:55 BLK=[inst=12, sid=10821] └─ [12:11458] sqlplus@prduf12 DFS lock handle 161s 20:47:14~20:49:55 BLK=[inst=12, sid=11263] └─ [12:11568] sqlplus@prduf12 DFS lock handle 101s 20:48:14~20:49:55 BLK=[inst=12, sid=11458] └─ [12:9577] sqlplus@prduf12 DFS lock handle 42s 20:49:14~20:49:55 BLK=[inst=12, sid=11568] |
四、CKPT 的 TOP 信息
|
1 2 3 4 5 6 7 8 9 10 11 12 13 14 |
zzz ***Sat Jun 13 20:43:51 CST 2026 top - 20:43:52 up 231 days, 5:44, 2 users, load average: 0.91, 0.53, 0.37 Tasks: 2110 total, 2 running, 2107 sleeping, 1 stopped, 0 zombie %Cpu(s): 0.6 us, 0.2 sy, 0.0 ni, 99.3 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st KiB Mem : 10562811+total, 29444150+free, 54196697+used, 21987268+buff/cache KiB Swap: 10485760+total, 10485760+free, 0 used. 48608428+avail Mem PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 87295 oracle 20 0 500.2g 23996 16768 R 100.0 0.0 619:08.66 oracle 36429 root 20 0 7488 3836 680 S 8.7 0.0 0:00.18 pidstat 36430 root 20 0 7488 3844 680 S 8.7 0.0 0:00.17 pidstat 87036 oracle 20 0 500.3g 117832 14688 S 6.7 0.0 1315:58 oracle 63969 root 20 0 961752 359444 17520 T 4.8 0.0 17168:19 titanagent |
只有一个进程在满负荷运作,由 ps 输出可见,该进程正是 CKPT。问题时间段一直持续着CPU 100%
|
1 |
oracle 87295 1 19 1.5 0.0 524545068 22340 SYSC_semtimedop S May 16 10:18:44 ora_ckpt_prduf13 |
|
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 |
20:43:41 87295 oracle 20 0 500.2g 22340 16380 S 1.0 0.0 619:03.32 oracle 20:43:51 87295 oracle 20 0 500.2g 23996 16768 R 100.0 0.0 619:08.66 oracle 20:44:01 87295 oracle 20 0 500.2g 23996 16768 R 100.0 0.0 619:18.65 oracle 20:44:11 87295 oracle 20 0 500.2g 23996 16768 R 100.0 0.0 619:28.64 oracle 20:44:21 87295 oracle 20 0 500.2g 23996 16768 R 100.0 0.0 619:38.64 oracle 20:44:31 87295 oracle 20 0 500.2g 23996 16768 R 100.0 0.0 619:48.63 oracle 20:44:41 87295 oracle 20 0 500.2g 23996 16768 R 100.0 0.0 619:58.62 oracle 20:44:51 87295 oracle 20 0 500.2g 23996 16768 R 100.0 0.0 620:08.62 oracle 20:45:01 87295 oracle 20 0 500.2g 23996 16768 R 99.0 0.0 620:18.62 oracle 20:45:11 87295 oracle 20 0 500.2g 23996 16768 R 99.0 0.0 620:28.60 oracle 20:45:21 87295 oracle 20 0 500.2g 23996 16768 R 100.0 0.0 620:38.61 oracle 20:45:31 87295 oracle 20 0 500.2g 23996 16768 R 100.0 0.0 620:48.60 oracle 20:45:41 87295 oracle 20 0 500.2g 23996 16768 R 100.0 0.0 620:58.59 oracle 20:45:51 87295 oracle 20 0 500.2g 23996 16768 R 100.0 0.0 621:08.59 oracle 20:46:01 87295 oracle 20 0 500.3g 34492 27248 R 99.0 0.0 621:18.58 oracle 20:46:11 87295 oracle 20 0 500.3g 34492 27248 R 100.0 0.0 621:28.57 oracle 20:46:21 87295 oracle 20 0 500.3g 34492 27248 R 100.0 0.0 621:38.57 oracle 20:46:31 87295 oracle 20 0 500.3g 34492 27248 R 100.0 0.0 621:48.56 oracle 20:46:41 87295 oracle 20 0 500.3g 34492 27248 R 100.0 0.0 621:58.56 oracle 20:46:51 87295 oracle 20 0 500.3g 34492 27248 R 99.0 0.0 622:08.55 oracle 20:47:01 87295 oracle 20 0 500.3g 34492 27248 R 99.0 0.0 622:18.55 oracle 20:47:11 87295 oracle 20 0 500.3g 34492 27248 R 100.0 0.0 622:28.55 oracle 20:47:21 87295 oracle 20 0 500.3g 34492 27248 R 99.0 0.0 622:38.54 oracle 20:47:31 87295 oracle 20 0 500.3g 34492 27248 R 99.0 0.0 622:48.53 oracle 20:47:41 87295 oracle 20 0 500.3g 34492 27248 R 100.0 0.0 622:58.54 oracle 20:47:51 87295 oracle 20 0 500.3g 34492 27248 R 100.0 0.0 623:08.53 oracle 20:48:01 87295 oracle 20 0 500.3g 34492 27248 R 100.0 0.0 623:18.53 oracle 20:48:11 87295 oracle 20 0 500.3g 34492 27248 R 100.0 0.0 623:28.52 oracle 20:48:21 87295 oracle 20 0 500.3g 34492 27248 R 100.0 0.0 623:38.52 oracle 20:48:31 87295 oracle 20 0 500.3g 34492 27248 R 100.0 0.0 623:48.51 oracle 20:48:41 87295 oracle 20 0 500.3g 34492 27248 R 100.0 0.0 623:58.51 oracle 20:48:51 87295 oracle 20 0 500.3g 34492 27248 R 99.0 0.0 624:08.50 oracle 20:49:01 87295 oracle 20 0 500.3g 34492 27248 R 100.0 0.0 624:18.50 oracle 20:49:11 87295 oracle 20 0 500.3g 34492 27248 R 100.0 0.0 624:28.49 oracle 20:49:21 87295 oracle 20 0 500.3g 34492 27248 R 100.0 0.0 624:38.49 oracle 20:49:31 87295 oracle 20 0 500.3g 34492 27248 R 100.0 0.0 624:48.49 oracle 20:49:41 87295 oracle 20 0 500.3g 34492 27248 R 100.0 0.0 624:58.49 oracle 20:49:51 87295 oracle 20 0 500.3g 34540 27296 R 100.0 0.0 625:08.27 oracle |
五、结论
初步判断 CKPT 很可能处于 CPU spin 状态(大概率是 bug),但由于所有 trace 中均未抓取到 CKPT 的 call stack,无法确认其在此期间究竟在做什么,也无法通过堆栈去搜查具体问题。
六、建议下次复现问题时
|
1 |
pstack XXX(CKPT的PID),每隔 3s 执行一次,连续执行 5~6 次并记录输出 |
或者
|
1 2 |
oradebug setospid XXX(CKPT的PID) oradebug short_stack,每隔 3s 执行一次,连续执行 5~6 次并记录输出 |
七、规避方案(如果不需要分析 bug)
建议在 cancel MRP 之前,先将非 MRP 所在实例关闭(保留 MRP 所在实例继续 apply,其余两个实例以 shutdown abort 方式关闭),而非仅依赖 thread close;待 failover 完成后再重新启动这两个实例。