This case comes from a customer in the western region. The database is a 2-node RAC running 19.16; the DB instance on node 2 was evicted with ORA-29740.
Since node 2 was the one evicted, start with node 2's alert log:
2023-11-06T22:24:38.077225+08:00
Detected an inconsistent instance membership by instance 1
Errors in file /oracle/app/oracle/diag/rdbms/racdb/racdb2/trace/racdb2_lmon_363356.trc (incident=4320189):
ORA-29740: evicted by instance number 1, group incarnation 13
Incident details in: /oracle/app/oracle/diag/rdbms/racdb/racdb2/incident/incdir_4320189/racdb2_lmon_363356_i4320189.trc
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
2023-11-06T22:24:38.976251+08:00
Errors in file /oracle/app/oracle/diag/rdbms/racdb/racdb2/trace/racdb2_lmon_363356.trc:
ORA-29740: evicted by instance number 1, group incarnation 13
Errors in file /oracle/app/oracle/diag/rdbms/racdb/racdb2/trace/racdb2_lmon_363356.trc (incident=4320190):
ORA-29740 [] [] [] [] [] [] [] [] [] [] [] []
Incident details in: /oracle/app/oracle/diag/rdbms/racdb/racdb2/incident/incdir_4320190/racdb2_lmon_363356_i4320190.trc
2023-11-06T22:24:39.693332+08:00
LMON (ospid: 363356): terminating the instance due to ORA error 29740
At 22:24:38 node 2 was evicted by node 1, and LMON produced a trace file. I did not bother reading it, because at the CGS layer the eviction of a DB instance is itself carried out by the LMON process. Since it was node 1 that evicted node 2, node 1's alert log must not be skipped.
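When triaging an eviction like this, the first pass over both nodes' alert logs only needs the membership- and IPC-related lines. A minimal sketch (the keyword list is drawn from the messages seen in this case; the path is whatever alert log you point it at):

```python
# Filter an Oracle alert log down to eviction / interconnect messages.
# KEYWORDS are taken from this incident's logs; extend as needed.
KEYWORDS = ("ORA-29740", "IPC Send timeout", "Evicting instance",
            "inconsistent instance membership", "Reconfiguration started")

def scan_alert(path: str) -> list[str]:
    """Return alert-log lines that mention eviction or interconnect trouble."""
    with open(path, errors="replace") as f:
        return [ln.rstrip() for ln in f if any(k in ln for k in KEYWORDS)]
```

Running this over both racdb1 and racdb2 alert logs gives the same timeline assembled by hand below.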
Node 1 alert log:
2023-11-06T22:23:49.624237+08:00
IPC Send timeout to 2.2 inc 11 for msg type 36 from opid 61
2023-11-06T22:23:49.626416+08:00
Communications reconfiguration: instance_number 2 from SMON (ospid: 390100)
2023-11-06T22:24:38.081438+08:00
Detected an inconsistent instance membership by instance 1
Evicting instance 2 from cluster
Waiting for instances to leave: 2
2023-11-06T22:24:46.157136+08:00
Increasing priority of 5 RS
Reconfiguration started (old inc 11, new inc 15)
The IPC send timeout had already appeared at 22:23:49, and the communications reconfiguration was requested by SMON, so the SMON trace is the next stop:
*** 2023-11-06T22:23:49.624060+08:00
IPCLW:[0.1]{-}[WAIT]:UTIL: [1699280629623930] ACNH 0x7fb1108ee260 State: 3 SMSN: 1043354486 PKT(1043402479.1188389702) # Pending: 1
IPCLW:[0.2]{-}[WAIT]:UTIL: [1699280629623930] Peer: LMS1.KSMSQ_dlm.363374 AckSeq: 1188389701
IPCLW:[0.3]{-}[WAIT]:UTIL: [1699280629623930] Flags: 0x40000000 IHint: 0x501e46a00000019 THint: 0x5863c1f300000010
IPCLW:[0.4]{-}[WAIT]:UTIL: [1699280629623930] Local Address: 169.254.10.192:33365 Remote Address: 169.254.5.139:18617
IPCLW:[0.5]{-}[WAIT]:UTIL: [1699280629623930] Remote PID: ver 0 flags 1 trans 2 tos 0 opts 0 xdata3 41 xdata2 c135a0d7
IPCLW:[0.6]{-}[WAIT]:UTIL: [1699280629623930] : mmsz 8288 mmr 9200 mms 2 xdata 3a937d6
IPCLW:[0.7]{-}[WAIT]:UTIL: [1699280629623930] IVPort: 40790 TVPort: 14294 IMPT: 21890 RMPT: 65 Pending Sends: No Unacked Sends: Yes
IPCLW:[0.8]{-}[WAIT]:UTIL: [1699280629623930] Send Engine Queued: No sshdl -1 ssts 0 rtts 0 snderrchk 0 creqcnt 1 credits 7/8
IPCLW:[0.9]{-}[WAIT]:UTIL: [1699280629623930] Unackd Messages 1043402478 -> 1043402478. SSEQ 1188389701 Send Time: 0:5:0.653.653784 SMSN # Xmits: 1 EMSN 0:5:0.653.653784
IPCLW:[0.10]{-}[WAIT]:UTIL: [1699280629623930] [0] mbuf 0x7fb11073eaf0 MSN 1043402478 Seq 1188389701 -> 1188389702 Send Time: 0:5:0.653.653784 # XMits: 1
2023-11-06 22:23:49.624 : GSIPC:MSGQCB: msg 0x342388d50 status 1.20, type 36, dest 2.2 seq 0.33419001
2023-11-06 22:23:49.624 : GSIPC:MSGQCB: msg 0x342388d50 send failed inc 11 waited 300653848 usec
2023-11-06 22:23:49.624 : GSIPC:MSGQCB: dest_inc 11 sys_inc 11
A timeout occurred at the IPCLW layer, and its timestamp matches the IPC send timeout in the alert log. The line "GSIPC:MSGQCB: msg 0x342388d50 send failed inc 11 waited 300653848 usec" shows the send had already waited about 300 seconds.
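The waited value in that GSIPC line is in microseconds; a quick conversion confirms it is just past a 300-second send timeout:

```python
# Convert the GSIPC-reported wait from microseconds to seconds to confirm
# it corresponds to the ~300 s IPC send timeout window.
waited_usec = 300653848            # from "send failed inc 11 waited 300653848 usec"
waited_sec = waited_usec / 1_000_000
print(f"waited {waited_sec:.1f} s")   # waited 300.7 s
```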
Now check SMON's activity in ASH:
select program, sql_exec_start, event, p1, p2, count(*)
  from dba_hist_active_sess_history
 where sample_time between to_date('2023-11-06 21:30:00','yyyy-mm-dd hh24:mi:ss')
                       and to_date('2023-11-06 23:30:00','yyyy-mm-dd hh24:mi:ss')
   and sql_id = '679x4qggryd2v'
 group by program, sql_exec_start, event, p1, p2;

PROGRAM                SQL_EXEC_START       EVENT            P1    P2       COUNT(*)
oracle@racdb1 (SMON)   2023/11/6 22:18:48   gc cr failure    3     121167   255
oracle@racdb1 (SMON)   2023/11/6 22:18:48   gc cr request    3     121167   68
SMON had been executing SQL 679x4qggryd2v since 22:18:48, repeatedly requesting the same remote CR block (for gc cr request, P1 is the file# and P2 the block#, so file 3 block 121167 throughout), with a large number of gc cr failure waits. The window from 2023/11/6 22:18:48 to 22:23:49 is exactly the 300-second timeout reported in the trace.
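The ASH window can be checked against the trace arithmetic directly:

```python
from datetime import datetime

# SMON's SQL_EXEC_START from ASH vs. the GSIPC send-failure timestamp:
# the gap should land right around the 300 s IPC send timeout.
start = datetime(2023, 11, 6, 22, 18, 48)   # gc cr request begins (ASH)
fail  = datetime(2023, 11, 6, 22, 23, 49)   # "send failed" in SMON trace
print((fail - start).total_seconds())        # 301.0
```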
The SQL text is:
SELECT smontabv.cnt,
       smontabv.time_mp,
       smontab.scn,
       smontab.num_mappings,
       smontab.tim_scn_map
  FROM smon_scn_time smontab,
       (SELECT max(scn) scnmax,
               count(*) + sum(NVL2(TIM_SCN_MAP, NUM_MAPPINGS, 0)) cnt,
               max(time_mp) time_mp
          FROM smon_scn_time) smontabv
 WHERE smontab.scn = smontabv.scnmax
This is related to smon_scn_time maintenance (the SCN-to-timestamp mapping SMON keeps), but it should not be the primary cause; my judgment is that the primary cause still lies with the private interconnect.
The OSWatcher traceroute data shows a large number of timed-out probes against the remote node:
zzz ***Mon Nov 6 22:18:24 CST 2023
traceroute to racdb1-priv (10.10.15.11), 30 hops max, 60 byte packets
 1 racdb1-priv (10.10.15.11) 0.024 ms 0.007 ms 0.006 ms
traceroute to racdb2-priv (10.10.15.12), 30 hops max, 60 byte packets
 1 racdb2-priv (10.10.15.12) 0.048 ms 0.037 ms 0.060 ms
zzz ***Mon Nov 6 22:18:54 CST 2023
traceroute to racdb1-priv (10.10.15.11), 30 hops max, 60 byte packets
 1 racdb1-priv (10.10.15.11) 0.035 ms 0.006 ms 0.005 ms
traceroute to racdb2-priv (10.10.15.12), 30 hops max, 60 byte packets
 1 racdb2-priv (10.10.15.12) 0.062 ms 0.094 ms 0.088 ms
zzz ***Mon Nov 6 22:19:24 CST 2023
traceroute to racdb1-priv (10.10.15.11), 30 hops max, 60 byte packets
 1 racdb1-priv (10.10.15.11) 0.030 ms 0.006 ms 0.005 ms
traceroute to racdb2-priv (10.10.15.12), 30 hops max, 60 byte packets
 1 racdb2-priv (10.10.15.12) 0.063 ms 0.056 ms *
zzz ***Mon Nov 6 22:19:54 CST 2023
traceroute to racdb1-priv (10.10.15.11), 30 hops max, 60 byte packets
 1 racdb1-priv (10.10.15.11) 0.031 ms 0.007 ms 0.006 ms
traceroute to racdb2-priv (10.10.15.12), 30 hops max, 60 byte packets
 1 racdb2-priv (10.10.15.12) 0.059 ms * *
zzz ***Mon Nov 6 22:20:24 CST 2023
traceroute to racdb1-priv (10.10.15.11), 30 hops max, 60 byte packets
 1 racdb1-priv (10.10.15.11) 0.037 ms 0.011 ms 0.006 ms
traceroute to racdb2-priv (10.10.15.12), 30 hops max, 60 byte packets
 1 * racdb2-priv (10.10.15.12) 0.074 ms 0.049 ms
zzz ***Mon Nov 6 22:20:54 CST 2023
traceroute to racdb1-priv (10.10.15.11), 30 hops max, 60 byte packets
 1 racdb1-priv (10.10.15.11) 0.030 ms 0.005 ms 0.005 ms
traceroute to racdb2-priv (10.10.15.12), 30 hops max, 60 byte packets
 1 racdb2-priv (10.10.15.12) 0.063 ms * *
zzz ***Mon Nov 6 22:21:25 CST 2023
traceroute to racdb1-priv (10.10.15.11), 30 hops max, 60 byte packets
 1 racdb1-priv (10.10.15.11) 0.029 ms 0.006 ms 0.006 ms
traceroute to racdb2-priv (10.10.15.12), 30 hops max, 60 byte packets
 1 racdb2-priv (10.10.15.12) 0.049 ms 0.062 ms 0.089 ms
zzz ***Mon Nov 6 22:21:55 CST 2023
traceroute to racdb1-priv (10.10.15.11), 30 hops max, 60 byte packets
 1 racdb1-priv (10.10.15.11) 0.026 ms 0.006 ms 0.006 ms
traceroute to racdb2-priv (10.10.15.12), 30 hops max, 60 byte packets
 1 racdb2-priv (10.10.15.12) 0.113 ms * *
zzz ***Mon Nov 6 22:22:25 CST 2023
traceroute to racdb1-priv (10.10.15.11), 30 hops max, 60 byte packets
 1 racdb1-priv (10.10.15.11) 0.030 ms 0.008 ms 0.006 ms
traceroute to racdb2-priv (10.10.15.12), 30 hops max, 60 byte packets
 1 racdb2-priv (10.10.15.12) 0.062 ms 0.049 ms 0.045 ms
zzz ***Mon Nov 6 22:22:55 CST 2023
traceroute to racdb1-priv (10.10.15.11), 30 hops max, 60 byte packets
 1 racdb1-priv (10.10.15.11) 0.031 ms 0.006 ms 0.005 ms
traceroute to racdb2-priv (10.10.15.12), 30 hops max, 60 byte packets
 1 * racdb2-priv (10.10.15.12) 0.044 ms 0.093 ms
zzz ***Mon Nov 6 22:23:25 CST 2023
traceroute to racdb1-priv (10.10.15.11), 30 hops max, 60 byte packets
 1 racdb1-priv (10.10.15.11) 0.027 ms 0.006 ms 0.005 ms
traceroute to racdb2-priv (10.10.15.12), 30 hops max, 60 byte packets
 1 racdb2-priv (10.10.15.12) 0.095 ms 0.078 ms 0.053 ms
zzz ***Mon Nov 6 22:23:55 CST 2023
traceroute to racdb1-priv (10.10.15.11), 30 hops max, 60 byte packets
 1 racdb1-priv (10.10.15.11) 0.024 ms 0.007 ms 0.006 ms
traceroute to racdb2-priv (10.10.15.12), 30 hops max, 60 byte packets
 1 * racdb2-priv (10.10.15.12) 0.082 ms 0.084 ms
zzz ***Mon Nov 6 22:24:25 CST 2023
traceroute to racdb1-priv (10.10.15.11), 30 hops max, 60 byte packets
 1 racdb1-priv (10.10.15.11) 0.029 ms 0.006 ms 0.006 ms
traceroute to racdb2-priv (10.10.15.12), 30 hops max, 60 byte packets
 1 racdb2-priv (10.10.15.12) 0.057 ms 0.145 ms *
zzz ***Mon Nov 6 22:24:55 CST 2023
traceroute to racdb1-priv (10.10.15.11), 30 hops max, 60 byte packets
 1 racdb1-priv (10.10.15.11) 0.032 ms 0.006 ms 0.005 ms
traceroute to racdb2-priv (10.10.15.12), 30 hops max, 60 byte packets
 1 racdb2-priv (10.10.15.12) 0.098 ms 0.091 ms 0.085 ms
zzz ***Mon Nov 6 22:25:25 CST 2023
traceroute to racdb1-priv (10.10.15.11), 30 hops max, 60 byte packets
 1 racdb1-priv (10.10.15.11) 0.032 ms 0.006 ms 0.006 ms
traceroute to racdb2-priv (10.10.15.12), 30 hops max, 60 byte packets
 1 racdb2-priv (10.10.15.12) 0.058 ms 0.088 ms 0.041 ms
zzz ***Mon Nov 6 22:25:55 CST 2023
traceroute to racdb1-priv (10.10.15.11), 30 hops max, 60 byte packets
 1 racdb1-priv (10.10.15.11) 0.033 ms 0.007 ms 0.006 ms
traceroute to racdb2-priv (10.10.15.12), 30 hops max, 60 byte packets
 1 racdb2-priv (10.10.15.12) 0.056 ms 0.055 ms 0.049 ms
zzz ***Mon Nov 6 22:26:25 CST 2023
traceroute to racdb1-priv (10.10.15.11), 30 hops max, 60 byte packets
 1 racdb1-priv (10.10.15.11) 0.028 ms 0.006 ms 0.005 ms
traceroute to racdb2-priv (10.10.15.12), 30 hops max, 60 byte packets
 1 racdb2-priv (10.10.15.12) 0.066 ms 0.059 ms 0.066 ms
zzz ***Mon Nov 6 22:26:55 CST 2023
traceroute to racdb1-priv (10.10.15.11), 30 hops max, 60 byte packets
 1 racdb1-priv (10.10.15.11) 0.038 ms 0.006 ms 0.006 ms
traceroute to racdb2-priv (10.10.15.12), 30 hops max, 60 byte packets
 1 racdb2-priv (10.10.15.12) 0.144 ms * 0.107 ms
Nothing abnormal was found in the OS messages log. Even so, the private network clearly has a problem: a search of the alert logs turns up IPC timeouts at many different points in time. The final recommendation was to check the switch, the fiber cabling, and the network cards.
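When eyeballing the OSW traceroute archive gets tedious, the lost probes (each `*` is a probe that got no reply) can be tallied per sample. A small sketch, assuming the OSWatcher-style format above where each sample starts with a `zzz ***<timestamp>` header:

```python
def count_lost_probes(osw_text: str) -> dict:
    """Count unanswered traceroute probes ('*') per OSW sample.

    Assumes each sample begins with a 'zzz ***<timestamp>' header,
    followed by traceroute hop lines in which every '*' marks a
    probe that received no reply.
    """
    losses, stamp = {}, None
    for line in osw_text.splitlines():
        if line.startswith("zzz ***"):
            stamp = line[len("zzz ***"):].strip()
            losses[stamp] = 0
        elif stamp is not None:
            losses[stamp] += line.count("*")
    return losses

sample = """zzz ***Mon Nov 6 22:19:54 CST 2023
traceroute to racdb2-priv (10.10.15.12), 30 hops max, 60 byte packets
 1 racdb2-priv (10.10.15.12) 0.059 ms * *
"""
print(count_lost_probes(sample))  # {'Mon Nov 6 22:19:54 CST 2023': 2}
```

Run against the full archive, any sample with a nonzero count marks a 30-second window in which the interconnect dropped probes, which lines up with the IPC send timeouts seen in the alert logs.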