This case comes from a customer in the western region. The database is a 2-node RAC running 19.16; the DB instance on node 2 was evicted with ORA-29740.
Since node 2 was the one evicted, start with node 2's alert log:
2023-11-06T22:24:38.077225+08:00
Detected an inconsistent instance membership by instance 1
Errors in file /oracle/app/oracle/diag/rdbms/racdb/racdb2/trace/racdb2_lmon_363356.trc (incident=4320189):
ORA-29740: evicted by instance number 1, group incarnation 13
Incident details in: /oracle/app/oracle/diag/rdbms/racdb/racdb2/incident/incdir_4320189/racdb2_lmon_363356_i4320189.trc
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
2023-11-06T22:24:38.976251+08:00
Errors in file /oracle/app/oracle/diag/rdbms/racdb/racdb2/trace/racdb2_lmon_363356.trc:
ORA-29740: evicted by instance number 1, group incarnation 13
Errors in file /oracle/app/oracle/diag/rdbms/racdb/racdb2/trace/racdb2_lmon_363356.trc (incident=4320190):
ORA-29740 [] [] [] [] [] [] [] [] [] [] [] []
Incident details in: /oracle/app/oracle/diag/rdbms/racdb/racdb2/incident/incdir_4320190/racdb2_lmon_363356_i4320190.trc
2023-11-06T22:24:39.693332+08:00
LMON (ospid: 363356): terminating the instance due to ORA error 29740
At 22:24:38 node 2 was evicted by node 1, and LMON produced a trace file. I did not bother reading it, because at the CGS layer the eviction of a DB instance is itself carried out by the LMON process. Since it was node 1 that evicted node 2, node 1's alert log must not be skipped.
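When triaging an eviction like this, the first pass over both nodes' alert logs only needs the membership- and IPC-related lines. A minimal sketch (the keyword list is drawn from the messages seen in this case; the path is whatever alert log you point it at):

```python
# Filter an Oracle alert log down to eviction / interconnect messages.
# KEYWORDS are taken from this incident's logs; extend as needed.
KEYWORDS = ("ORA-29740", "IPC Send timeout", "Evicting instance",
            "inconsistent instance membership", "Reconfiguration started")

def scan_alert(path: str) -> list[str]:
    """Return alert-log lines that mention eviction or interconnect trouble."""
    with open(path, errors="replace") as f:
        return [ln.rstrip() for ln in f if any(k in ln for k in KEYWORDS)]
```

Running this over both racdb1 and racdb2 alert logs gives the same timeline assembled by hand below.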
Node 1 alert log:
2023-11-06T22:23:49.624237+08:00
IPC Send timeout to 2.2 inc 11 for msg type 36 from opid 61
2023-11-06T22:23:49.626416+08:00
Communications reconfiguration: instance_number 2 from SMON (ospid: 390100)
2023-11-06T22:24:38.081438+08:00
Detected an inconsistent instance membership by instance 1
Evicting instance 2 from cluster
Waiting for instances to leave: 2
2023-11-06T22:24:46.157136+08:00
Increasing priority of 5 RS
Reconfiguration started (old inc 11, new inc 15)
The IPC send timeout had already appeared at 22:23:49, and the communications reconfiguration was requested by SMON, so the SMON trace is the next stop:
*** 2023-11-06T22:23:49.624060+08:00
IPCLW:[0.1]{-}[WAIT]:UTIL: [1699280629623930] ACNH 0x7fb1108ee260 State: 3 SMSN: 1043354486 PKT(1043402479.1188389702) # Pending: 1
IPCLW:[0.2]{-}[WAIT]:UTIL: [1699280629623930] Peer: LMS1.KSMSQ_dlm.363374 AckSeq: 1188389701
IPCLW:[0.3]{-}[WAIT]:UTIL: [1699280629623930] Flags: 0x40000000 IHint: 0x501e46a00000019 THint: 0x5863c1f300000010
IPCLW:[0.4]{-}[WAIT]:UTIL: [1699280629623930] Local Address: 169.254.10.192:33365 Remote Address: 169.254.5.139:18617
IPCLW:[0.5]{-}[WAIT]:UTIL: [1699280629623930] Remote PID: ver 0 flags 1 trans 2 tos 0 opts 0 xdata3 41 xdata2 c135a0d7
IPCLW:[0.6]{-}[WAIT]:UTIL: [1699280629623930] : mmsz 8288 mmr 9200 mms 2 xdata 3a937d6
IPCLW:[0.7]{-}[WAIT]:UTIL: [1699280629623930] IVPort: 40790 TVPort: 14294 IMPT: 21890 RMPT: 65 Pending Sends: No Unacked Sends: Yes
IPCLW:[0.8]{-}[WAIT]:UTIL: [1699280629623930] Send Engine Queued: No sshdl -1 ssts 0 rtts 0 snderrchk 0 creqcnt 1 credits 7/8
IPCLW:[0.9]{-}[WAIT]:UTIL: [1699280629623930] Unackd Messages 1043402478 -> 1043402478. SSEQ 1188389701 Send Time: 0:5:0.653.653784 SMSN # Xmits: 1 EMSN 0:5:0.653.653784
IPCLW:[0.10]{-}[WAIT]:UTIL: [1699280629623930] [0] mbuf 0x7fb11073eaf0 MSN 1043402478 Seq 1188389701 -> 1188389702 Send Time: 0:5:0.653.653784 # XMits: 1
2023-11-06 22:23:49.624 : GSIPC:MSGQCB: msg 0x342388d50 status 1.20, type 36, dest 2.2 seq 0.33419001
2023-11-06 22:23:49.624 : GSIPC:MSGQCB: msg 0x342388d50 send failed inc 11 waited 300653848 usec
2023-11-06 22:23:49.624 : GSIPC:MSGQCB: dest_inc 11 sys_inc 11
A timeout occurred at the IPCLW layer, and its timestamp matches the IPC send timeout in the alert log. The line "GSIPC:MSGQCB: msg 0x342388d50 send failed inc 11 waited 300653848 usec" shows the send had already waited about 300 seconds.
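The waited value in that GSIPC line is in microseconds; a quick conversion confirms it is just past a 300-second send timeout:

```python
# Convert the GSIPC-reported wait from microseconds to seconds to confirm
# it corresponds to the ~300 s IPC send timeout window.
waited_usec = 300653848            # from "send failed inc 11 waited 300653848 usec"
waited_sec = waited_usec / 1_000_000
print(f"waited {waited_sec:.1f} s")   # waited 300.7 s
```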
Now check SMON's activity in ASH:
select program, sql_exec_start, event, p1, p2, count(*)
  from dba_hist_active_sess_history
 where sample_time between to_date('2023-11-06 21:30:00','yyyy-mm-dd hh24:mi:ss')
                       and to_date('2023-11-06 23:30:00','yyyy-mm-dd hh24:mi:ss')
   and sql_id = '679x4qggryd2v'
 group by program, sql_exec_start, event, p1, p2;

PROGRAM                SQL_EXEC_START       EVENT            P1    P2       COUNT(*)
oracle@racdb1 (SMON)   2023/11/6 22:18:48   gc cr failure    3     121167   255
oracle@racdb1 (SMON)   2023/11/6 22:18:48   gc cr request    3     121167   68
SMON had been executing SQL 679x4qggryd2v since 22:18:48, repeatedly requesting the same remote CR block (for gc cr request, P1 is the file# and P2 the block#, so file 3 block 121167 throughout), with a large number of gc cr failure waits. The window from 2023/11/6 22:18:48 to 22:23:49 is exactly the 300-second timeout reported in the trace.
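The ASH window can be checked against the trace arithmetic directly:

```python
from datetime import datetime

# SMON's SQL_EXEC_START from ASH vs. the GSIPC send-failure timestamp:
# the gap should land right around the 300 s IPC send timeout.
start = datetime(2023, 11, 6, 22, 18, 48)   # gc cr request begins (ASH)
fail  = datetime(2023, 11, 6, 22, 23, 49)   # "send failed" in SMON trace
print((fail - start).total_seconds())        # 301.0
```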
The SQL text is:
SELECT smontabv.cnt,
       smontabv.time_mp,
       smontab.scn,
       smontab.num_mappings,
       smontab.tim_scn_map
  FROM smon_scn_time smontab,
       (SELECT max(scn) scnmax,
               count(*) + sum(NVL2(TIM_SCN_MAP, NUM_MAPPINGS, 0)) cnt,
               max(time_mp) time_mp
          FROM smon_scn_time) smontabv
 WHERE smontab.scn = smontabv.scnmax
This is related to smon_scn_time maintenance (the SCN-to-timestamp mapping SMON keeps), but it should not be the primary cause; my judgment is that the primary cause still lies with the private interconnect.
The OSWatcher traceroute data shows a large number of timed-out probes against the remote node:
zzz ***Mon Nov 6 22:18:24 CST 2023
traceroute to racdb1-priv (10.10.15.11), 30 hops max, 60 byte packets
 1 racdb1-priv (10.10.15.11) 0.024 ms 0.007 ms 0.006 ms
traceroute to racdb2-priv (10.10.15.12), 30 hops max, 60 byte packets
 1 racdb2-priv (10.10.15.12) 0.048 ms 0.037 ms 0.060 ms
zzz ***Mon Nov 6 22:18:54 CST 2023
traceroute to racdb1-priv (10.10.15.11), 30 hops max, 60 byte packets
 1 racdb1-priv (10.10.15.11) 0.035 ms 0.006 ms 0.005 ms
traceroute to racdb2-priv (10.10.15.12), 30 hops max, 60 byte packets
 1 racdb2-priv (10.10.15.12) 0.062 ms 0.094 ms 0.088 ms
zzz ***Mon Nov 6 22:19:24 CST 2023
traceroute to racdb1-priv (10.10.15.11), 30 hops max, 60 byte packets
 1 racdb1-priv (10.10.15.11) 0.030 ms 0.006 ms 0.005 ms
traceroute to racdb2-priv (10.10.15.12), 30 hops max, 60 byte packets
 1 racdb2-priv (10.10.15.12) 0.063 ms 0.056 ms *
zzz ***Mon Nov 6 22:19:54 CST 2023
traceroute to racdb1-priv (10.10.15.11), 30 hops max, 60 byte packets
 1 racdb1-priv (10.10.15.11) 0.031 ms 0.007 ms 0.006 ms
traceroute to racdb2-priv (10.10.15.12), 30 hops max, 60 byte packets
 1 racdb2-priv (10.10.15.12) 0.059 ms * *
zzz ***Mon Nov 6 22:20:24 CST 2023
traceroute to racdb1-priv (10.10.15.11), 30 hops max, 60 byte packets
 1 racdb1-priv (10.10.15.11) 0.037 ms 0.011 ms 0.006 ms
traceroute to racdb2-priv (10.10.15.12), 30 hops max, 60 byte packets
 1 * racdb2-priv (10.10.15.12) 0.074 ms 0.049 ms
zzz ***Mon Nov 6 22:20:54 CST 2023
traceroute to racdb1-priv (10.10.15.11), 30 hops max, 60 byte packets
 1 racdb1-priv (10.10.15.11) 0.030 ms 0.005 ms 0.005 ms
traceroute to racdb2-priv (10.10.15.12), 30 hops max, 60 byte packets
 1 racdb2-priv (10.10.15.12) 0.063 ms * *
zzz ***Mon Nov 6 22:21:25 CST 2023
traceroute to racdb1-priv (10.10.15.11), 30 hops max, 60 byte packets
 1 racdb1-priv (10.10.15.11) 0.029 ms 0.006 ms 0.006 ms
traceroute to racdb2-priv (10.10.15.12), 30 hops max, 60 byte packets
 1 racdb2-priv (10.10.15.12) 0.049 ms 0.062 ms 0.089 ms
zzz ***Mon Nov 6 22:21:55 CST 2023
traceroute to racdb1-priv (10.10.15.11), 30 hops max, 60 byte packets
 1 racdb1-priv (10.10.15.11) 0.026 ms 0.006 ms 0.006 ms
traceroute to racdb2-priv (10.10.15.12), 30 hops max, 60 byte packets
 1 racdb2-priv (10.10.15.12) 0.113 ms * *
zzz ***Mon Nov 6 22:22:25 CST 2023
traceroute to racdb1-priv (10.10.15.11), 30 hops max, 60 byte packets
 1 racdb1-priv (10.10.15.11) 0.030 ms 0.008 ms 0.006 ms
traceroute to racdb2-priv (10.10.15.12), 30 hops max, 60 byte packets
 1 racdb2-priv (10.10.15.12) 0.062 ms 0.049 ms 0.045 ms
zzz ***Mon Nov 6 22:22:55 CST 2023
traceroute to racdb1-priv (10.10.15.11), 30 hops max, 60 byte packets
 1 racdb1-priv (10.10.15.11) 0.031 ms 0.006 ms 0.005 ms
traceroute to racdb2-priv (10.10.15.12), 30 hops max, 60 byte packets
 1 * racdb2-priv (10.10.15.12) 0.044 ms 0.093 ms
zzz ***Mon Nov 6 22:23:25 CST 2023
traceroute to racdb1-priv (10.10.15.11), 30 hops max, 60 byte packets
 1 racdb1-priv (10.10.15.11) 0.027 ms 0.006 ms 0.005 ms
traceroute to racdb2-priv (10.10.15.12), 30 hops max, 60 byte packets
 1 racdb2-priv (10.10.15.12) 0.095 ms 0.078 ms 0.053 ms
zzz ***Mon Nov 6 22:23:55 CST 2023
traceroute to racdb1-priv (10.10.15.11), 30 hops max, 60 byte packets
 1 racdb1-priv (10.10.15.11) 0.024 ms 0.007 ms 0.006 ms
traceroute to racdb2-priv (10.10.15.12), 30 hops max, 60 byte packets
 1 * racdb2-priv (10.10.15.12) 0.082 ms 0.084 ms
zzz ***Mon Nov 6 22:24:25 CST 2023
traceroute to racdb1-priv (10.10.15.11), 30 hops max, 60 byte packets
 1 racdb1-priv (10.10.15.11) 0.029 ms 0.006 ms 0.006 ms
traceroute to racdb2-priv (10.10.15.12), 30 hops max, 60 byte packets
 1 racdb2-priv (10.10.15.12) 0.057 ms 0.145 ms *
zzz ***Mon Nov 6 22:24:55 CST 2023
traceroute to racdb1-priv (10.10.15.11), 30 hops max, 60 byte packets
 1 racdb1-priv (10.10.15.11) 0.032 ms 0.006 ms 0.005 ms
traceroute to racdb2-priv (10.10.15.12), 30 hops max, 60 byte packets
 1 racdb2-priv (10.10.15.12) 0.098 ms 0.091 ms 0.085 ms
zzz ***Mon Nov 6 22:25:25 CST 2023
traceroute to racdb1-priv (10.10.15.11), 30 hops max, 60 byte packets
 1 racdb1-priv (10.10.15.11) 0.032 ms 0.006 ms 0.006 ms
traceroute to racdb2-priv (10.10.15.12), 30 hops max, 60 byte packets
 1 racdb2-priv (10.10.15.12) 0.058 ms 0.088 ms 0.041 ms
zzz ***Mon Nov 6 22:25:55 CST 2023
traceroute to racdb1-priv (10.10.15.11), 30 hops max, 60 byte packets
 1 racdb1-priv (10.10.15.11) 0.033 ms 0.007 ms 0.006 ms
traceroute to racdb2-priv (10.10.15.12), 30 hops max, 60 byte packets
 1 racdb2-priv (10.10.15.12) 0.056 ms 0.055 ms 0.049 ms
zzz ***Mon Nov 6 22:26:25 CST 2023
traceroute to racdb1-priv (10.10.15.11), 30 hops max, 60 byte packets
 1 racdb1-priv (10.10.15.11) 0.028 ms 0.006 ms 0.005 ms
traceroute to racdb2-priv (10.10.15.12), 30 hops max, 60 byte packets
 1 racdb2-priv (10.10.15.12) 0.066 ms 0.059 ms 0.066 ms
zzz ***Mon Nov 6 22:26:55 CST 2023
traceroute to racdb1-priv (10.10.15.11), 30 hops max, 60 byte packets
 1 racdb1-priv (10.10.15.11) 0.038 ms 0.006 ms 0.006 ms
traceroute to racdb2-priv (10.10.15.12), 30 hops max, 60 byte packets
 1 racdb2-priv (10.10.15.12) 0.144 ms * 0.107 ms
Nothing abnormal was found in the OS messages log. Even so, the private network clearly has a problem: a search of the alert logs turns up IPC timeouts at many different points in time. The final recommendation was to check the switch, the fiber cabling, and the network cards.
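When eyeballing the OSW traceroute archive gets tedious, the lost probes (each `*` is a probe that got no reply) can be tallied per sample. A small sketch, assuming the OSWatcher-style format above where each sample starts with a `zzz ***<timestamp>` header:

```python
def count_lost_probes(osw_text: str) -> dict:
    """Count unanswered traceroute probes ('*') per OSW sample.

    Assumes each sample begins with a 'zzz ***<timestamp>' header,
    followed by traceroute hop lines in which every '*' marks a
    probe that received no reply.
    """
    losses, stamp = {}, None
    for line in osw_text.splitlines():
        if line.startswith("zzz ***"):
            stamp = line[len("zzz ***"):].strip()
            losses[stamp] = 0
        elif stamp is not None:
            losses[stamp] += line.count("*")
    return losses

sample = """zzz ***Mon Nov 6 22:19:54 CST 2023
traceroute to racdb2-priv (10.10.15.12), 30 hops max, 60 byte packets
 1 racdb2-priv (10.10.15.12) 0.059 ms * *
"""
print(count_lost_probes(sample))  # {'Mon Nov 6 22:19:54 CST 2023': 2}
```

Run against the full archive, any sample with a nonzero count marks a 30-second window in which the interconnect dropped probes, which lines up with the IPC send timeouts seen in the alert logs.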