本案例来自西区某客户,数据库版本为12.2,6节点rac,当时应用反应异常缓慢。十万火急,需要立马解决。当我接手时,故障已经持续了3个小时了。
当时查询gv$session时,发现大量会话的final blocking session都是6节点的LCK1进程,该进程等待事件为libcache interrupt action by LCK。由于时间紧急,果断的abort掉了6节点的实例,数据库立马恢复了正常。
当分析原因的时候,发现当时数据库hang得非常严重,连ash和awr都没有产生。但是从diag/dia0/lmhb trace中可以发现一些有价值的信息。
首先LCK是rac非常核心的进程,用于同步GES层面的一些实例级别的lock信息,比如row cache、library cache。而lmbh正好会监控这些核心rac核心进程。
从lmhb trace可以看到:
1 2 3 4 5 6 7 8 |
*** 2022-07-31T21:10:01.053896+08:00 (CDB$ROOT(1)) ============================== LCK1 (ospid: 52025) has not moved for 14 sec (1659273000.1659272986) ... ... *** 2022-07-31T23:25:32.096842+08:00 (CDB$ROOT(1)) ============================== LCK1 (ospid: 52025) has not moved for 8146 sec (1659281132.1659272986) |
LCK1进程已经hang住了,并且持续了很久。并且lmhb中还记录了LCK1的process信息,包含了call stack。
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 |
*** 2022-07-31T21:10:11.326435+08:00 (CDB$ROOT(1)) ============================== LCK1 (ospid: 52025) has not moved for 24 sec (1659273010.1659272986) : heartbeat check status 2 (acceptable) (threshold 70 sec) : heartbeat poke time 0x62e67f24 req 0x120ca4 ack 0x120ca3 freq 10 : heartbeat state 0x41.02 (libcache) pso-flag 0x0 : waiting for event 'libcache interrupt action by LCK' for 19 secs with wait_id 841902246. ===[ Wait Chain ]=== Wait chain is empty. ============================== Dumping PROCESS LCK1 (ospid: 52025) States ============================== ===[ Callstack ]=== *** 2022-07-31T21:10:11.326637+08:00 (CDB$ROOT(1)) Process diagnostic dump for oracle@gapdb4 (LCK1), OS id=52025, pid: 31, proc_ser: 1, sid: 5829, sess_ser: 13529 ------------------------------------------------------------------------------- os thread scheduling delay history: (sampling every 1.000000 secs) 0.000000 secs at [ 21:10:10 ] NOTE: scheduling delay has not been sampled for 0.807660 secs 0.000000 secs from [ 21:10:06 - 21:10:11 ], 5 sec avg 0.000000 secs from [ 21:09:12 - 21:10:11 ], 1 min avg 0.000000 secs from [ 21:05:11 - 21:10:11 ], 5 min avg *** 2022-07-31T21:10:13.735792+08:00 (CDB$ROOT(1)) loadavg : 8.68 7.66 7.18 System user time: 0.11 sys time: 0.01 context switch: 89486 Memory (Avail / Total) = 11748.10M / 515601.66M Swap (Avail / Total) = 49147.25M / 49152.00M F S UID PID PPID C PRI NI ADDR SZ WCHAN STIME TTY TIME CMD 0 R oracle 52025 1 2 80 0 - 51757151 - Mar16 ? 2-18:57:53 ora_lck1_paascdba6 Name: ora_lck1_paascd State: R (running) Tgid: 52025 Ngid: 0 Pid: 52025 PPid: 1 TracerPid: 0 Uid: 1100 1100 1100 1100 Gid: 1101 1100 1100 1100 FDSize: 64 Groups: 1100 1101 NStgid: 52025 NSpid: 52025 NSpgid: 52025 NSsid: 52025 VmPeak: 207028604 kB VmSize: 207028604 kB VmLck: 0 kB VmPin: 0 kB VmHWM: 67848 kB VmRSS: 67784 kB VmData: 6264 kB VmStk: 152 kB VmExe: 343608 kB VmLib: 33008 kB VmPTE: 568 kB VmPMD: 168 kB VmSwap: 0 kB Threads: 1 SigQ: 0/2062272 SigPnd: 0000000000000000 ShdPnd: 0000000000000000 SigBlk: 0000000000000000 SigIgn: 0000000016400207 SigCgt: 00000003c9887cf8 CapInh: 0000000000000000 CapPrm: 0000000000000000 CapEff: 0000000000000000 CapBnd: 0000003fffffffff Seccomp: 0 Speculation_Store_Bypass: thread vulnerable Cpus_allowed: ffff,ffffffff,ffffffff Cpus_allowed_list: 0-79 Mems_allowed: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,0000000f Mems_allowed_list: 0-3 voluntary_ctxt_switches: 4999124562 nonvoluntary_ctxt_switches: 52927652 Short stack dump: ksedsts()+346<-ksdxfstk()+71<-ksdxcb()+912<-sspuser()+217<-__sighandler()<-kglobscfix_callback()+64<-kglScanReferences()+222<-kglHandleMessage()+118<-kqlmbivg_cbk()+1647<-kqlmbivg()+391<-kqlmba()+513<-ksbcti()+247<-ksbabs()+551<-ksbrdp()+1079<-opirip()+609<-opidrv()+602<-sou2o()+145<-opimai_real()+202<-ssthrdmain()+417<-main()+262<-__libc_start_main()+245 |
从loadavg,包括当时cpu使用率,内存使用率来看,当时负载其实很低,并且LCK1进程处于R状态。call stack为:kglobscfix_callback()+64<-kglScanReferences()+222<-kglHandleMessage()+118<-kqlmbivg_cbk()+1647<-kqlmbivg()+391<-kqlmba()。负载很低的情况下,LCK1 hang住那么很可能是一个bug。
分析diag trace的systemtate dump,可以找到LCK1进程的session state dump信息。
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 |
SO: 0x9e4987468, type: 4, owner: 0xa03c72a88, flag: INIT/-/-/-/0x00 i f: 0x3 c: 0x3 proc=0xa03c72a88, name=session, file=ksu.h LINE:15737, pg=0, conuid=1 SGA version=(1,0) (session) sid: 5829 ser: 13529 trans: (nil), creator: 0xa03c72a88 flags: (0x51) USR/- flags2: (0x409) -/-/INC flags_idl: (0x1) status: BSY/-/-/- kill: -/-/-/- 0000-0000-000000000000-0000-00000000 DID: , short-term DID: txn branch: (nil) con_id/con_uid/con_name: 1/1/CDB$ROOT con_logonuid: 1 con_logonid: 1 con_scuid: 1 con_scid: 1 edition#: 0 user#/name: 0/SYS oct: 0, prv: 0, sql: (nil), psql: (nil) stats: 0x97fe0b988, PX stats: 0x110e9744 service name: SYS$BACKGROUND Current Wait Stack: 0: waiting for 'libcache interrupt action by LCK' location=0x1, =0x0, =0x0 wait_id=841902246 seq_num=14388 snap_id=1 wait times: snap=16 min 2 sec, exc=16 min 2 sec, total=16 min 2 sec wait times: max=infinite, heur=16 min 2 sec wait counts: calls=0 os=0 in_wait=1 iflags=0x5a2 There are 565 sessions blocked by this session. Dumping one waiter: inst: 1, sid: 13726, ser: 15144 wait event: 'enq: IV - contention' p1: 'type|mode'=0x49560005 p2: 'id1'=0x53594e43 p3: 'id2'=0x80 row_wait_obj#: 4294967295, block#: 0, row#: 0, file# 0 min_blocked_time: 949 secs, waiter_cache_ver: 51635 Wait State: fixed_waits=0 flags=0x22 boundary=(nil)/-1 Session Wait History: elapsed time of 0.000002 sec since current wait 0: waited for 'rdbms ipc message' timeout=0xfd, =0x0, =0x0 wait_id=841902245 seq_num=14387 snap_id=1 wait times: snap=0.000318 sec, exc=0.000318 sec, total=0.000318 sec wait times: max=2.530000 sec wait counts: calls=1 os=1 occurred after 0.000001 sec of elapsed time 1: waited for 'libcache interrupt action by LCK' location=0x1, =0x0, =0x0 wait_id=841902244 seq_num=14386 snap_id=1 wait times: snap=0.000001 sec, exc=0.000001 sec, total=0.000001 sec wait times: max=infinite wait counts: calls=0 os=0 occurred after 0.000002 sec of elapsed time 2: waited for 'rdbms ipc message' timeout=0xfd, =0x0, =0x0 wait_id=841902243 seq_num=14385 snap_id=1 wait times: snap=0.000146 sec, exc=0.000146 sec, total=0.000146 sec wait times: max=2.530000 sec wait counts: calls=1 os=1 occurred after 0.000002 sec of elapsed time 3: waited for 'libcache interrupt action by LCK' location=0x1, =0x0, =0x0 wait_id=841902241 seq_num=14384 snap_id=2 wait times: snap=0.000048 sec, exc=0.000079 sec, total=0.000363 sec wait times: max=infinite wait counts: calls=0 os=0 occurred after 0.000000 sec of elapsed time 4: waited for 'enq: IV - contention' type|mode=0x49560003, id1=0x53594e43, id2=0x80 wait_id=841902242 seq_num=14383 snap_id=1 wait times: snap=0.000284 sec, exc=0.000284 sec, total=0.000284 sec wait times: max=infinite wait counts: calls=3 os=3 occurred after 0.000000 sec of elapsed time 5: waited for 'libcache interrupt action by LCK' location=0x1, =0x0, =0x0 wait_id=841902241 seq_num=14382 snap_id=1 wait times: snap=0.000031 sec, exc=0.000031 sec, total=0.000031 sec wait times: max=infinite wait counts: calls=0 os=0 occurred after 0.000002 sec of elapsed time 6: waited for 'rdbms ipc message' timeout=0xfd, =0x0, =0x0 wait_id=841902240 seq_num=14381 snap_id=1 wait times: snap=0.000436 sec, exc=0.000436 sec, total=0.000436 sec wait times: max=2.530000 sec wait counts: calls=1 os=1 occurred after 0.000001 sec of elapsed time 7: waited for 'libcache interrupt action by LCK' location=0x1, =0x0, =0x0 wait_id=841902239 seq_num=14380 snap_id=1 wait times: snap=0.000001 sec, exc=0.000001 sec, total=0.000001 sec wait times: max=infinite wait counts: calls=0 os=0 occurred after 0.000002 sec of elapsed time 8: waited for 'rdbms ipc message' timeout=0xfd, =0x0, =0x0 wait_id=841902238 seq_num=14379 snap_id=1 wait times: snap=0.000097 sec, exc=0.000097 sec, total=0.000097 sec wait times: max=2.530000 sec wait counts: calls=1 os=1 occurred after 0.000001 sec of elapsed time 9: waited for 'libcache interrupt action by LCK' location=0x1, =0x0, =0x0 wait_id=841902237 seq_num=14378 snap_id=1 wait times: snap=0.000001 sec, exc=0.000001 sec, total=0.000001 sec wait times: max=infinite wait counts: calls=0 os=0 occurred after 0.000002 sec of elapsed time Sampled Session History of session 5829 serial 13529 --------------------------------------------------- The sampled session history is constructed by sampling the target session every 1 second. The sampling process captures at each sample if the session is in a non-idle wait, an idle wait, or not in a wait. If the session is in a non-idle wait then one interval is shown for all the samples the session was in the same non-idle wait. If the session is in an idle wait or not in a wait for consecutive samples then one interval is shown for all the consecutive samples. Though we display these consecutive samples in a single interval the session may NOT be continuously idle or not in a wait (the sampling process does not know). The history is displayed in reverse chronological order. sample interval: 1 sec, max history 120 sec --------------------------------------------------- [120 samples, 21:23:54 - 21:25:54] waited for 'libcache interrupt action by LCK', seq_num: 14388 p1: 'location'=0x1 p2: ''=0x0 p3: ''=0x0 time_waited: >= 120 sec (still in wait) --------------------------------------------------- Sampled Session History Summary: longest_non_idle_wait: 'libcache interrupt action by LCK' [120 samples, 21:23:54 - 21:25:54] time_waited: >= 120 sec (still in wait) --------------------------------------------------- ---------------------------------------- kpoxc: mode=0 static=0 inReq=0 pendErrInRep=0 callUseSCN=0 nonReq=0 kpoxc: trace=0 roundTrip=0 errCode=0 ctl=0 eocs=0 disErr=0 dscnEnabled=0 temporary object counter: 0 last pdbid to increment: 0, last function to increment: (null) ---------------------------------------- Virtual Thread: kgskvt: 0x9e9369820, sess: 0x9e4987468/1 sid: 5829 ser: 13529 vc: (nil), proc: 0xa03c72a88, id: 5829 consumer group cur: _ORACLE_BACKGROUND_GROUP_/1 (upd? 0), mapped: _ORACLE_BACKGROUND_GROUP_, orig: vt_state: 0x100, vt_flags: 0x4030, blkrun: 0, numa: 1 inwait: 0, short wait event: 0 posted_run: 0 lastmodrngcnt: 0, lastmodrngcnt_loc: '(null)' lastmodrblcnt: 0, lastmodrblcnt_loc: '(null)' location where insched last set: kgskthrrun location where insched last cleared: kgskthrrun1 location where inwait last set: NULL location where inwait last cleared: NULL is_assigned: 0, in_sched: 0 (0) qcls: (nil), qlink: FALSE vt_active: 0 (pending: 0) vt_pq_active: 0, dop: 0, pq_servers (cur: 0 cg: 0) ps_allocs: 0, pxstmts (act: 0, done: 0 cg: 0) used quanta (usecs): stmt: 0, accum: 0, mapped: 0, tot: 0 exec start consumed time: 0 exec start elapsed time: 0 idle time: 0, active time: 0 (cg: 0) last updnumps: 0, active time (pq: 0 ps: 0) cpu yields: stmt: 0, accum: 0, mapped: 0, tot: 0 cpu waits: stmt: 0, accum: 0, mapped: 0, tot: 0 cpu wait time (usecs): stmt: 0, accum: 0, mapped: 0, tot: 0 ASL queued time outs: 0, time: 0 (cur 0, cg 0) PQQ queued time outs: 0, time: 0 (cur 0, cg 0) Queue timeout violation: 0 calls aborted: 0, num est exec limit hit: 0 undo current: 0k max: 0k KTU Session Commit Cache Dump for IDLs: KTU Session Commit Cache Dump for Non-IDLs: ---------------------------------------- KKS-UOL used : 0 locks(used=0, free=0) KGX Atomic Operation Log 0x85f5c8388 Mutex (nil)(0, 0) idn 0 oper NONE(0) Cursor Pin uid 5829 efd 7 whr 10 slp 0 KGX Atomic Operation Log 0x85f5c83d8 Mutex (nil)(0, 0) idn 0 oper NONE(0) FSO mutex uid 5829 efd 0 whr 0 slp 0 KGX Atomic Operation Log 0x85f5c8428 Mutex (nil)(0, 0) idn 0 oper NONE(0) FSO mutex uid 5829 efd 0 whr 0 slp 0 KGX Atomic Operation Log 0x85f5c8478 Mutex (nil)(0, 0) idn 0 oper NONE(0) FSO mutex uid 5829 efd 0 whr 0 slp 0 ---------------------------------------- KGL-UOL SO Cache(total=0, free=0) KGX Atomic Operation Log 0x83ff5cd70 Mutex 0x6fec4aa30(5829, 0) idn ccf13f6f oper EXCL(6) Library Cache uid 5829 efd 4 whr 148 slp 0 oper=0 pt1=0x6fec4a8e0 pt2=(nil) pt3=(nil) pt4=(nil) pt5=(nil) ub4=0 flg=0x0 msk=0000-0000-0000-0000-0000 Could not acquire mutex...Returning without dumping KGX Atomic Operation Log 0x83ff5cdd8 Mutex 0x6fec4a990(5829, 0) idn 35a7b2c oper EXCL(6) Library Cache uid 5829 efd 4 whr 101 slp 0 oper=0 pt1=(nil) pt2=(nil) pt3=(nil) pt4=(nil) pt5=(nil) ub4=0 flg=0x0 msk=0000-0000-0000-0000-0000 KGX Atomic Operation Log 0x83ff5ce40 Mutex (nil)(0, 0) idn 0 oper NONE(0) Library Cache uid 5829 efd 65535 whr 134 slp 0 oper=0 pt1=(nil) pt2=(nil) pt3=(nil) pt4=(nil) pt5=(nil) ub4=0 flg=0x0 msk=0000-0000-0000-0000-0000 KGX Atomic Operation Log 0x83ff5cea8 Mutex 0x390c8e228(0, 0) idn 0 oper NONE(0) Library Cache uid 5829 efd 7 whr 163 slp 0 oper=0 pt1=0x390c8e0d8 pt2=(nil) pt3=(nil) pt4=(nil) pt5=(nil) ub4=0 flg=0x0 msk=0000-0000-0000-0000-0000 KGX Atomic Operation Log 0x83ff5cf10 Mutex (nil)(0, 0) idn 0 oper NONE(0) Library Cache uid 5829 efd 0 whr 0 slp 0 oper=0 pt1=(nil) pt2=(nil) pt3=(nil) pt4=(nil) pt5=(nil) ub4=0 flg=0x0 msk=0000-0000-0000-0000-0000 KGX Atomic Operation Log 0x83ff5cf78 Mutex (nil)(0, 0) idn 0 oper NONE(0) Library Cache uid 5829 efd 0 whr 0 slp 0 oper=0 pt1=(nil) pt2=(nil) pt3=(nil) pt4=(nil) pt5=(nil) ub4=0 flg=0x0 msk=0000-0000-0000-0000-0000 KGX Atomic Operation Log 0x83ff5cfe0 Mutex (nil)(0, 0) idn 0 oper NONE(0) Library Cache uid 5829 efd 0 whr 0 slp 0 oper=0 pt1=(nil) pt2=(nil) pt3=(nil) pt4=(nil) pt5=(nil) ub4=0 flg=0x0 msk=0000-0000-0000-0000-0000 KGX Atomic Operation Log 0x83ff5d048 Mutex (nil)(0, 0) idn 0 oper NONE(0) Library Cache uid 5829 efd 0 whr 0 slp 0 oper=0 pt1=(nil) pt2=(nil) pt3=(nil) pt4=(nil) pt5=(nil) ub4=0 flg=0x0 msk=0000-0000-0000-0000-0000 KGX Atomic Operation Log 0x83ff5d0b0 Mutex (nil)(0, 0) idn 0 oper NONE(0) Library Cache uid 5829 efd 0 whr 0 slp 0 oper=0 pt1=(nil) pt2=(nil) pt3=(nil) pt4=(nil) pt5=(nil) ub4=0 flg=0x0 msk=0000-0000-0000-0000-0000 |
可以看到当时LCK1已经以EXCL模式持有了两个mutex,还有一句Could not acquire mutex…Returning without dumping 。没见过,看着像还在请求其他mutex,并且失败。
搜索mos发现与Bug 30384121 – lck process hang causing entire database to hang (Doc ID 30384121.8)非常匹配,包括关键的call stack信息。
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 |
@ I think that we are more likely to be looping in the kglHandleMessage() while @ loop as we might spin if kglobscfix_callback() cannot acquire the mutex. @ RELEASE NOTES: @ ]] It was possible for LCKx to block other sessions when it was trying to @ ]] acquire handle and dependency mutexes. This problem has now been @ ]] corrected. @ ]] @ . @ INTERNAL PROBLEM DESCRIPTION: @ Under rare conditions LCKx might repeatedly execute the following while @ loop @ because it is unable to get a handle mutex in nowait mode under @ kglobscfix_callback(); @ . @ while (kglScanDependents(gp, hd, kglobscfix_callback, (void *)0)); @ . @ and as we immediately re-get the dependency mutex under @ kglScanDependents->kglScanReferences but fail to get the handle mutex under @ kglobscfix_callback() we can block other sessions that try to acquire the @ dependency mutex but can give up and retry (as is the case for CLMN/PMON). @ @ . @ INTERNAL FIX DESCRIPTION: @ The fix is to amend the while loop so that we enter a small wait which @ should @ then allow blocked processes the ability to acquire the relevant mutex. |
bug描述与trace发现的信息几乎完全匹配。基本可以确定就是这个bug引起的故障。
数据库的patch信息为:
1 2 3 4 5 6 7 8 9 |
oracle@gapdb1:/home/oracle> opatch lspatches 28865544; 29123189; 27751755; 25163866; 23333567; 28790651;OJVM RELEASE UPDATE: 12.2.0.1.190115 (28790651) 28870605;OCW JAN 2019 RELEASE UPDATE 12.2.0.1.190115 (28870605) 28822515;Database Jan 2019 Release Update : 12.2.0.1.190115 (28822515) |
而该bug将在12.2的202007这个ru中修复。
over!!!