案例：troubleshooting LCK PROCESS HANG CAUSING ENTIRE DATABASE TO HANG

本案例来自西区某客户，数据库版本为12.2，6节点rac，当时应用反应异常缓慢。十万火急，需要立马解决。当我接手时，故障已经持续了3个小时了。

当时查询gv$session时，发现大量会话的final blocking session都是6节点的LCK1进程，该进程等待事件为libcache interrupt action by LCK。由于时间紧急，果断的abort掉了6节点的实例，数据库立马恢复了正常。

当分析原因的时候，发现当时数据库hang得非常严重，连ash和awr都没有产生。但是从diag/dia0/lmhb trace中可以发现一些有价值的信息。

首先LCK是rac非常核心的进程，用于同步GES层面的一些实例级别的lock信息，比如row cache、library cache。而lmbh正好会监控这些核心rac核心进程。

从lmhb trace可以看到：

*** 2022-07-31T21:10:01.053896+08:00 (CDB$ROOT(1))
==============================
LCK1 (ospid: 52025) has not moved for 14 sec (1659273000.1659272986)
...
...
*** 2022-07-31T23:25:32.096842+08:00 (CDB$ROOT(1))
==============================
LCK1 (ospid: 52025) has not moved for 8146 sec (1659281132.1659272986)

*** 2022-07-31T21:10:01.053896+08:00 (CDB$ROOT(1))

==============================

LCK1 (ospid: 52025) has not moved for 14 sec (1659273000.1659272986)

...

*** 2022-07-31T23:25:32.096842+08:00 (CDB$ROOT(1))

==============================

LCK1 (ospid: 52025) has not moved for 8146 sec (1659281132.1659272986)

LCK1进程已经hang住了，并且持续了很久。并且lmhb中还记录了LCK1的process信息，包含了call stack。

*** 2022-07-31T21:10:11.326435+08:00 (CDB$ROOT(1))
==============================
LCK1 (ospid: 52025) has not moved for 24 sec (1659273010.1659272986)
: heartbeat check status 2 (acceptable) (threshold 70 sec)
: heartbeat poke time 0x62e67f24 req 0x120ca4 ack 0x120ca3 freq 10
: heartbeat state 0x41.02 (libcache) pso-flag 0x0
: waiting for event 'libcache interrupt action by LCK' for 19 secs with wait_id 841902246.
===[ Wait Chain ]===
Wait chain is empty.
==============================
Dumping PROCESS LCK1 (ospid: 52025) States
==============================
===[ Callstack ]===
*** 2022-07-31T21:10:11.326637+08:00 (CDB$ROOT(1))
Process diagnostic dump for oracle@gapdb4 (LCK1), OS id=52025,
pid: 31, proc_ser: 1, sid: 5829, sess_ser: 13529 
-------------------------------------------------------------------------------
os thread scheduling delay history: (sampling every 1.000000 secs)
0.000000 secs at [ 21:10:10 ]
NOTE: scheduling delay has not been sampled for 0.807660 secs
0.000000 secs from [ 21:10:06 - 21:10:11 ], 5 sec avg
0.000000 secs from [ 21:09:12 - 21:10:11 ], 1 min avg
0.000000 secs from [ 21:05:11 - 21:10:11 ], 5 min avg
*** 2022-07-31T21:10:13.735792+08:00 (CDB$ROOT(1))
loadavg : 8.68 7.66 7.18
System user time: 0.11 sys time: 0.01 context switch: 89486
Memory (Avail / Total) = 11748.10M / 515601.66M
Swap (Avail / Total) = 49147.25M /  49152.00M
F S UID        PID  PPID  C PRI  NI ADDR SZ WCHAN  STIME TTY          TIME CMD
0 R oracle   52025     1  2  80   0 - 51757151 -   Mar16 ?        2-18:57:53 ora_lck1_paascdba6
Name:	ora_lck1_paascd
State:	R (running)
Tgid:	52025
Ngid:	0
Pid:	52025
PPid:	1
TracerPid:	0
Uid:	1100	1100	1100	1100
Gid:	1101	1100	1100	1100
FDSize:	64
Groups:	1100 1101 
NStgid:	52025
NSpid:	52025
NSpgid:	52025
NSsid:	52025
VmPeak:	207028604 kB
VmSize:	207028604 kB
VmLck:	       0 kB
VmPin:	       0 kB
VmHWM:	   67848 kB
VmRSS:	   67784 kB
VmData:	    6264 kB
VmStk:	     152 kB
VmExe:	  343608 kB
VmLib:	   33008 kB
VmPTE:	     568 kB
VmPMD:	     168 kB
VmSwap:	       0 kB
Threads:	1
SigQ:	0/2062272
SigPnd:	0000000000000000
ShdPnd:	0000000000000000
SigBlk:	0000000000000000
SigIgn:	0000000016400207
SigCgt:	00000003c9887cf8
CapInh:	0000000000000000
CapPrm:	0000000000000000
CapEff:	0000000000000000
CapBnd:	0000003fffffffff
Seccomp:	0
Speculation_Store_Bypass:	thread vulnerable
Cpus_allowed:	ffff,ffffffff,ffffffff
Cpus_allowed_list:	0-79
Mems_allowed:	00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,0000000f
Mems_allowed_list:	0-3
voluntary_ctxt_switches:	4999124562
nonvoluntary_ctxt_switches:	52927652
Short stack dump: 
ksedsts()+346<-ksdxfstk()+71<-ksdxcb()+912<-sspuser()+217<-__sighandler()<-kglobscfix_callback()+64<-kglScanReferences()+222<-kglHandleMessage()+118<-kqlmbivg_cbk()+1647<-kqlmbivg()+391<-kqlmba()+513<-ksbcti()+247<-ksbabs()+551<-ksbrdp()+1079<-opirip()+609<-opidrv()+602<-sou2o()+145<-opimai_real()+202<-ssthrdmain()+417<-main()+262<-__libc_start_main()+245

*** 2022-07-31T21:10:11.326435+08:00 (CDB$ROOT(1))

==============================

LCK1 (ospid: 52025) has not moved for 24 sec (1659273010.1659272986)

: heartbeat check status 2 (acceptable) (threshold 70 sec)

: heartbeat poke time 0x62e67f24 req 0x120ca4 ack 0x120ca3 freq 10

: heartbeat state 0x41.02 (libcache) pso-flag 0x0

: waiting for event 'libcache interrupt action by LCK' for 19 secs with wait_id 841902246.

===[ Wait Chain ]===

Wait chain is empty.

==============================

Dumping PROCESS LCK1 (ospid: 52025) States

==============================

===[ Callstack ]===

*** 2022-07-31T21:10:11.326637+08:00 (CDB$ROOT(1))

Process diagnostic dump for oracle@gapdb4 (LCK1), OS id=52025,

pid: 31, proc_ser: 1, sid: 5829, sess_ser: 13529

-------------------------------------------------------------------------------

os thread scheduling delay history: (sampling every 1.000000 secs)

0.000000 secs at [ 21:10:10 ]

NOTE: scheduling delay has not been sampled for 0.807660 secs

0.000000 secs from [ 21:10:06 - 21:10:11 ], 5 sec avg

0.000000 secs from [ 21:09:12 - 21:10:11 ], 1 min avg

0.000000 secs from [ 21:05:11 - 21:10:11 ], 5 min avg

*** 2022-07-31T21:10:13.735792+08:00 (CDB$ROOT(1))

loadavg : 8.68 7.66 7.18

System user time: 0.11 sys time: 0.01 context switch: 89486

Memory (Avail / Total) = 11748.10M / 515601.66M

Swap (Avail / Total) = 49147.25M / 49152.00M

F S UID PID PPID C PRI NI ADDR SZ WCHAN STIME TTY TIME CMD

0 R oracle 52025 1 2 80 0 - 51757151 - Mar16 ? 2-18:57:53 ora_lck1_paascdba6

Name: ora_lck1_paascd

State: R (running)

Tgid: 52025

Ngid: 0

Pid: 52025

PPid: 1

TracerPid: 0

Uid: 1100 1100 1100 1100

Gid: 1101 1100 1100 1100

FDSize: 64

Groups: 1100 1101

NStgid: 52025

NSpid: 52025

NSpgid: 52025

NSsid: 52025

VmPeak: 207028604 kB

VmSize: 207028604 kB

VmLck: 0 kB

VmPin: 0 kB

VmHWM: 67848 kB

VmRSS: 67784 kB

VmData: 6264 kB

VmStk: 152 kB

VmExe: 343608 kB

VmLib: 33008 kB

VmPTE: 568 kB

VmPMD: 168 kB

VmSwap: 0 kB

Threads: 1

SigQ: 0/2062272

SigPnd: 0000000000000000

ShdPnd: 0000000000000000

SigBlk: 0000000000000000

SigIgn: 0000000016400207

SigCgt: 00000003c9887cf8

CapInh: 0000000000000000

CapPrm: 0000000000000000

CapEff: 0000000000000000

CapBnd: 0000003fffffffff

Seccomp: 0

Speculation_Store_Bypass: thread vulnerable

Cpus_allowed: ffff,ffffffff,ffffffff

Cpus_allowed_list: 0-79

Mems_allowed: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,0000000f

Mems_allowed_list: 0-3

voluntary_ctxt_switches: 4999124562

nonvoluntary_ctxt_switches: 52927652

Short stack dump:

ksedsts()+346<-ksdxfstk()+71<-ksdxcb()+912<-sspuser()+217<-__sighandler()<-kglobscfix_callback()+64<-kglScanReferences()+222<-kglHandleMessage()+118<-kqlmbivg_cbk()+1647<-kqlmbivg()+391<-kqlmba()+513<-ksbcti()+247<-ksbabs()+551<-ksbrdp()+1079<-opirip()+609<-opidrv()+602<-sou2o()+145<-opimai_real()+202<-ssthrdmain()+417<-main()+262<-__libc_start_main()+245

从loadavg，包括当时cpu使用率，内存使用率来看，当时负载其实很低，并且LCK1进程处于R状态。call stack为：kglobscfix_callback()+64<-kglScanReferences()+222<-kglHandleMessage()+118<-kqlmbivg_cbk()+1647<-kqlmbivg()+391<-kqlmba()。负载很低的情况下，LCK1 hang住那么很可能是一个bug。

分析diag trace的systemtate dump，可以找到LCK1进程的session state dump信息。

SO: 0x9e4987468, type: 4, owner: 0xa03c72a88, flag: INIT/-/-/-/0x00 i f: 0x3 c: 0x3
proc=0xa03c72a88, name=session, file=ksu.h LINE:15737, pg=0, conuid=1
SGA version=(1,0)
(session) sid: 5829 ser: 13529 trans: (nil), creator: 0xa03c72a88
flags: (0x51) USR/- flags2: (0x409) -/-/INC
flags_idl: (0x1) status: BSY/-/-/- kill: -/-/-/-
0000-0000-000000000000-0000-00000000              DID: , short-term DID: 
txn branch: (nil)
con_id/con_uid/con_name: 1/1/CDB$ROOT
con_logonuid: 1 con_logonid: 1
con_scuid: 1 con_scid: 1
edition#: 0              user#/name: 0/SYS
oct: 0, prv: 0, sql: (nil), psql: (nil)
stats: 0x97fe0b988, PX stats: 0x110e9744
service name: SYS$BACKGROUND
Current Wait Stack:
0: waiting for 'libcache interrupt action by LCK'
location=0x1, =0x0, =0x0
wait_id=841902246 seq_num=14388 snap_id=1
wait times: snap=16 min 2 sec, exc=16 min 2 sec, total=16 min 2 sec
wait times: max=infinite, heur=16 min 2 sec
wait counts: calls=0 os=0
in_wait=1 iflags=0x5a2
There are 565 sessions blocked by this session.
Dumping one waiter:
inst: 1, sid: 13726, ser: 15144
wait event: 'enq: IV -  contention'
p1: 'type|mode'=0x49560005
p2: 'id1'=0x53594e43
p3: 'id2'=0x80
row_wait_obj#: 4294967295, block#: 0, row#: 0, file# 0
min_blocked_time: 949 secs, waiter_cache_ver: 51635
Wait State:
fixed_waits=0 flags=0x22 boundary=(nil)/-1
Session Wait History:
elapsed time of 0.000002 sec since current wait
0: waited for 'rdbms ipc message'
timeout=0xfd, =0x0, =0x0
wait_id=841902245 seq_num=14387 snap_id=1
wait times: snap=0.000318 sec, exc=0.000318 sec, total=0.000318 sec
wait times: max=2.530000 sec
wait counts: calls=1 os=1
occurred after 0.000001 sec of elapsed time
1: waited for 'libcache interrupt action by LCK'
location=0x1, =0x0, =0x0
wait_id=841902244 seq_num=14386 snap_id=1
wait times: snap=0.000001 sec, exc=0.000001 sec, total=0.000001 sec
wait times: max=infinite
wait counts: calls=0 os=0
occurred after 0.000002 sec of elapsed time
2: waited for 'rdbms ipc message'
timeout=0xfd, =0x0, =0x0
wait_id=841902243 seq_num=14385 snap_id=1
wait times: snap=0.000146 sec, exc=0.000146 sec, total=0.000146 sec
wait times: max=2.530000 sec
wait counts: calls=1 os=1
occurred after 0.000002 sec of elapsed time
3: waited for 'libcache interrupt action by LCK'
location=0x1, =0x0, =0x0
wait_id=841902241 seq_num=14384 snap_id=2
wait times: snap=0.000048 sec, exc=0.000079 sec, total=0.000363 sec
wait times: max=infinite
wait counts: calls=0 os=0
occurred after 0.000000 sec of elapsed time
4: waited for 'enq: IV -  contention'
type|mode=0x49560003, id1=0x53594e43, id2=0x80
wait_id=841902242 seq_num=14383 snap_id=1
wait times: snap=0.000284 sec, exc=0.000284 sec, total=0.000284 sec
wait times: max=infinite
wait counts: calls=3 os=3
occurred after 0.000000 sec of elapsed time
5: waited for 'libcache interrupt action by LCK'
location=0x1, =0x0, =0x0
wait_id=841902241 seq_num=14382 snap_id=1
wait times: snap=0.000031 sec, exc=0.000031 sec, total=0.000031 sec
wait times: max=infinite
wait counts: calls=0 os=0
occurred after 0.000002 sec of elapsed time
6: waited for 'rdbms ipc message'
timeout=0xfd, =0x0, =0x0
wait_id=841902240 seq_num=14381 snap_id=1
wait times: snap=0.000436 sec, exc=0.000436 sec, total=0.000436 sec
wait times: max=2.530000 sec
wait counts: calls=1 os=1
occurred after 0.000001 sec of elapsed time
7: waited for 'libcache interrupt action by LCK'
location=0x1, =0x0, =0x0
wait_id=841902239 seq_num=14380 snap_id=1
wait times: snap=0.000001 sec, exc=0.000001 sec, total=0.000001 sec
wait times: max=infinite
wait counts: calls=0 os=0
occurred after 0.000002 sec of elapsed time
8: waited for 'rdbms ipc message'
timeout=0xfd, =0x0, =0x0
wait_id=841902238 seq_num=14379 snap_id=1
wait times: snap=0.000097 sec, exc=0.000097 sec, total=0.000097 sec
wait times: max=2.530000 sec
wait counts: calls=1 os=1
occurred after 0.000001 sec of elapsed time
9: waited for 'libcache interrupt action by LCK'
location=0x1, =0x0, =0x0
wait_id=841902237 seq_num=14378 snap_id=1
wait times: snap=0.000001 sec, exc=0.000001 sec, total=0.000001 sec
wait times: max=infinite
wait counts: calls=0 os=0
occurred after 0.000002 sec of elapsed time
Sampled Session History of session 5829 serial 13529
---------------------------------------------------
The sampled session history is constructed by sampling
the target session every 1 second. The sampling process
captures at each sample if the session is in a non-idle wait,
an idle wait, or not in a wait. If the session is in a
non-idle wait then one interval is shown for all the samples
the session was in the same non-idle wait. If the
session is in an idle wait or not in a wait for
consecutive samples then one interval is shown for all
the consecutive samples. Though we display these consecutive
samples  in a single interval the session may NOT be continuously
idle or not in a wait (the sampling process does not know).
The history is displayed in reverse chronological order.
sample interval: 1 sec, max history 120 sec
---------------------------------------------------
[120 samples,                                        21:23:54 - 21:25:54]
waited for 'libcache interrupt action by LCK', seq_num: 14388
p1: 'location'=0x1
p2: ''=0x0
p3: ''=0x0
time_waited: >= 120 sec (still in wait)
---------------------------------------------------
Sampled Session History Summary:
longest_non_idle_wait: 'libcache interrupt action by LCK'
[120 samples, 21:23:54 - 21:25:54]
time_waited: >= 120 sec (still in wait)
---------------------------------------------------
----------------------------------------
kpoxc: mode=0 static=0 inReq=0 pendErrInRep=0 callUseSCN=0 nonReq=0
kpoxc:  trace=0 roundTrip=0 errCode=0 ctl=0 eocs=0 disErr=0 dscnEnabled=0
temporary object counter: 0
last pdbid to increment: 0, last function to increment: (null)
----------------------------------------
Virtual Thread:
kgskvt: 0x9e9369820, sess: 0x9e4987468/1 sid: 5829 ser: 13529
vc: (nil), proc: 0xa03c72a88, id: 5829
consumer group cur: _ORACLE_BACKGROUND_GROUP_/1 (upd? 0), mapped: _ORACLE_BACKGROUND_GROUP_, orig: 
vt_state: 0x100, vt_flags: 0x4030, blkrun: 0, numa: 1
inwait: 0, short wait event: 0 posted_run: 0
lastmodrngcnt: 0, lastmodrngcnt_loc: '(null)'
lastmodrblcnt: 0, lastmodrblcnt_loc: '(null)'
location where insched last set: kgskthrrun
location where insched last cleared: kgskthrrun1
location where inwait last set: NULL
location where inwait last cleared: NULL
is_assigned: 0, in_sched: 0 (0)
qcls: (nil), qlink: FALSE
vt_active: 0 (pending: 0)
vt_pq_active: 0, dop: 0, pq_servers (cur: 0 cg: 0)
ps_allocs: 0, pxstmts (act: 0, done: 0 cg: 0)
used quanta (usecs):
stmt: 0, accum: 0, mapped: 0, tot: 0
exec start consumed time: 0
exec start elapsed time: 0
idle time: 0, active time: 0 (cg: 0)
last updnumps: 0, active time (pq: 0 ps: 0)
cpu yields:
stmt: 0, accum: 0, mapped: 0, tot: 0
cpu waits:
stmt: 0, accum: 0, mapped: 0, tot: 0
cpu wait time (usecs):
stmt: 0, accum: 0, mapped: 0, tot: 0
ASL queued time outs: 0, time: 0 (cur 0, cg 0)
PQQ queued time outs: 0, time: 0 (cur 0, cg 0)
Queue timeout violation: 0
calls aborted: 0, num est exec limit hit: 0
undo current: 0k max: 0k
KTU Session Commit Cache Dump for IDLs: 
KTU Session Commit Cache Dump for Non-IDLs: 
----------------------------------------
KKS-UOL used : 0 locks(used=0, free=0)
KGX Atomic Operation Log 0x85f5c8388
Mutex (nil)(0, 0) idn 0 oper NONE(0)
Cursor Pin uid 5829 efd 7 whr 10 slp 0
KGX Atomic Operation Log 0x85f5c83d8
Mutex (nil)(0, 0) idn 0 oper NONE(0)
FSO mutex uid 5829 efd 0 whr 0 slp 0
KGX Atomic Operation Log 0x85f5c8428
Mutex (nil)(0, 0) idn 0 oper NONE(0)
FSO mutex uid 5829 efd 0 whr 0 slp 0
KGX Atomic Operation Log 0x85f5c8478
Mutex (nil)(0, 0) idn 0 oper NONE(0)
FSO mutex uid 5829 efd 0 whr 0 slp 0
----------------------------------------
KGL-UOL SO Cache(total=0, free=0)
KGX Atomic Operation Log 0x83ff5cd70
Mutex 0x6fec4aa30(5829, 0) idn ccf13f6f oper EXCL(6)
Library Cache uid 5829 efd 4 whr 148 slp 0
oper=0 pt1=0x6fec4a8e0 pt2=(nil) pt3=(nil)
pt4=(nil) pt5=(nil) ub4=0 flg=0x0
msk=0000-0000-0000-0000-0000
Could not acquire mutex...Returning without dumping
KGX Atomic Operation Log 0x83ff5cdd8
Mutex 0x6fec4a990(5829, 0) idn 35a7b2c oper EXCL(6)
Library Cache uid 5829 efd 4 whr 101 slp 0
oper=0 pt1=(nil) pt2=(nil) pt3=(nil)
pt4=(nil) pt5=(nil) ub4=0 flg=0x0
msk=0000-0000-0000-0000-0000
KGX Atomic Operation Log 0x83ff5ce40
Mutex (nil)(0, 0) idn 0 oper NONE(0)
Library Cache uid 5829 efd 65535 whr 134 slp 0
oper=0 pt1=(nil) pt2=(nil) pt3=(nil)
pt4=(nil) pt5=(nil) ub4=0 flg=0x0
msk=0000-0000-0000-0000-0000
KGX Atomic Operation Log 0x83ff5cea8
Mutex 0x390c8e228(0, 0) idn 0 oper NONE(0)
Library Cache uid 5829 efd 7 whr 163 slp 0
oper=0 pt1=0x390c8e0d8 pt2=(nil) pt3=(nil)
pt4=(nil) pt5=(nil) ub4=0 flg=0x0
msk=0000-0000-0000-0000-0000
KGX Atomic Operation Log 0x83ff5cf10
Mutex (nil)(0, 0) idn 0 oper NONE(0)
Library Cache uid 5829 efd 0 whr 0 slp 0
oper=0 pt1=(nil) pt2=(nil) pt3=(nil)
pt4=(nil) pt5=(nil) ub4=0 flg=0x0
msk=0000-0000-0000-0000-0000
KGX Atomic Operation Log 0x83ff5cf78
Mutex (nil)(0, 0) idn 0 oper NONE(0)
Library Cache uid 5829 efd 0 whr 0 slp 0
oper=0 pt1=(nil) pt2=(nil) pt3=(nil)
pt4=(nil) pt5=(nil) ub4=0 flg=0x0
msk=0000-0000-0000-0000-0000
KGX Atomic Operation Log 0x83ff5cfe0
Mutex (nil)(0, 0) idn 0 oper NONE(0)
Library Cache uid 5829 efd 0 whr 0 slp 0
oper=0 pt1=(nil) pt2=(nil) pt3=(nil)
pt4=(nil) pt5=(nil) ub4=0 flg=0x0
msk=0000-0000-0000-0000-0000
KGX Atomic Operation Log 0x83ff5d048
Mutex (nil)(0, 0) idn 0 oper NONE(0)
Library Cache uid 5829 efd 0 whr 0 slp 0
oper=0 pt1=(nil) pt2=(nil) pt3=(nil)
pt4=(nil) pt5=(nil) ub4=0 flg=0x0
msk=0000-0000-0000-0000-0000
KGX Atomic Operation Log 0x83ff5d0b0
Mutex (nil)(0, 0) idn 0 oper NONE(0)
Library Cache uid 5829 efd 0 whr 0 slp 0
oper=0 pt1=(nil) pt2=(nil) pt3=(nil)
pt4=(nil) pt5=(nil) ub4=0 flg=0x0
msk=0000-0000-0000-0000-0000

100

101

102

103

104

105

106

107

108

109

110

111

112

113

114

115

116

117

118

119

120

121

122

123

124

125

126

127

128

129

130

131

132

133

134

135

136

137

138

139

140

141

142

143

144

145

146

147

148

149

150

151

152

153

154

155

156

157

158

159

160

161

162

163

164

165

166

167

168

169

170

171

172

173

174

175

176

177

178

179

180

181

182

183

184

185

186

187

188

189

190

191

192

193

194

195

196

197

198

199

200

201

202

203

204

205

206

207

208

209

210

211

212

213

214

215

216

217

218

219

220

221

222

223

224

225

226

227

228

229

230

231

232

233

234

235

236

237

238

239

240

241

242

243

244

245

246

247

248

249

SO: 0x9e4987468, type: 4, owner: 0xa03c72a88, flag: INIT/-/-/-/0x00 i f: 0x3 c: 0x3

proc=0xa03c72a88, name=session, file=ksu.h LINE:15737, pg=0, conuid=1

SGA version=(1,0)

(session) sid: 5829 ser: 13529 trans: (nil), creator: 0xa03c72a88

flags: (0x51) USR/- flags2: (0x409) -/-/INC

flags_idl: (0x1) status: BSY/-/-/- kill: -/-/-/-

0000-0000-000000000000-0000-00000000 DID: , short-term DID:

txn branch: (nil)

con_id/con_uid/con_name: 1/1/CDB$ROOT

con_logonuid: 1 con_logonid: 1

con_scuid: 1 con_scid: 1

edition#: 0 user#/name: 0/SYS

oct: 0, prv: 0, sql: (nil), psql: (nil)

stats: 0x97fe0b988, PX stats: 0x110e9744

service name: SYS$BACKGROUND

Current Wait Stack:

0: waiting for 'libcache interrupt action by LCK'

location=0x1, =0x0, =0x0

wait_id=841902246 seq_num=14388 snap_id=1

wait times: snap=16 min 2 sec, exc=16 min 2 sec, total=16 min 2 sec

wait times: max=infinite, heur=16 min 2 sec

wait counts: calls=0 os=0

in_wait=1 iflags=0x5a2

There are 565 sessions blocked by this session.

Dumping one waiter:

inst: 1, sid: 13726, ser: 15144

wait event: 'enq: IV - contention'

p1: 'type|mode'=0x49560005

p2: 'id1'=0x53594e43

p3: 'id2'=0x80

row_wait_obj#: 4294967295, block#: 0, row#: 0, file# 0

min_blocked_time: 949 secs, waiter_cache_ver: 51635

Wait State:

fixed_waits=0 flags=0x22 boundary=(nil)/-1

Session Wait History:

elapsed time of 0.000002 sec since current wait

0: waited for 'rdbms ipc message'

timeout=0xfd, =0x0, =0x0

wait_id=841902245 seq_num=14387 snap_id=1

wait times: snap=0.000318 sec, exc=0.000318 sec, total=0.000318 sec

wait times: max=2.530000 sec

wait counts: calls=1 os=1

occurred after 0.000001 sec of elapsed time

1: waited for 'libcache interrupt action by LCK'

location=0x1, =0x0, =0x0

wait_id=841902244 seq_num=14386 snap_id=1

wait times: snap=0.000001 sec, exc=0.000001 sec, total=0.000001 sec

wait times: max=infinite

wait counts: calls=0 os=0

occurred after 0.000002 sec of elapsed time

2: waited for 'rdbms ipc message'

timeout=0xfd, =0x0, =0x0

wait_id=841902243 seq_num=14385 snap_id=1

wait times: snap=0.000146 sec, exc=0.000146 sec, total=0.000146 sec

wait times: max=2.530000 sec

wait counts: calls=1 os=1

occurred after 0.000002 sec of elapsed time

3: waited for 'libcache interrupt action by LCK'

location=0x1, =0x0, =0x0

wait_id=841902241 seq_num=14384 snap_id=2

wait times: snap=0.000048 sec, exc=0.000079 sec, total=0.000363 sec

wait times: max=infinite

wait counts: calls=0 os=0

occurred after 0.000000 sec of elapsed time

4: waited for 'enq: IV - contention'

type|mode=0x49560003, id1=0x53594e43, id2=0x80

wait_id=841902242 seq_num=14383 snap_id=1

wait times: snap=0.000284 sec, exc=0.000284 sec, total=0.000284 sec

wait times: max=infinite

wait counts: calls=3 os=3

occurred after 0.000000 sec of elapsed time

5: waited for 'libcache interrupt action by LCK'

location=0x1, =0x0, =0x0

wait_id=841902241 seq_num=14382 snap_id=1

wait times: snap=0.000031 sec, exc=0.000031 sec, total=0.000031 sec

wait times: max=infinite

wait counts: calls=0 os=0

occurred after 0.000002 sec of elapsed time

6: waited for 'rdbms ipc message'

timeout=0xfd, =0x0, =0x0

wait_id=841902240 seq_num=14381 snap_id=1

wait times: snap=0.000436 sec, exc=0.000436 sec, total=0.000436 sec

wait times: max=2.530000 sec

wait counts: calls=1 os=1

occurred after 0.000001 sec of elapsed time

7: waited for 'libcache interrupt action by LCK'

location=0x1, =0x0, =0x0

wait_id=841902239 seq_num=14380 snap_id=1

wait times: snap=0.000001 sec, exc=0.000001 sec, total=0.000001 sec

wait times: max=infinite

wait counts: calls=0 os=0

occurred after 0.000002 sec of elapsed time

8: waited for 'rdbms ipc message'

timeout=0xfd, =0x0, =0x0

wait_id=841902238 seq_num=14379 snap_id=1

wait times: snap=0.000097 sec, exc=0.000097 sec, total=0.000097 sec

wait times: max=2.530000 sec

wait counts: calls=1 os=1

occurred after 0.000001 sec of elapsed time

9: waited for 'libcache interrupt action by LCK'

location=0x1, =0x0, =0x0

wait_id=841902237 seq_num=14378 snap_id=1

wait times: snap=0.000001 sec, exc=0.000001 sec, total=0.000001 sec

wait times: max=infinite

wait counts: calls=0 os=0

occurred after 0.000002 sec of elapsed time

Sampled Session History of session 5829 serial 13529

---------------------------------------------------

The sampled session history is constructed by sampling

the target session every 1 second. The sampling process

captures at each sample if the session is in a non-idle wait,

an idle wait, or not in a wait. If the session is in a

non-idle wait then one interval is shown for all the samples

the session was in the same non-idle wait. If the

session is in an idle wait or not in a wait for

consecutive samples then one interval is shown for all

the consecutive samples. Though we display these consecutive

samples in a single interval the session may NOT be continuously

idle or not in a wait (the sampling process does not know).

The history is displayed in reverse chronological order.

sample interval: 1 sec, max history 120 sec

---------------------------------------------------

[120 samples, 21:23:54 - 21:25:54]

waited for 'libcache interrupt action by LCK', seq_num: 14388

p1: 'location'=0x1

p2: ''=0x0

p3: ''=0x0

time_waited: >= 120 sec (still in wait)

---------------------------------------------------

Sampled Session History Summary:

longest_non_idle_wait: 'libcache interrupt action by LCK'

[120 samples, 21:23:54 - 21:25:54]

time_waited: >= 120 sec (still in wait)

---------------------------------------------------

----------------------------------------

kpoxc: mode=0 static=0 inReq=0 pendErrInRep=0 callUseSCN=0 nonReq=0

kpoxc: trace=0 roundTrip=0 errCode=0 ctl=0 eocs=0 disErr=0 dscnEnabled=0

temporary object counter: 0

last pdbid to increment: 0, last function to increment: (null)

----------------------------------------

Virtual Thread:

kgskvt: 0x9e9369820, sess: 0x9e4987468/1 sid: 5829 ser: 13529

vc: (nil), proc: 0xa03c72a88, id: 5829

consumer group cur: _ORACLE_BACKGROUND_GROUP_/1 (upd? 0), mapped: _ORACLE_BACKGROUND_GROUP_, orig:

vt_state: 0x100, vt_flags: 0x4030, blkrun: 0, numa: 1

inwait: 0, short wait event: 0 posted_run: 0

lastmodrngcnt: 0, lastmodrngcnt_loc: '(null)'

lastmodrblcnt: 0, lastmodrblcnt_loc: '(null)'

location where insched last set: kgskthrrun

location where insched last cleared: kgskthrrun1

location where inwait last set: NULL

location where inwait last cleared: NULL

is_assigned: 0, in_sched: 0 (0)

qcls: (nil), qlink: FALSE

vt_active: 0 (pending: 0)

vt_pq_active: 0, dop: 0, pq_servers (cur: 0 cg: 0)

ps_allocs: 0, pxstmts (act: 0, done: 0 cg: 0)

used quanta (usecs):

stmt: 0, accum: 0, mapped: 0, tot: 0

exec start consumed time: 0

exec start elapsed time: 0

idle time: 0, active time: 0 (cg: 0)

last updnumps: 0, active time (pq: 0 ps: 0)

cpu yields:

stmt: 0, accum: 0, mapped: 0, tot: 0

cpu waits:

stmt: 0, accum: 0, mapped: 0, tot: 0

cpu wait time (usecs):

stmt: 0, accum: 0, mapped: 0, tot: 0

ASL queued time outs: 0, time: 0 (cur 0, cg 0)

PQQ queued time outs: 0, time: 0 (cur 0, cg 0)

Queue timeout violation: 0

calls aborted: 0, num est exec limit hit: 0

undo current: 0k max: 0k

KTU Session Commit Cache Dump for IDLs:

KTU Session Commit Cache Dump for Non-IDLs:

----------------------------------------

KKS-UOL used : 0 locks(used=0, free=0)

KGX Atomic Operation Log 0x85f5c8388

Mutex (nil)(0, 0) idn 0 oper NONE(0)

Cursor Pin uid 5829 efd 7 whr 10 slp 0

KGX Atomic Operation Log 0x85f5c83d8

Mutex (nil)(0, 0) idn 0 oper NONE(0)

FSO mutex uid 5829 efd 0 whr 0 slp 0

KGX Atomic Operation Log 0x85f5c8428

Mutex (nil)(0, 0) idn 0 oper NONE(0)

FSO mutex uid 5829 efd 0 whr 0 slp 0

KGX Atomic Operation Log 0x85f5c8478

Mutex (nil)(0, 0) idn 0 oper NONE(0)

FSO mutex uid 5829 efd 0 whr 0 slp 0

----------------------------------------

KGL-UOL SO Cache(total=0, free=0)

KGX Atomic Operation Log 0x83ff5cd70

Mutex 0x6fec4aa30(5829, 0) idn ccf13f6f oper EXCL(6)

Library Cache uid 5829 efd 4 whr 148 slp 0

oper=0 pt1=0x6fec4a8e0 pt2=(nil) pt3=(nil)

pt4=(nil) pt5=(nil) ub4=0 flg=0x0

msk=0000-0000-0000-0000-0000

Could not acquire mutex...Returning without dumping

KGX Atomic Operation Log 0x83ff5cdd8

Mutex 0x6fec4a990(5829, 0) idn 35a7b2c oper EXCL(6)

Library Cache uid 5829 efd 4 whr 101 slp 0

oper=0 pt1=(nil) pt2=(nil) pt3=(nil)

pt4=(nil) pt5=(nil) ub4=0 flg=0x0

msk=0000-0000-0000-0000-0000

KGX Atomic Operation Log 0x83ff5ce40

Mutex (nil)(0, 0) idn 0 oper NONE(0)

Library Cache uid 5829 efd 65535 whr 134 slp 0

oper=0 pt1=(nil) pt2=(nil) pt3=(nil)

pt4=(nil) pt5=(nil) ub4=0 flg=0x0

msk=0000-0000-0000-0000-0000

KGX Atomic Operation Log 0x83ff5cea8

Mutex 0x390c8e228(0, 0) idn 0 oper NONE(0)

Library Cache uid 5829 efd 7 whr 163 slp 0

oper=0 pt1=0x390c8e0d8 pt2=(nil) pt3=(nil)

pt4=(nil) pt5=(nil) ub4=0 flg=0x0

msk=0000-0000-0000-0000-0000

KGX Atomic Operation Log 0x83ff5cf10

Mutex (nil)(0, 0) idn 0 oper NONE(0)

Library Cache uid 5829 efd 0 whr 0 slp 0

oper=0 pt1=(nil) pt2=(nil) pt3=(nil)

pt4=(nil) pt5=(nil) ub4=0 flg=0x0

msk=0000-0000-0000-0000-0000

KGX Atomic Operation Log 0x83ff5cf78

Mutex (nil)(0, 0) idn 0 oper NONE(0)

Library Cache uid 5829 efd 0 whr 0 slp 0

oper=0 pt1=(nil) pt2=(nil) pt3=(nil)

pt4=(nil) pt5=(nil) ub4=0 flg=0x0

msk=0000-0000-0000-0000-0000

KGX Atomic Operation Log 0x83ff5cfe0

Mutex (nil)(0, 0) idn 0 oper NONE(0)

Library Cache uid 5829 efd 0 whr 0 slp 0

oper=0 pt1=(nil) pt2=(nil) pt3=(nil)

pt4=(nil) pt5=(nil) ub4=0 flg=0x0

msk=0000-0000-0000-0000-0000

KGX Atomic Operation Log 0x83ff5d048

Mutex (nil)(0, 0) idn 0 oper NONE(0)

Library Cache uid 5829 efd 0 whr 0 slp 0

oper=0 pt1=(nil) pt2=(nil) pt3=(nil)

pt4=(nil) pt5=(nil) ub4=0 flg=0x0

msk=0000-0000-0000-0000-0000

KGX Atomic Operation Log 0x83ff5d0b0

Mutex (nil)(0, 0) idn 0 oper NONE(0)

Library Cache uid 5829 efd 0 whr 0 slp 0

oper=0 pt1=(nil) pt2=(nil) pt3=(nil)

pt4=(nil) pt5=(nil) ub4=0 flg=0x0

msk=0000-0000-0000-0000-0000

可以看到当时LCK1已经以EXCL模式持有了两个mutex，还有一句Could not acquire mutex…Returning without dumping 。没见过，看着像还在请求其他mutex，并且失败。

搜索mos发现与Bug 30384121 – lck process hang causing entire database to hang (Doc ID 30384121.8)非常匹配，包括关键的call stack信息。

@ I think that we are more likely to be looping in the kglHandleMessage() while 
@ loop as we might spin if kglobscfix_callback() cannot acquire the mutex. 
@ RELEASE NOTES:
@ ]] It was possible for LCKx to block other sessions when it was trying to 
@ ]] acquire handle and dependency mutexes. This problem has now been
@ ]] corrected.
@ ]]  
@ .
@ INTERNAL PROBLEM DESCRIPTION:
@ Under rare conditions LCKx might repeatedly execute the following while
@ loop
@ because it is unable to get a handle mutex in nowait mode under 
@ kglobscfix_callback();
@ .
@ while (kglScanDependents(gp, hd, kglobscfix_callback, (void *)0));
@ .
@ and as we immediately re-get the dependency mutex under 
@ kglScanDependents->kglScanReferences but fail to get the handle mutex under
@ kglobscfix_callback() we can block other sessions that try to acquire the 
@ dependency mutex but can give up and retry (as is the case for CLMN/PMON).
@  
@ .
@ INTERNAL FIX DESCRIPTION:
@ The fix is to amend the while loop so that we enter a small wait which
@ should 
@ then allow blocked processes the ability to acquire the relevant mutex.

@ I think that we are more likely to be looping in the kglHandleMessage() while

@ loop as we might spin if kglobscfix_callback() cannot acquire the mutex.

@ RELEASE NOTES:

@ ]] It was possible for LCKx to block other sessions when it was trying to

@ ]] acquire handle and dependency mutexes. This problem has now been

@ ]] corrected.

@ ]]

@ .

@ INTERNAL PROBLEM DESCRIPTION:

@ Under rare conditions LCKx might repeatedly execute the following while

@ loop

@ because it is unable to get a handle mutex in nowait mode under

@ kglobscfix_callback();

@ .

@ while (kglScanDependents(gp, hd, kglobscfix_callback, (void *)0));

@ .

@ and as we immediately re-get the dependency mutex under

@ kglScanDependents->kglScanReferences but fail to get the handle mutex under

@ kglobscfix_callback() we can block other sessions that try to acquire the

@ dependency mutex but can give up and retry (as is the case for CLMN/PMON).

@ .

@ INTERNAL FIX DESCRIPTION:

@ The fix is to amend the while loop so that we enter a small wait which

@ should

@ then allow blocked processes the ability to acquire the relevant mutex.

bug描述与trace发现的信息几乎完全匹配。基本可以确定就是这个bug引起的故障。

数据库的patch信息为：

oracle@gapdb1:/home/oracle> opatch lspatches
28865544;
29123189;
27751755;
25163866;
23333567;
28790651;OJVM RELEASE UPDATE: 12.2.0.1.190115 (28790651)
28870605;OCW JAN 2019 RELEASE UPDATE 12.2.0.1.190115 (28870605)
28822515;Database Jan 2019 Release Update : 12.2.0.1.190115 (28822515)

oracle@gapdb1:/home/oracle> opatch lspatches

28865544;

29123189;

27751755;

25163866;

23333567;

28790651;OJVM RELEASE UPDATE: 12.2.0.1.190115 (28790651)

28870605;OCW JAN 2019 RELEASE UPDATE 12.2.0.1.190115 (28870605)

28822515;Database Jan 2019 Release Update : 12.2.0.1.190115 (28822515)

而该bug将在12.2的202007这个ru中修复。

over！！！

案例：troubleshooting LCK PROCESS HANG CAUSING ENTIRE DATABASE TO HANG

发表回复取消回复

近期文章

近期评论

案例：troubleshooting LCK PROCESS HANG CAUSING ENTIRE DATABASE TO HANG

发表回复 取消回复

近期文章

近期评论

发表回复取消回复