本案例来自南区某客户,数据库版本为rac 19c。故障现象为2节点大约凌晨2点开始,堆积了大量等待library cache lock和library cache:mutex X的session。一直持续了几分钟自动恢复正常。
Snap Id | Snap Time | Sessions | Cursors/Session | Instances | |
---|---|---|---|---|---|
Begin Snap: | 4250 | 18-Oct-22 02:00:20 | 1101 | 12.5 | 2 |
End Snap: | 4251 | 18-Oct-22 03:00:13 | 1106 | 12.0 | 2 |
Elapsed: | 59.89 (mins) | ||||
DB Time: | 14,394.56 (mins) |
Top 10 Foreground Events by Total Wait Time
Event | Waits | Total Wait Time (sec) | Avg Wait | % DB time | Wait Class |
---|---|---|---|---|---|
library cache lock | 1,931,334 | 569.8K | 295.05ms | 66.0 | Concurrency |
library cache: mutex X | 3,686,589 | 205.1K | 55.63ms | 23.7 | Concurrency |
DB CPU | 36K | 4.2 |
从awr的一些基本信息来看,确实lc lock和lc mutex x等待非常严重。通常关于mutex类的争用,可以去看看awr中Mutex Sleep Summary部分。
Mutex Sleep Summary
- ordered by number of sleeps desc
Mutex Type | Location | Sleeps | Wait Time (ms) |
---|---|---|---|
Library Cache | kgllkal2 81 | 18,969,541 | 64,710,767 |
Library Cache | kgllkdl1 85 | 5,079,086 | 26,691,704 |
Library Cache | kglhdgn2 106 | 1,383,344 | 6,754,413 |
- kgllkdl – kernel generic library cache management library cache lock delete
- kgllkal – kernel generic library cache management library cache lock allocate
- kglhdgn – kernel generic library cache management object handle get handle, if it doesn’t exist, create it
根据sleep的情况来看,mutex的争用主要来自kgllkal、kgllkdl、kglhdgn,分别对应lc lock的分配和释放。而kglhdgn对应的是library cache handle。
我们知道在父游标句柄中保存有SQL文本。搜索hash链表的目的,就是对比每个父游标句柄中的SQL文本,找到目标父游标句柄。找到之后,要再次以独占方式申请持有类型为“Library Cache”的Mutex,如果遇到竞争,此处的等待事件也是library cache: mutex X,为了加以区分,此处的Mutex Miss通常是“kglhdgn2 106”。
在此处的Mutex保护下,进程将进一步获得父游标句柄上的Library cache lock。Library cache lock获得成功,Mutex才会释放。这也是为什么library cache lock和library cache:mutex X会在top event同时出现的原因。
lc handle上的Mutex,代替的是以前版本的Library Cache Lock Latch。由于要依靠Library Cache Lock实现依赖链,Mutex并没有取代Library Cache Lock,只是把Library Cache Lock Latch替代了。
分析到这里,可以判断问题的根源是library cache lock。此前多篇lc lock的文章都提到过lc lock的分析主要是分析P3参数。通过awr中的ash信息可以查看。
Top Event P1/P2/P3 Values
- Top Events by DB Time and the top P1/P2/P3 values for those events.
- % Event is the percentage of DB Time due to the event
- % Activity is the percentage of DB Time due to the event with the given P1,P2,P3 Values.
Event | % Event | P1, P2, P3 Values | % Activity | Parameter 1 | Parameter 2 | Parameter 3 |
---|---|---|---|---|---|---|
library cache lock | 69.38 | “215109965320”,”208720642768″,”5373954″ | 0.05 | handle address | lock address | 100*mode+namespace |
library cache: mutex X | 25.23 | “1370341983”,”4514010628096″,”5374033″ | 1.54 | idn | value | where |
对P3进行换算之后可以查出对应的namespace和请求mode
1 2 3 4 5 6 7 8 9 10 11 |
SQL> select to_char('5373954','xxxxxxxxxxxxxxxxxxxxx') from dual; TO_CHAR('5373954','XXX ---------------------- 520002 SQL> select KGLSTDSC from x$kglst where INDX=(select to_number('52','xxxxxxxxxxxxxxxxx') from dual); KGLSTDSC ---------------------------------------------------------------- SQL AREA BUILD |
lc lock等待是发生在SQL AREA BUILD的handle上的,并且请求模式是S,说明当时肯定有session持有X的lc lock或者有X的lc lock请求。
首先说说SQL AREA BUILD,该namespace的lc lock只会发生在硬解析时生成子游标中,Oracle 在11.2 版本引入了Cursor Build Lock 机制,这一机制使得在某个父游标下创建子游标的工作串行化。当获取Build Lock 时,需要持有对应SQL AREA BUILD的Library Cache Lock。
通过_kgl_debug跟踪发现,当产生子游标时,在SQL AREA BUILD上,kgllkal的lockMode为S,而kgllkdl的lockMode为X。其中name部分LXY.$BUILD$.8af744b82f4cff89中的8af744b82f4cff89为sql的full hash value。
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 |
<KGLTRACE> <Timestamp>2022-10-31 22:30:00.732</Timestamp> <SID>31</SID> <Function>kgllkal</Function> <Reason>TRACELOCK</Reason> <Param1>0x6bf307f8</Param1> <Param2>0</Param2> <LibraryHandle> <Address>0x6125cd30</Address> <Hash>c5b6f396</Hash> <LockMode>S</LockMode> <PinMode>0</PinMode> <LoadLockMode>0</LoadLockMode> <Status>0</Status> <Subpool>1</Subpool> <ObjectName> <Name>LXY.$BUILD$.8af744b82f4cff89 </Name> <FullHashValue>0db08ca1c94cf646b31eb140c5b6f396</FullHashValue> <Namespace>SQL AREA BUILD(82)</Namespace> <Type>CURSOR(00)</Type> <ContainerId>3</ContainerId> <ContainerUid>2360028483</ContainerUid> <Identifier>0</Identifier> <OwnerIdn>0</OwnerIdn> </ObjectName> </LibraryHandle> <LibraryObjectLock> <Address>0x6bf307f8</Address> <Handle>0x6125cd30</Handle> <Mode>S</Mode> </LibraryObjectLock> </KGLTRACE> <KGLTRACE> <Timestamp>2022-10-31 22:30:00.780</Timestamp> <SID>31</SID> <Function>kgllkdl</Function> <Reason>TRACELOCK</Reason> <Param1>0x6bf307f8</Param1> <Param2>1</Param2> <LibraryHandle> <Address>0x6125cd30</Address> <Hash>c5b6f396</Hash> <LockMode>X</LockMode> <PinMode>0</PinMode> <LoadLockMode>0</LoadLockMode> <Status>0</Status> <Subpool>1</Subpool> <ObjectName> <Name>LXY.$BUILD$.8af744b82f4cff89 </Name> <FullHashValue>0db08ca1c94cf646b31eb140c5b6f396</FullHashValue> <Namespace>SQL AREA BUILD(82)</Namespace> <Type>CURSOR(00)</Type> <ContainerId>3</ContainerId> <ContainerUid>2360028483</ContainerUid> <Identifier>0</Identifier> <OwnerIdn>0</OwnerIdn> </ObjectName> </LibraryHandle> <LibraryObjectLock> <Address>0x6bf307f8</Address> <Handle>0x6125cd30</Handle> <Mode>X</Mode> </LibraryObjectLock> </KGLTRACE> |
那么可以判断如果发生了SQL AREA BUILD上的lc lock争用,那么当时肯定有新的子游标产生。剩下的问题就是分析为何会产生该子游标了。
1 2 3 4 5 6 7 8 9 10 11 12 13 |
SQL> select child_number,plan_hash_value,FIRST_LOAD_TIME,last_load_time,IS_BIND_SENSITIVE,is_bind_aware,is_shareable from gv$sql where inst_id=2 and sql_id='a2v3du5a69074' ; CHILD_NUMBER PLAN_HASH_VALUE FIRST_LOAD_TIME LAST_LOAD_TIME IS IS IS ------------ --------------- ---------------------------------------- ---------------------------------------- -- -- -- 5 3403358360 2022-06-03/18:52:40 2022-10-19/04:35:49 Y N Y 8 3403358360 2022-06-03/18:52:40 2022-10-18/02:05:41 Y N Y 25 3403358360 2022-06-03/18:52:40 2022-10-19/02:22:05 Y N Y 26 3403358360 2022-06-03/18:52:40 2022-10-19/03:30:07 Y N Y 27 3403358360 2022-06-03/18:52:40 2022-10-19/03:39:51 Y N Y 28 3403358360 2022-06-03/18:52:40 2022-10-19/05:00:52 Y N Y 32 3403358360 2022-06-03/18:52:40 2022-10-09/09:19:23 Y N Y 7 rows selected. |
可以看到故障前实例中存在的子游标是32,而子游标8正是故障时段产生的。故障是2点开始,而该子游标2点05分才被load成功。说明花了5分钟才生成该子游标。
从sql的执行情况来看,该sql本身就是并发非常高的sql。对于产生新子游标这种事情出问题的概率就非常高了。
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 |
+------------------------------------------------------------------------+ | information from v$sql | +------------------------------------------------------------------------+ PLAN CHI USER CPU(MS) ELA(MS) DISK GET ROWS ROWS APPLI(MS) CONCUR(MS) CLUSTER(MS) USER_IO(MS) FIRST_LOAD_TIME EXEC HASH VALUE NUM NAME PRE EXEC PRE EXEC PRE EXEC PRE EXEC PRE EXEC PRE FETCH PER EXEC PER EXEC PER EXEC PER EXEC LAST_LOAD_TIME ---------- ------------- ---- ---------- -------- -------- -------- ------------ ----------- --------- --------- ---------- ----------- ----------- ---------------------- 4192782 3403358360 5 APHMES 0 0 0 4 1 1 0 0 0 0 06-03/18:5.10-19/04:3 1989093 3403358360 8 APHMES 0 24 0 4 1 1 0 23 0 0 06-03/18:5.10-18/02:0 19208800 3403358360 25 APHMES 0 0 0 4 1 1 0 0 0 0 06-03/18:5.10-19/02:2 23192453 3403358360 26 APHMES 0 0 0 4 1 1 0 0 0 0 06-03/18:5.10-19/03:3 30173 3403358360 27 APHMES 0 0 0 5 1 1 0 0 0 0 06-03/18:5.10-19/03:3 1440766 3403358360 28 APHMES 0 0 0 4 1 1 0 0 0 0 06-03/18:5.10-19/05:0 24842454 3403358360 32 APHMES 0 0 0 4 1 1 0 0 0 0 06-03/18:5.10-09/09:1 7 rows selected. +------------------------------------------------------------------------+ | information from awr sysdate-7 | +------------------------------------------------------------------------+ PLAN GET DISK WRITE ROWS ROWS USER_IO(MS) ELA(MS) CPU(MS) CLUSTER(MS) PLSQL END_TI I NAME HASH VALUE EXEC PRE EXEC PRE EXEC PER EXEC ROW_P PRE EXEC PRE FETCH PER EXEC PRE EXEC PRE EXEC PER EXEC PER EXEC ------ - --------------- ------------- ---------- ------------ -------- -------- ----- ----------- --------- ----------- -------- -------- ----------- -------- 14 23 2 APHMES 3403358360 250W 4 0 0 250W 0 1 0 34 24 0 0 15 00 2 APHMES 3403358360 265W 4 0 0 265W 0 1 0 34 24 0 0 15 01 2 APHMES 3403358360 244W 4 0 0 244W 0 1 0 34 24 0 0 15 02 2 APHMES 3403358360 239W 4 0 0 239W 0 1 0 33 23 0 0 15 03 2 APHMES 3403358360 256W 4 0 0 256W 0 1 0 33 24 0 0 15 04 2 APHMES 3403358360 253W 4 0 0 253W 0 1 0 34 24 0 0 15 05 2 APHMES 3403358360 248W 4 0 0 248W 0 1 0 33 24 0 0 15 06 2 APHMES 3403358360 223W 4 0 0 223W 0 1 0 33 24 0 0 15 07 2 APHMES 3403358360 111W 4 0 0 111W 0 1 0 35 25 0 0 15 08 2 APHMES 3403358360 72.W 4 0 0 72.W 0 1 0 35 25 0 0 15 09 2 APHMES 3403358360 263W 4 0 0 263W 0 1 0 34 24 0 0 15 10 2 APHMES 3403358360 310W 4 0 0 310W 0 1 0 34 25 0 0 15 11 2 APHMES 3403358360 307W 4 0 0 307W 0 1 0 34 24 0 0 15 12 2 APHMES 3403358360 292W 4 0 0 292W 0 1 0 34 24 0 0 15 13 2 APHMES 3403358360 265W 4 0 0 265W 0 1 0 32 23 0 0 15 14 2 APHMES 3403358360 274W 4 0 0 274W 0 1 0 33 25 0 0 15 15 2 APHMES 3403358360 307W 4 0 0 307W 0 1 0 34 26 0 0 15 16 2 APHMES 3403358360 307W 4 0 0 307W 0 1 0 35 26 0 0 15 17 2 APHMES 3403358360 303W 4 0 0 303W 1 1 0 35 26 0 0 15 18 2 APHMES 3403358360 276W 4 0 0 276W 0 1 0 34 26 0 0 15 19 2 APHMES 3403358360 202W 4 0 0 202W 0 1 0 35 26 0 0 15 20 2 APHMES 3403358360 118W 4 0 0 118W 0 1 0 34 26 0 0 15 21 2 APHMES 3403358360 239W 4 0 0 239W 0 1 0 33 25 0 0 15 22 2 APHMES 3403358360 271W 4 0 0 271W 1 1 0 33 25 0 0 15 23 2 APHMES 3403358360 273W 4 0 0 273W 0 1 0 34 25 0 0 16 00 2 APHMES 3403358360 277W 4 0 0 277W 0 1 0 34 25 0 0 16 01 2 APHMES 3403358360 260W 4 0 0 260W 0 1 0 33 25 0 0 16 02 2 APHMES 3403358360 259W 4 0 0 259W 0 1 0 33 25 0 0 16 03 2 APHMES 3403358360 265W 4 0 0 265W 0 1 0 33 25 0 0 16 04 2 APHMES 3403358360 266W 4 0 0 266W 0 1 0 34 25 0 0 16 05 2 APHMES 3403358360 268W 4 0 0 268W 0 1 0 34 25 0 0 16 06 2 APHMES 3403358360 198W 4 0 0 198W 0 1 0 33 25 0 0 16 07 2 APHMES 3403358360 127W 4 0 0 127W 0 1 0 35 26 0 0 16 08 2 APHMES 3403358360 82.W 4 0 0 82.W 0 1 0 34 26 0 0 16 09 2 APHMES 3403358360 279W 4 0 0 279W 0 1 0 34 25 0 0 16 10 2 APHMES 3403358360 299W 4 0 0 299W 0 1 0 34 26 0 0 16 11 2 APHMES 3403358360 312W 4 0 0 312W 0 1 0 34 26 0 0 16 12 2 APHMES 3403358360 316W 4 0 0 316W 0 1 0 35 26 0 0 16 13 2 APHMES 3403358360 292W 4 0 0 292W 0 1 0 34 26 0 0 16 14 2 APHMES 3403358360 299W 4 0 0 299W 0 1 0 34 26 0 0 16 15 2 APHMES 3403358360 310W 4 0 0 310W 1 1 0 35 27 0 0 16 16 2 APHMES 3403358360 314W 4 0 0 314W 0 1 0 35 27 0 0 16 17 2 APHMES 3403358360 302W 4 0 0 302W 1 1 0 35 27 0 0 16 18 2 APHMES 3403358360 243W 4 0 0 243W 0 1 0 35 27 0 0 16 19 2 APHMES 3403358360 157W 4 0 0 157W 0 1 0 35 26 0 0 16 20 2 APHMES 3403358360 80.W 4 0 0 80.W 0 1 0 35 26 0 0 16 21 2 APHMES 3403358360 233W 4 0 0 233W 0 1 0 34 26 0 0 16 22 2 APHMES 3403358360 255W 4 0 0 255W 0 1 0 33 25 0 0 16 23 2 APHMES 3403358360 259W 4 0 0 259W 0 1 0 33 25 0 0 17 00 2 APHMES 3403358360 274W 4 0 0 274W 1 1 0 34 26 0 0 17 01 2 APHMES 3403358360 252W 4 0 0 252W 0 1 0 33 25 0 0 17 02 2 APHMES 3403358360 247W 4 0 0 247W 0 1 0 33 25 0 0 17 03 2 APHMES 3403358360 258W 4 0 0 258W 0 1 0 33 25 0 0 17 04 2 APHMES 3403358360 267W 4 0 0 267W 0 1 0 34 25 0 0 17 05 2 APHMES 3403358360 257W 4 0 0 257W 0 1 0 33 25 0 0 17 06 2 APHMES 3403358360 202W 4 0 0 202W 0 1 0 33 25 0 0 17 07 2 APHMES 3403358360 120W 4 0 0 120W 0 1 0 34 25 0 0 17 08 2 APHMES 3403358360 80.W 4 0 0 80.W 0 1 0 34 25 0 0 17 09 2 APHMES 3403358360 262W 4 0 0 262W 0 1 0 33 25 0 0 17 10 2 APHMES 3403358360 314W 4 0 0 314W 0 1 0 33 25 0 0 17 11 2 APHMES 3403358360 304W 4 0 0 304W 0 1 0 33 25 0 0 17 12 2 APHMES 3403358360 301W 4 0 0 301W 0 1 0 33 25 0 0 17 13 2 APHMES 3403358360 281W 4 0 0 281W 0 1 0 32 25 0 0 17 14 2 APHMES 3403358360 288W 4 0 0 288W 0 1 0 33 25 0 0 17 15 2 APHMES 3403358360 315W 4 0 0 315W 0 1 0 33 26 0 0 17 16 2 APHMES 3403358360 314W 4 0 0 314W 0 1 0 34 26 0 0 17 17 2 APHMES 3403358360 309W 4 0 0 309W 0 1 0 34 26 0 0 17 18 2 APHMES 3403358360 250W 4 0 0 250W 0 1 0 34 26 0 0 17 19 2 APHMES 3403358360 162W 4 0 0 162W 0 1 0 35 26 0 0 17 20 2 APHMES 3403358360 89.W 4 0 0 89.W 0 1 0 35 26 0 0 17 21 2 APHMES 3403358360 245W 4 0 0 245W 0 1 0 33 25 0 0 17 22 2 APHMES 3403358360 267W 4 0 0 267W 0 1 0 34 26 0 0 17 23 2 APHMES 3403358360 275W 4 0 0 275W 0 1 0 34 26 0 0 18 00 2 APHMES 3403358360 274W 4 0 0 274W 0 1 0 34 26 0 0 18 01 2 APHMES 3403358360 249W 4 0 0 249W 0 1 0 33 25 0 0 18 02 2 APHMES 3403358360 244W 4 0 0 244W 0 1 0 36 26 0 0 18 03 2 APHMES 3403358360 185W 4 0 0 185W 0 1 0 229,010 184 0 0 18 04 2 APHMES 3403358360 262W 4 0 0 262W 0 1 0 34 25 0 0 18 05 2 APHMES 3403358360 253W 4 0 0 253W 0 1 0 34 25 0 0 18 06 2 APHMES 3403358360 213W 4 0 0 213W 0 1 0 33 25 0 0 18 07 2 APHMES 3403358360 121W 4 0 0 121W 0 1 0 34 26 |
通常会通过v$sql_shared_cursor去分析子游标不能共享的原因。子游标不能共享的原因会按发现有cursor不能共享的时间以xml的格式添加在reason字段中。
当时32号子游标为啥不能共享的原因可以看到是Bind mismatch。
1 2 3 4 5 6 7 8 9 10 |
32 Bind mismatch(17): <ChildNode> <ChildNumber>32</ChildNumber> <ID>39</ID> <reason>Bind mismatch(17)</reason> <size>4x8</size> <bind_position>0000000100000000 </bind_position> <original_oacflg>0000000300000000 </original_oacflg> <original_oacflg>0000000300000000 </original_oacflg> <new_oacflg>0000001300000000 </new_oacflg> </ChildNode> |
通过DBMS_SQLTUNE包可以查看子游标的bind值。
childnumber:32
1 2 3 4 5 6 7 8 9 |
SQL> select position, datatype_string,last_captured,value_string 2 FROM TABLE (SELECT DBMS_SQLTUNE.extract_binds (bind_data) 3 FROM v$sql 4 WHERE sql_id='a2v3du5a69074' AND child_number = 32); POSITION DATATYPE_STRING LAST_CAPTURED VALUE_STRING ---------- ------------------------------ ------------------- ---------------------------------------- 1 VARCHAR2(8192) 2022-10-14 01:34:22 GRW237303Z01XYYAU+014 2 VARCHAR2(128) 2022-10-14 01:34:22 e396691574a049208bec0d520f40efd7 |
childnumber:8
1 2 3 4 5 6 7 8 9 |
SQL> select position, datatype_string,last_captured,value_string 2 FROM TABLE (SELECT DBMS_SQLTUNE.extract_binds (bind_data) 3 FROM v$sql 4 WHERE sql_id='a2v3du5a69074' AND child_number = 8); POSITION DATATYPE_STRING LAST_CAPTURED VALUE_STRING ---------- ------------------------------ ------------------- ---------------------------------------- 1 VARCHAR2(32767) 2022-10-19 04:50:41 F8Y243105421LRHAH+235 2 VARCHAR2(32) 2022-10-19 04:50:41 e396691574a049208bec0d520f40efd7 |
可以看到绑定变量长度差距很大,明显32和128的varchar2是不能共享的。
后面又发现在故障前,shared pool中大量内存kglhd(handle)和sqla(heap 6)被释放,初步怀疑被flush了shared pool。
对于该故障我的建议就是对于高并发的sql,应用尽量使用绑定变量长度一致的绑定变量,避免此类问题发生。
over!!!