本案例来自东区某金融客户,数据库版本为19c,故障现象为备库每天早上8点40左右,备库的LGWR都会被阻塞。从而其他应用因为请求不到instance lock也被LGWR阻塞,让客户在第二天重现的时候收集了systemstate dump进行分析。
LGWR PROCESS STATE OBJECT:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 |
PROCESS 14: LGWR ---------------------------------------- SO: 0x3d21de9858, type: 2, owner: (nil), flag: INIT/-/-/0x00 if: 0x3 c: 0x3 proc=0x3d21de9858, name=process, file=ksu.h LINE:12721, pg=0 (process) Oracle pid:14, ser:2, calls cur/top: 0x3ec22fda30/0x3ec22fda30 flags : (0x6) SYSTEM flags2: (0x0), flags3: (0x10) intr error: 0, call error: 0, sess error: 0, txn error 0 intr queue: empty ksudlp FALSE at location: 0 (post info) last post received: 0 0 26 last post received-location: ksa2.h LINE:285 ID:ksasnd last process to post me: 0x3da1e98788 1 0 last post sent: 0 0 259 last post sent-location: kgl.h LINE:8873 ID:kgllkdl: post after freeing latch last process posted by me: 0x3da1eb2980 171 0 (latch info) wait_event=0 bits=0x0 Process Group: DEFAULT, pseudo proc: 0x3d21ec8180 O/S info: user: oracle, term: UNKNOWN, ospid: 172824 OSD pid info: Unix process pid: 172824, image: oracle@FMS11DG (LGWR) Short stack dump: ksedsts()+465<-ksdxfstk()+32<-ksdxcb()+1927<-sspuser()+112<-__sighandler()<-semtimedop()+10<-skgpwwait()+178<-ksliwat()+2066<-kslwaitctx()+163<-ksfwaitctx()+14<-kglLockWait()+871<-kgllkal()+1166<-kglLock()+1118<-kglLockInstance()+259<-kksExclusiveParseLock()+47<-krdrsb_mdflush_local()+126<-krdrsb_adv_qscn()+1652<-ksbabs()+771<-ksbrdp()+1045<-opirip()+623<-opidrv()+603<-sou2o()+103<-opimai_real()+250<-ssthrdmain()+265<-main()+201<-__libc_start_main()+245 ---------------------------------------- |
LGWR SESSION STATE OBJECT:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 |
SO: 0x3d21f74f20, type: 4, owner: 0x3d21de9858, flag: INIT/-/-/0x00 if: 0x3 c: 0x3 proc=0x3d21de9858, name=session, file=ksu.h LINE:12729, pg=0 (session) sid: 337 ser: 3 trans: (nil), creator: 0x3d21de9858 flags: (0x51) USR/- flags_idl: (0x1) BSY/-/-/-/-/- flags2: (0x409) -/-/INC DID: , short-term DID: txn branch: (nil) edition#: 0 oct: 0, prv: 0, sql: (nil), psql: (nil), user: 0/SYS ksuxds FALSE at location: 0 service name: SYS$BACKGROUND Current Wait Stack: 0: waiting for 'library cache lock' handle address=0x3e5ff630f8, lock address=0x3e7ff41450, 100*mode+namespace=0x1004a0003 wait_id=9628259 seq_num=60150 snap_id=1 wait times: snap=1.306065 sec, exc=1.306065 sec, total=1.306065 sec wait times: max=5.500000 sec, heur=1 min 8 sec wait counts: calls=2 os=2 in_wait=1 iflags=0x15a2 |
LGWR SID为337,等待lc lock,namespace一看就是instance lock,请求模式为X。
这里偷了个懒,很碰巧看到了final blocking session,就省的去搜索哪个session持有了instance lock了。
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 |
SO: 0x3d42261708, type: 4, owner: 0x3da1eae6a0, flag: INIT/-/-/0x00 if: 0x3 c: 0x3 proc=0x3da1eae6a0, name=session, file=ksu.h LINE:12729, pg=0 (session) sid: 2045 ser: 1 trans: (nil), creator: 0x3da1eae6a0 flags: (0x8000041) USR/- flags_idl: (0x1) BSY/-/-/-/-/- flags2: (0x40009) -/-/INC DID: , short-term DID: txn branch: (nil) edition#: 2605814 oct: 3, prv: 0, sql: 0x3e9fb65bd0, psql: 0x3e5e5b1860, user: 141/DBAAS_SYS ksuxds FALSE at location: 0 service name: fmsdg client details: O/S info: user: zcloud, term: , ospid: 7395 machine: VCOM01PXY program: zoramon_collector@VCOM01PXY (TNS V1-V3) application name: zoramon_collector@VCOM01PXY (TNS V1-V3), hash value=2183771371 Current Wait Stack: 0: waiting for 'library cache lock' handle address=0x3e5ff630f8, lock address=0x3e1fbb8dd0, 100*mode+namespace=0x1004a0002 wait_id=5523383 seq_num=18444 snap_id=1 wait times: snap=1.336710 sec, exc=1.336710 sec, total=1.336710 sec wait times: max=15 min 0 sec, heur=1.336710 sec wait counts: calls=2 os=2 in_wait=1 iflags=0x15a2 There is at least one session blocking this session. Dumping 1 direct blocker(s): inst: 1, sid: 337, ser: 3 Dumping final blocker: inst: 1, sid: 1976, ser: 29501 |
final blocker为session 1976
final blocker session state object:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 |
SO: 0x3dc221a830, type: 4, owner: 0x3ec1e41a68, flag: INIT/-/-/0x00 if: 0x3 c: 0x3 proc=0x3ec1e41a68, name=session, file=ksu.h LINE:12729, pg=0 (session) sid: 1976 ser: 29501 trans: (nil), creator: 0x3ec1e41a68 flags: (0x41) USR/- flags_idl: (0x1) BSY/-/-/-/-/- flags2: (0x40009) -/-/INC DID: , short-term DID: txn branch: (nil) edition#: 2605814 oct: 3, prv: 0, sql: 0x3e5e197d50, psql: 0x3e9f487090, user: 142/FIX_DBQUERY ksuxds FALSE at location: 0 service name: SYS$USERS client details: O/S info: user: afaim, term: unknown, ospid: 1234 machine: VFIX02APP program: JDBC Thin Client application name: JDBC Thin Client, hash value=2546894660 Current Wait Stack: Not in wait; last wait ended 1 min 16 sec ago There are 2 sessions blocked by this session. Dumping one waiter: inst: 1, sid: 337, ser: 3 wait event: 'library cache lock' p1: 'handle address'=0x3e5ff630f8 p2: 'lock address'=0x3e7ff41450 p3: '100*mode+namespace'=0x1004a0003 row_wait_obj#: 4294967295, block#: 0, row#: 0, file# 0 min_blocked_time: 0 secs, waiter_cache_ver: 24873 |
final blocker没有任何等待,正在执行的sql为:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 |
SO: 0x3df82faf90, type: 78, owner: 0x3dc221a830, flag: INIT/-/-/0x00 if: 0x3 c: 0x3 proc=0x3ec1e41a68, name=LIBRARY OBJECT LOCK, file=kgl.h LINE:8751, pg=0 LibraryObjectLock: Address=0x3df82faf90 Handle=0x3e5e197d50 Mode=N CanBeBrokenCount=1 Incarnation=1 ExecutionCount=0 User=0x3dc221a830 Session=0x3dc221a830 ReferenceCount=1 Flags=CNB/[0001] SavepointNum=6552c1ec LibraryHandle: Address=0x3e5e197d50 Hash=3584acbe LockMode=N PinMode=0 LoadLockMode=0 Status=VALD ObjectName: Name=select s.VC_INTER_CODE, s.C_MARKET_NO, s.L_BUY_UNIT, s.VC_TRADE_CURRENCY_NO, s.EN_YESTERDAY_CLOSE_PRICE, s.VC_MIXED_TYPE, s.C_STOCK_TYPE, s.C_ASSET_CLASS, s.C_STOP_FLAG, s.VC_REPORT_CODE from trade.TSTOCKINFO s WHERE s.VC_INTER_CODE in ( :1 , :2 , :3 , :4 FullHashValue=dbbe27bcd13925c46366f2fc3584acbe Namespace=SQL AREA(00) Type=CURSOR(00) Identifier=897887422 OwnerIdn=142 Statistics: InvalidationCount=12 ExecutionCount=2804 LoadCount=14 ActiveLocks=1 TotalLockCount=58 TotalPinCount=1 Counters: BrokenCount=1 RevocablePointer=1 KeepDependency=1 Version=0 BucketInUse=57 HandleInUse=57 HandleReferenceCount=0 Concurrency: DependencyMutex=0x3e5e197e00(0, 135, 0, 0) Mutex=0x3e5e197e90(0, 209907, 0, 0) Flags=RON/PIN/TIM/PN0/DBN/[10012841] WaitersLists: Lock=0x3e5e197de0[0x3e5e197de0,0x3e5e197de0] Pin=0x3e5e197dc0[0x3e5e197dc0,0x3e5e197dc0] LoadLock=0x3e5e197e38[0x3e5e197e38,0x3e5e197e38] Timestamp: Current=10-29-2023 08:40:12 HandleReference: Address=0x3e5e198110 Handle=(nil) Flags=[00] LibraryObject: Address=0x3c7e8a0e00 HeapMask=0000-0001-0001-0000 Flags=EXS[0000] Flags2=[0000] PublicFlags=[0000] ChildTable: size='16' Child: id='0' Table=0x3c7e8a1cb0 Reference=0x3c7e8a16f8 Handle=0x3e5e2ea8f0 Child: id='1' Table=0x3c7e8a1cb0 Reference=0x3c7e8972c0 Handle=0x3e5de512f0 NamespaceDump: Parent Cursor: sql_id=66trkzhus9b5y parent=0x3c7e8a0ea0 maxchild=2 plk=y ppn=n |
有两个子游标handle,
- Handle=0x3e5e2ea8f0
- Handle=0x3e5de512f0
查看子游标lc state object:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 |
SO: 0x3df82fa9f8, type: 78, owner: 0x3dc221a830, flag: INIT/-/-/0x00 if: 0x3 c: 0x3 proc=0x3ec1e41a68, name=LIBRARY OBJECT LOCK, file=kgl.h LINE:8751, pg=0 LibraryObjectLock: Address=0x3df82fa9f8 Handle=0x3e5e2ea8f0 Mode=N CanBeBrokenCount=23 Incarnation=23 ExecutionCount=0 ClusterLock=0x3dfe9ca0a8 Context=0x7f59809d8030 User=0x3dc221a830 Session=0x3dc221a830 ReferenceCount=1 Flags=CBK/[0020] SavepointNum=0 LibraryHandle: Address=0x3e5e2ea8f0 Hash=0 LockMode=N PinMode=X LoadLockMode=0 Status=VALD Name: Namespace=SQL AREA(00) Type=CURSOR(00) Statistics: InvalidationCount=11 ExecutionCount=0 LoadCount=13 ActiveLocks=1 TotalLockCount=59 TotalPinCount=2652 Counters: BrokenCount=23 RevocablePointer=23 KeepDependency=0 Version=0 BucketInUse=0 HandleInUse=0 HandleReferenceCount=0 Concurrency: DependencyMutex=0x3e5e2ea9a0(0, 0, 0, 0) Mutex=0x3e5e197e90(0, 209907, 0, 0) Flags=RON/PIN/PN0/EXP/CHD/[10012111] WaitersLists: Lock=0x3e5e2ea980[0x3e5e2ea980,0x3e5e2ea980] Pin=0x3e5e2ea960[0x3e5e2ea960,0x3e5e2ea960] LoadLock=0x3e5e2ea9d8[0x3e5e2ea9d8,0x3e5e2ea9d8] LibraryObject: Address=0x3c7f36d0b0 HeapMask=0000-0001-0001-0000 Flags=EXS[0000] Flags2=[0000] PublicFlags=[0000] DataBlocks: Block: #='0' name=KGLH0^3584acbe pins=0 Change=NONE Heap=0x3e5ce16cb0 Pointer=0x3c7f36d150 Extent=0x3c7f36d030 Flags=I/-/P/A/-/- FreedLocation=0 Alloc=64.843750 Size=67.601562 LoadTime=6923111190 Block: #='6' name=SQLA^3584acbe pins=0 Change=NONE Heap=0x3c7e8a14c8 Pointer=0x3024187ea8 Extent=0x3024187268 Flags=I/-/P/A/-/E FreedLocation=0 Alloc=69385.406250 Size=70401.898438 LoadTime=0 NamespaceDump: Child Cursor: Heap0=0x3c7f36d150 Heap6=0x3024187ea8 Heap0 Load Time=11-14-2023 08:40:13 Heap6 Load Time=11-14-2023 08:40:13 ---------------------------------------- KGX Atomic Operation Log 0x3dfe9ca0a8 Mutex 0x3c7e8a1420(1976, 0) idn 3584acbe oper LONG_EXCL(18) Cursor Pin uid 1976 efd 0 whr 1 slp 0 opr=3 pso=0x3df82fa9f8 flg=0 pcs=0x3c7e8a1388 nxt=0x3c7e897020 flg=34 cld=0 hd=0x3e5e2ea8f0 par=0x3c7e8a0ea0 ct=c hsh=0 unp=(nil) unn=0 hvl=7e897d28 nhv=1 ses=0x3dc221a830 hep=0x3c7e8a1420 flg=80 ld=1 ob=0x3c7f36d0b0 ptr=0x3024187ea8 fex=0x3024187268 |
load时间与故障时间吻合,并且长时间的X模式持有cursor pin mutex和lc pin,这里可以看到heap 6占用了大量的空间,不由得想去看看这个sql的执行计划。
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 |
Plan hash value: 3616047354 ------------------------------------------------------ | Id | Operation | Name | ------------------------------------------------------ | 0 | SELECT STATEMENT | | | 1 | CONCATENATION | | | 2 | TABLE ACCESS BY INDEX ROWID| TSTOCKINFO | |* 3 | INDEX UNIQUE SCAN | PK_TSTOCKINFO | | 4 | TABLE ACCESS BY INDEX ROWID| TSTOCKINFO | |* 5 | INDEX UNIQUE SCAN | PK_TSTOCKINFO | | 6 | TABLE ACCESS BY INDEX ROWID| TSTOCKINFO | |* 7 | INDEX UNIQUE SCAN | PK_TSTOCKINFO | | 8 | TABLE ACCESS BY INDEX ROWID| TSTOCKINFO | |* 9 | INDEX UNIQUE SCAN | PK_TSTOCKINFO | | 10 | TABLE ACCESS BY INDEX ROWID| TSTOCKINFO | |* 11 | INDEX UNIQUE SCAN | PK_TSTOCKINFO | | 12 | TABLE ACCESS BY INDEX ROWID| TSTOCKINFO | |* 13 | INDEX UNIQUE SCAN | PK_TSTOCKINFO | | 14 | TABLE ACCESS BY INDEX ROWID| TSTOCKINFO | ... ... |1995 | INDEX UNIQUE SCAN | PK_TSTOCKINFO | |1996 | TABLE ACCESS BY INDEX ROWID| TSTOCKINFO | |1997 | INDEX UNIQUE SCAN | PK_TSTOCKINFO | |1998 | TABLE ACCESS BY INDEX ROWID| TSTOCKINFO | |1999 | INDEX UNIQUE SCAN | PK_TSTOCKINFO | |2000 | TABLE ACCESS BY INDEX ROWID| TSTOCKINFO | |2001 | INDEX UNIQUE SCAN | PK_TSTOCKINFO | ------------------------------------------------------ |
看了吓一跳,首先优化器选用的RBO,很简单的sql,但是in里面变量达到1000个,正好还走了or展开。1000个变量意味着有1000个or分支,难怪heap 6那么大。
看看blocker 的stack信息:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 |
SO: 0x3ec1e41a68, type: 2, owner: (nil), flag: INIT/-/-/0x00 if: 0x3 c: 0x3 proc=0x3ec1e41a68, name=process, file=ksu.h LINE:12721, pg=0 (process) Oracle pid:178, ser:231, calls cur/top: 0x3d6b429b98/0x3d6b514a00 flags : (0x0) - flags2: (0x0), flags3: (0x10) intr error: 0, call error: 0, sess error: 0, txn error 0 intr queue: empty ksudlp FALSE at location: 0 (post info) last post received: 0 0 160 last post received-location: kcb2.h LINE:4243 ID:kcbzww last process to post me: 0x3ec1e57980 1 0 last post sent: 0 0 0 last post sent-location: No post last process posted by me: none (latch info) wait_event=0 bits=0x0 Process Group: DEFAULT, pseudo proc: 0x3d21ec8180 O/S info: user: oracle, term: UNKNOWN, ospid: 301884 OSD pid info: Unix process pid: 301884, image: oracle@FMS11DG Short stack dump: ksedsts()+465<-ksdxfstk()+32<-ksdxcb()+1927<-sspuser()+112<-__sighandler()<-kghalf()+246<-kghalp()+94<-qcopCreateLog()+54<-qkebCreateUnaryLog()+43<-apaclg()+205<-apaclg()+238<-apaclg()+238<-apaclg()+238<-apaclg()+238<-apaclg()+238<-apaclg()+238<-apaclg()+238<-apaclg()+238<-apaclg()+238<-apaclg()+238<-apaclg()+238<-apaclg()+238<-apaclg()+238<-apaclg()+238<-apaclg()+238<-apaclg()+238<-apaclg()+238<-apaclg()+238<-apaclg()+238<-apaclg()+238<-apaclg()+238<-apaclg()+238<-apaclg()+238<-apaclg()+238<-apaclg()+238<-apaclg()+238<-apaclg()+238<-apaclg()+238<-apaclg()+238<-apaclg()+238<-apaclg()+238<-apaclg()+238<-apaclg()+238<-apaclg()+238<-apaclg()+238<-apaclg()+238<-apaclg()+238<-apaclg()+238<-apaclg()+238<-apaclg()+238<-apaclg()+238<-apaclg()+238<-apaclg()+238<-apaclg()+238<-apaclg()+238<-apaclg()+238<-apaclg()+238<-apaclg()+238<-apaclg()+238<-apaclg()+238<-apaclg()+238<-apaclg()+238<-apaclg()+238<-apaclg()+238<-apaclg()+238<-apaclg()+238<-apaclg()+238<-apaclg()+238<-apaclg()+238<-apaclg()+238<-apaclg()+238<-apaclg()+238<-apaclg()+238<-apaclg()+238<-apaclg()+238<-apaclg()+238<-apaclg()+238<-apaclg()+238<-apaclg()+238<-apaclg()+238<-apaclg()+238<-apaclg()+238<-apaclg()+238<-apaclg()+238<-apaclg()+238<-apaclg()+238<-apaclg()+238<-apaclg()+238<-apaclg()+238<-apaclg()+238<-apaclg()+238<-apaclg()+238<-apaclg()+238<-apaclg()+238<-apaclg()+238<-apaclg()+238<-apaclg()+238<-apaclg()+238<-apaclg()+238<-apaclg()+238<-apaclg()+238<-apaclg()+238<-apaclg()+238<-apaclg()+238<-apaclg()+238<-apaclg()+238<-apaclg()+238<-apaclg()+238<-apaclg()+238<-apaclg()+238<-apaclg()+238<-apaclg()+238<-apaclg()+238<-apaclg()+238<-apaclg()+238<-apaclg()+238<-apaclg()+238<-apaclg()+238<-apaclg()+238<-apaclg()+238<-apaclg()+238<-apaclg()+238<-apaclg()+238<-apaclg()+238<-apaclg()+238<-apaclg()+238<-apaclg()+238<-apaclg()+238<-apaclg()+238<-apaclg()+238<-apaclg()+238<-apaclg()+238<-apaclg()+238<-apaclg()+238<-apaclg()+238<-apaclg()+238<-apaclg()+238<-apaclg()+238<-apaclg()+238<-apaclg()+238<-apaclg()+238<-apaclg()+238<-apaclg()+238<-apaclg()+238<-apaclg()+238<-apaclg()+238<-apaclg()+238<-apaclg()+238<-apaclg()+238<-apaclg()+238<-apaclg()+238<-apaclg()+238<-apaclg()+238<-apaclg()+238<-apaclg()+238<-apaclg()+238<-apaclg()+238<-apaclg()+238<-apaclg()+238<-apaclg()+238<-apaclg()+238<-apaclg()+238<-apaclg()+238<-apaclg()+238<-apaclg()+238<-apaclg()+238<-apaclg()+238<-apaclg()+238<-apaclg()+238<-apaclg()+238<-apaclg()+238<-apaclg()+238<-apaclg()+238<-apaclg()+238<-apaclg()+238<-apasor()+281<-apaqba()+1122<-apaqbdDescendents()+496<-apaqbdList()+76<-apaqbd()+14<-apadrv()+922<-opitca()+2089<-kksFullTypeCheck()+69<-rpiswu2()+1776<-kksSetBindType()+2299<-kksfbc()+11192<-opiexe()+2330<-kpoal8()+2380<-opiodr()+917<-ttcpip()+1255<-opitsk()+1710<-opiino()+969<-opiodr()+917<-opidrv()+570<-sou2o()+103<-opimai_real()+133<-ssthrdmain()+265<-main()+201<-__libc_start_main()+245 ---------------------------------------- |
大量的循环了apaclg函数。 (apa)clg – SQL Access Path Analysis ??
分析到此基本可以给出建议了。3个建议同时调整:
- 减少in变量为100个
- 使用all_rows hint让执行计划走inlist
- keep sql执行计划