本案例来自东区某客户,数据库版本rac 11.2.0.4。有一条并发很高的sql,不定时的会产生大量的latch free,sql执行时间比正常时段有大概100倍的性能衰减。
由于客户业务比较特殊,sql分析的信息、awr都无法取出,这里只能用文字来描述。关键点如下:
- 客户使用的优化器模式为RBO,所有的表和索引均无统计信息。
- 首先分析latch free,可以通过P2关联v$latch.latch#来匹配具体是什么latch,通过排查看到的当时产生争用的具体latch为multiblock read objects。
- 查看具体sql的执行计划发现执行计划中只有简单INDEX RANGE SCAN +TABLE ACCESS BY INDEX ROWID
- 虽然使用的rbo,但是sql执行计划存在cost,说明仍然使用的CBO,原因是sql中使用了+index的hint
multiblock read objects顾名思义应该就是当对对象进行多块读请求的latch,但是执行计划并不涉及TFS和IFFS,都是单块读的操作,为何会请求这个latch呢?
思前想后只有2个可能性:
- oracle的preread特性
- 动态采样
关于oracle的preread特性在之前的文章有写过,这里就不描述了,基本可以排除这个原因,因为一个长期运行并未重启的系统,并且buffer cache基本没啥变化,preread基本是不会工作的。
那么很可能就是动态采样了,我们知道当使用rbo时是不会进行动态采样的。但是如果rbo的sql使用了hint其实就是使用CBO了,当表没有统计信息时,就会进行动态采样。这一点可以通过10046去确认。
如何确认是动态采样导致的呢?我采用了gdb+systemtap去跟踪函数的方式,虽然有点麻烦,但是我觉得唯有这样能非常明确的证明是否是这个问题导致的。(主要还是客户对技术比较执着,不然不太容易说服客户)
测试环境模拟以及进行测试的sql都与实际情况一致:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 |
SQL> @sp opt%mode -- show parameter by sp NAME TYPE VALUE ------------------------------------ ---------- ------------------------------ _optimizer_cost_model string CHOOSE _optimizer_mode_force boolean TRUE optimizer_mode string RULE SQL> select blocks,num_rows from dba_tables where owner='TEST' and table_name='T'; BLOCKS NUM_ROWS ---------- ---------- SQL> explain plan for select /*+index(IDX_2)*/ * from test.t where object_id=100; Explained. SQL> select * from table(dbms_xplan.display); PLAN_TABLE_OUTPUT -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- Plan hash value: 2629532460 ------------------------------------------------------------------------------------- | Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | ------------------------------------------------------------------------------------- | 0 | SELECT STATEMENT | | 14 | 2898 | 3 (0)| 00:00:01 | | 1 | TABLE ACCESS BY INDEX ROWID| T | 14 | 2898 | 3 (0)| 00:00:01 | |* 2 | INDEX RANGE SCAN | IDX_2 | 14 | | 1 (0)| 00:00:01 | ------------------------------------------------------------------------------------- Predicate Information (identified by operation id): --------------------------------------------------- 2 - access("OBJECT_ID"=100) Note ----- - dynamic sampling used for this statement (level=2) |
测试动态采样时:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 |
session 1: SQL> select latch#,name,addr from v$latch where name='multiblock read objects'; LATCH# NAME ADDR ---------- ---------------------------------------- ---------------- 179 multiblock read objects 000000006001D768 SQL> oradebug setmypid; Statement processed. SQL> oradebug call kslgetl 0x000000006001D768 1 Function returned 1 session 2: SQL> select /*+index(IDX_2)*/* from test.t where object_id=100000; --hang session 3: SQL> select event,p1,p2,p3 from v$session where program like '%sqlplus%'; EVENT P1 P2 P3 ---------------------------------------- ---------- ---------- ---------- latch free 1610733416 179 0 |
当使用动态采样时,通过gdb或者short_stack可以观察到具体的函数调用
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 |
(gdb) bt #0 0x00007ff4e75f8307 in semop () from /lib64/libc.so.6 #1 0x00000000097f67d3 in sskgpwwait () #2 0x00000000097f5592 in skgpwwait () #3 0x00000000093e6ba5 in kslges () #4 0x00000000093e653e in kslgetl () --kernel service latching and post-wait get latch #5 0x0000000000e6efa8 in kcbz_mbr_get () #6 0x0000000000e622d3 in kcbzibmlt () --kernel cache buffers buffer handling input buffer (reads a block from disk into a buffer) for multiblock read #7 0x0000000000e6081f in kcbzib () --kernel cache buffers buffer handling input buffer (reads a block from disk into a buffer) #8 0x00000000094a260a in kcbgtcr () --kernel cache buffers get consistent read #9 0x0000000009508458 in ktrget3 () #10 0x0000000009507afe in ktrget2 () #11 0x000000000946a941 in kdst_fetch () #12 0x0000000000c77237 in kdstf00011010000km () #13 0x0000000009450f9f in kdsttgr () #14 0x00000000096faa5d in qertbFetch () #15 0x0000000002553e17 in qergsFetch () #16 0x00000000095a233a in opifch2 () #17 0x00000000095a6f3c in opiall0 () #18 0x00000000095b6958 in opikpr () |
其中本案例相关的关键函数如下:
- kslgetl:kernel service latching and post-wait get latch
- kcbzibmlt :kernel cache buffers buffer handling input buffer (reads a block from disk into a buffer) for multiblock read
- kcbzib :kernel cache buffers buffer handling input buffer (reads a block from disk into a buffer)
- kcbgtcr :kernel cache buffers get consistent read
systemtap更加细致的观察到该sql执行过程中调用的所有函数:
手动释放掉刚才持有的latch之后重新执行sql,并开启systemtap跟踪。
systemtap脚本
1 2 3 4 |
probe process("/oracle/app/oracle/product/11.2.0/dbhome_1/bin/oracle").function("*") { printf("%d\t%s: %s %x\n", pid(), execname(), ppfunc(), int_arg(1)); } |
运行stp脚本之后执行sql
1 |
select /*+index(IDX_2)*/* from test.t where object_id=100000; |
systemtap输出:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 |
3157 oracle: kcbzib ffffffff9e6cf308 3157 oracle: kslgetl ffffffff9e74ce08 3157 oracle: ksl_get_shared_latch ffffffff9e50ffd8 3157 oracle: ksl_level_check 1 3157 oracle: kslfre ffffffff9e50ffd8 3157 oracle: ksl_get_shared_latch ffffffff9e753d88 3157 oracle: kslfre ffffffff9e753d88 3157 oracle: kslfre ffffffff9e74ce08 3157 oracle: kslfre ffffffff9e58a510 3157 oracle: kcbzibmlt 667e5168 3157 oracle: kcbz_mbr_get ffffffff9e6cf308 3157 oracle: kslgetl 6001d768 3157 oracle: kslfre 6001d768 。。。 SQL> select latch#,name,addr from v$latch where name='multiblock read objects'; LATCH# NAME ADDR ---------- ---------------------------------------- ---------------- 179 multiblock read objects 000000006001D768 |
systemtap可以看到通过kslgetl函数请求的latch地址为6001d768,正好就是multiblock read objects。
当sql级别禁用动态采样时:
flush buffer cache之后执行sql:
1 2 3 4 5 |
SQL> alter system flush buffer_cache; System altered. SQL> select /*+DYNAMIC_SAMPLING(0) INDEX(IDX_2)*/ * from test.t where object_id=10000; |
1 2 3 4 5 6 7 8 9 10 |
[root@rac1 oracle]# stap -v o.stp -x 3157 --tmpdir=/db/tmp >/db/tmp/o.out Pass 1: parsed user script and 479 library scripts using 279376virt/76460res/3504shr/73160data kb, in 240usr/550sys/1012real ms. Pass 2: analyzed script: 165817 probes, 5 functions, 4 embeds, 4 globals using 1259352virt/1051972res/576shr/1053136data kb, in 57700usr/2030sys/61485real ms. Pass 3: using cached /root/.systemtap/cache/2c/stap_2c17b2912440445c7044a9a3069feb0b_114485619.c Pass 4: using cached /root/.systemtap/cache/2c/stap_2c17b2912440445c7044a9a3069feb0b_114485619.ko Pass 5: starting run. ^C^Z [4]+ Stopped stap -v o.stp -x 3157 --tmpdir=/db/tmp > /db/tmp/o.out [root@rac1 oracle]# grep -i 6001d768 /db/tmp/o.out [root@rac1 oracle]# |
并未发现该latch的请求。所以解决方案就是再加一个hint DYNAMIC_SAMPLING(0)禁用该sql的动态采样。