troubleshooting many latch free(multiblock read objects) wait when index range scan

本案例来自东区某客户，数据库版本rac 11.2.0.4。有一条并发很高的sql，不定时的会产生大量的latch free，sql执行时间比正常时段有大概100倍的性能衰减。

由于客户业务比较特殊，sql分析的信息、awr都无法取出，这里只能用文字来描述。关键点如下：

客户使用的优化器模式为RBO，所有的表和索引均无统计信息。
首先分析latch free，可以通过P2关联v$latch.latch#来匹配具体是什么latch，通过排查看到的当时产生争用的具体latch为multiblock read objects。
查看具体sql的执行计划发现执行计划中只有简单INDEX RANGE SCAN +TABLE ACCESS BY INDEX ROWID
虽然使用的rbo，但是sql执行计划存在cost，说明仍然使用的CBO，原因是sql中使用了+index的hint

multiblock read objects顾名思义应该就是当对对象进行多块读请求的latch，但是执行计划并不涉及TFS和IFFS，都是单块读的操作，为何会请求这个latch呢？

思前想后只有2个可能性：

oracle的preread特性
动态采样

关于oracle的preread特性在之前的文章有写过，这里就不描述了，基本可以排除这个原因，因为一个长期运行并未重启的系统，并且buffer cache基本没啥变化，preread基本是不会工作的。

案例：index range scan真的不会多块读吗？

那么很可能就是动态采样了，我们知道当使用rbo时是不会进行动态采样的。但是如果rbo的sql使用了hint其实就是使用CBO了，当表没有统计信息时，就会进行动态采样。这一点可以通过10046去确认。

如何确认是动态采样导致的呢？我采用了gdb+systemtap去跟踪函数的方式，虽然有点麻烦，但是我觉得唯有这样能非常明确的证明是否是这个问题导致的。(主要还是客户对技术比较执着，不然不太容易说服客户)

测试环境模拟以及进行测试的sql都与实际情况一致：

SQL> @sp opt%mode
-- show parameter by sp
NAME                                 TYPE       VALUE
------------------------------------ ---------- ------------------------------
_optimizer_cost_model                string     CHOOSE
_optimizer_mode_force                boolean    TRUE
optimizer_mode                       string     RULE
SQL> select blocks,num_rows from dba_tables where owner='TEST' and table_name='T';
BLOCKS   NUM_ROWS
---------- ----------
SQL> explain plan for select /*+index(IDX_2)*/ * from test.t where object_id=100;
Explained.
SQL> select * from table(dbms_xplan.display);
PLAN_TABLE_OUTPUT
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Plan hash value: 2629532460
-------------------------------------------------------------------------------------
| Id  | Operation                   | Name  | Rows  | Bytes | Cost (%CPU)| Time     |
-------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |       |    14 |  2898 |     3   (0)| 00:00:01 |
|   1 |  TABLE ACCESS BY INDEX ROWID| T     |    14 |  2898 |     3   (0)| 00:00:01 |
|*  2 |   INDEX RANGE SCAN          | IDX_2 |    14 |       |     1   (0)| 00:00:01 |
-------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
2 - access("OBJECT_ID"=100)
Note
-----
- dynamic sampling used for this statement (level=2)

SQL> @sp opt%mode

-- show parameter by sp

NAME TYPE VALUE

------------------------------------ ---------- ------------------------------

_optimizer_cost_model string CHOOSE

_optimizer_mode_force boolean TRUE

optimizer_mode string RULE

SQL> select blocks,num_rows from dba_tables where owner='TEST' and table_name='T';

BLOCKS NUM_ROWS

---------- ----------

SQL> explain plan for select /*+index(IDX_2)*/ * from test.t where object_id=100;

Explained.

SQL> select * from table(dbms_xplan.display);

PLAN_TABLE_OUTPUT

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Plan hash value: 2629532460

-------------------------------------------------------------------------------------

-------------------------------------------------------------------------------------

| 0 | SELECT STATEMENT | | 14 | 2898 | 3 (0)| 00:00:01 |

| 1 | TABLE ACCESS BY INDEX ROWID| T | 14 | 2898 | 3 (0)| 00:00:01 |

|* 2 | INDEX RANGE SCAN | IDX_2 | 14 | | 1 (0)| 00:00:01 |

-------------------------------------------------------------------------------------

Predicate Information (identified by operation id):

---------------------------------------------------

2 - access("OBJECT_ID"=100)

Note

-----

- dynamic sampling used for this statement (level=2)

测试动态采样时：

session 1:
SQL> select latch#,name,addr from v$latch where name='multiblock read objects';
LATCH# NAME                                     ADDR
---------- ---------------------------------------- ----------------
179 multiblock read objects                  000000006001D768
SQL> oradebug setmypid;
Statement processed.
SQL> oradebug  call kslgetl 0x000000006001D768 1
Function returned 1
session 2：
SQL> select /*+index(IDX_2)*/* from test.t where object_id=100000;
--hang
session 3:
SQL> select event,p1,p2,p3 from v$session where program like '%sqlplus%';
EVENT                                            P1         P2         P3
---------------------------------------- ---------- ---------- ----------
latch free                               1610733416        179          0

session 1:

SQL> select latch#,name,addr from v$latch where name='multiblock read objects';

LATCH# NAME ADDR

---------- ---------------------------------------- ----------------

179 multiblock read objects 000000006001D768

SQL> oradebug setmypid;

Statement processed.

SQL> oradebug call kslgetl 0x000000006001D768 1

Function returned 1

session 2：

SQL> select /*+index(IDX_2)*/* from test.t where object_id=100000;

--hang

session 3:

SQL> select event,p1,p2,p3 from v$session where program like '%sqlplus%';

EVENT P1 P2 P3

---------------------------------------- ---------- ---------- ----------

latch free 1610733416 179 0

当使用动态采样时，通过gdb或者short_stack可以观察到具体的函数调用

(gdb) bt
#0  0x00007ff4e75f8307 in semop () from /lib64/libc.so.6
#1  0x00000000097f67d3 in sskgpwwait ()
#2  0x00000000097f5592 in skgpwwait ()
#3  0x00000000093e6ba5 in kslges ()
#4  0x00000000093e653e in kslgetl () --kernel service latching and post-wait get latch
#5  0x0000000000e6efa8 in kcbz_mbr_get ()
#6  0x0000000000e622d3 in kcbzibmlt () --kernel cache buffers buffer handling input buffer (reads a block from disk into a buffer) for multiblock read
#7  0x0000000000e6081f in kcbzib () --kernel cache buffers buffer handling input buffer (reads a block from disk into a buffer)
#8  0x00000000094a260a in kcbgtcr () --kernel cache buffers get consistent read
#9  0x0000000009508458 in ktrget3 ()
#10 0x0000000009507afe in ktrget2 ()
#11 0x000000000946a941 in kdst_fetch ()
#12 0x0000000000c77237 in kdstf00011010000km ()
#13 0x0000000009450f9f in kdsttgr ()
#14 0x00000000096faa5d in qertbFetch ()
#15 0x0000000002553e17 in qergsFetch ()
#16 0x00000000095a233a in opifch2 ()
#17 0x00000000095a6f3c in opiall0 ()
#18 0x00000000095b6958 in opikpr ()

(gdb) bt

#0 0x00007ff4e75f8307 in semop () from /lib64/libc.so.6

#1 0x00000000097f67d3 in sskgpwwait ()

#2 0x00000000097f5592 in skgpwwait ()

#3 0x00000000093e6ba5 in kslges ()

#4 0x00000000093e653e in kslgetl () --kernel service latching and post-wait get latch

#5 0x0000000000e6efa8 in kcbz_mbr_get ()

#6 0x0000000000e622d3 in kcbzibmlt () --kernel cache buffers buffer handling input buffer (reads a block from disk into a buffer) for multiblock read

#7 0x0000000000e6081f in kcbzib () --kernel cache buffers buffer handling input buffer (reads a block from disk into a buffer)

#8 0x00000000094a260a in kcbgtcr () --kernel cache buffers get consistent read

#9 0x0000000009508458 in ktrget3 ()

#10 0x0000000009507afe in ktrget2 ()

#11 0x000000000946a941 in kdst_fetch ()

#12 0x0000000000c77237 in kdstf00011010000km ()

#13 0x0000000009450f9f in kdsttgr ()

#14 0x00000000096faa5d in qertbFetch ()

#15 0x0000000002553e17 in qergsFetch ()

#16 0x00000000095a233a in opifch2 ()

#17 0x00000000095a6f3c in opiall0 ()

#18 0x00000000095b6958 in opikpr ()

其中本案例相关的关键函数如下：

kslgetl：kernel service latching and post-wait get latch
kcbzibmlt ：kernel cache buffers buffer handling input buffer (reads a block from disk into a buffer) for multiblock read
kcbzib ：kernel cache buffers buffer handling input buffer (reads a block from disk into a buffer)
kcbgtcr :kernel cache buffers get consistent read

systemtap更加细致的观察到该sql执行过程中调用的所有函数：

手动释放掉刚才持有的latch之后重新执行sql，并开启systemtap跟踪。

systemtap脚本

probe process("/oracle/app/oracle/product/11.2.0/dbhome_1/bin/oracle").function("*")
{
printf("%d\t%s: %s %x\n", pid(), execname(), ppfunc(), int_arg(1));
}

probe process("/oracle/app/oracle/product/11.2.0/dbhome_1/bin/oracle").function("*")

{

printf("%d\t%s: %s %x\n", pid(), execname(), ppfunc(), int_arg(1));

}

运行stp脚本之后执行sql

select /*+index(IDX_2)*/* from test.t where object_id=100000;

1	select /+index(IDX_2)/* from test.t where object_id=100000;

systemtap输出：

3157	oracle: kcbzib ffffffff9e6cf308
3157	oracle: kslgetl ffffffff9e74ce08
3157	oracle: ksl_get_shared_latch ffffffff9e50ffd8
3157	oracle: ksl_level_check 1
3157	oracle: kslfre ffffffff9e50ffd8
3157	oracle: ksl_get_shared_latch ffffffff9e753d88
3157	oracle: kslfre ffffffff9e753d88
3157	oracle: kslfre ffffffff9e74ce08
3157	oracle: kslfre ffffffff9e58a510
3157	oracle: kcbzibmlt 667e5168
3157	oracle: kcbz_mbr_get ffffffff9e6cf308
3157	oracle: kslgetl 6001d768
3157	oracle: kslfre 6001d768
。。。
SQL> select latch#,name,addr from v$latch where name='multiblock read objects';
LATCH# NAME                                     ADDR
---------- ---------------------------------------- ----------------
179 multiblock read objects                  000000006001D768

3157 oracle: kcbzib ffffffff9e6cf308

3157 oracle: kslgetl ffffffff9e74ce08

3157 oracle: ksl_get_shared_latch ffffffff9e50ffd8

3157 oracle: ksl_level_check 1

3157 oracle: kslfre ffffffff9e50ffd8

3157 oracle: ksl_get_shared_latch ffffffff9e753d88

3157 oracle: kslfre ffffffff9e753d88

3157 oracle: kslfre ffffffff9e74ce08

3157 oracle: kslfre ffffffff9e58a510

3157 oracle: kcbzibmlt 667e5168

3157 oracle: kcbz_mbr_get ffffffff9e6cf308

3157 oracle: kslgetl 6001d768

3157 oracle: kslfre 6001d768

。。。

SQL> select latch#,name,addr from v$latch where name='multiblock read objects';

LATCH# NAME ADDR

---------- ---------------------------------------- ----------------

179 multiblock read objects 000000006001D768

systemtap可以看到通过kslgetl函数请求的latch地址为6001d768，正好就是multiblock read objects。

当sql级别禁用动态采样时：

flush buffer cache之后执行sql：

SQL> alter system flush buffer_cache;
System altered.
SQL> select /*+DYNAMIC_SAMPLING(0) INDEX(IDX_2)*/ * from test.t where object_id=10000;

SQL> alter system flush buffer_cache;

System altered.

SQL> select /*+DYNAMIC_SAMPLING(0) INDEX(IDX_2)*/ * from test.t where object_id=10000;

[root@rac1 oracle]# stap -v o.stp -x 3157 --tmpdir=/db/tmp >/db/tmp/o.out
Pass 1: parsed user script and 479 library scripts using 279376virt/76460res/3504shr/73160data kb, in 240usr/550sys/1012real ms.
Pass 2: analyzed script: 165817 probes, 5 functions, 4 embeds, 4 globals using 1259352virt/1051972res/576shr/1053136data kb, in 57700usr/2030sys/61485real ms.
Pass 3: using cached /root/.systemtap/cache/2c/stap_2c17b2912440445c7044a9a3069feb0b_114485619.c
Pass 4: using cached /root/.systemtap/cache/2c/stap_2c17b2912440445c7044a9a3069feb0b_114485619.ko
Pass 5: starting run.
^C^Z
[4]+  Stopped                 stap -v o.stp -x 3157 --tmpdir=/db/tmp > /db/tmp/o.out
[root@rac1 oracle]# grep -i 6001d768 /db/tmp/o.out
[root@rac1 oracle]#

[root@rac1 oracle]# stap -v o.stp -x 3157 --tmpdir=/db/tmp >/db/tmp/o.out

Pass 1: parsed user script and 479 library scripts using 279376virt/76460res/3504shr/73160data kb, in 240usr/550sys/1012real ms.

Pass 2: analyzed script: 165817 probes, 5 functions, 4 embeds, 4 globals using 1259352virt/1051972res/576shr/1053136data kb, in 57700usr/2030sys/61485real ms.

Pass 3: using cached /root/.systemtap/cache/2c/stap_2c17b2912440445c7044a9a3069feb0b_114485619.c

Pass 4: using cached /root/.systemtap/cache/2c/stap_2c17b2912440445c7044a9a3069feb0b_114485619.ko

Pass 5: starting run.

^C^Z

[4]+ Stopped stap -v o.stp -x 3157 --tmpdir=/db/tmp > /db/tmp/o.out

[root@rac1 oracle]# grep -i 6001d768 /db/tmp/o.out

[root@rac1 oracle]#

并未发现该latch的请求。所以解决方案就是再加一个hint DYNAMIC_SAMPLING(0)禁用该sql的动态采样。

troubleshooting many latch free(multiblock read objects) wait when index range scan

发表回复取消回复

近期文章

近期评论

troubleshooting many latch free(multiblock read objects) wait when index range scan

发表回复 取消回复

近期文章

近期评论

发表回复取消回复