案例:troubleshooting gc buffer busy acquire(二)

本案例来自西区某电力客户的3节点rac一体机,频繁的出现业务性能缓慢的问题。

awr基本信息如下:

Host Name Platform CPUs Cores Sockets Memory (GB)
xxxxx Linux x86 64-bit 192 96 8 504.63
Snap Id Snap Time Sessions Cursors/Session Instances
Begin Snap: 62201 02-Apr-22 08:00:08 1824 5.6 3
End Snap: 62205 02-Apr-22 10:00:07 2963 6.5 3
Elapsed: 119.99 (mins)
DB Time: 45,761.81 (mins)

Top 10 Foreground Events by Total Wait Time

 

Event Waits Total Wait Time (sec) Wait Avg(ms) % DB time Wait Class
gc buffer busy acquire 13,897,498 589.2K 42 21.5 Cluster
enq: US – contention 175,549 425K 2421 15.5 Other
gc cr grant 2-way 14,038,191 337.2K 24 12.3 Cluster
enq: TX – index contention 87,093 172.4K 1979 6.3 Concurrency
row cache lock 473,144 157K 332 5.7 Concurrency
gc cr multi block request 1,855,834 138.6K 75 5.0 Cluster
DB CPU 133.2K 4.9
enq: TA – contention 968 111.9K 115571 4.1 Other
buffer busy waits 259,371 103.3K 398 3.8 Concurrency
gc buffer busy release 210,472 100.3K 476 3.7 Cluster

可以看到大量的gc相关event等待,并且延迟也很高。通常分析gc问题我都习惯于去看看awr的RAC Statistics

Global Cache Load Profile

Per Second Per Transaction
Global Cache blocks received: 706.43 17.77
Global Cache blocks served: 895.22 22.52
GCS/GES messages received: 12,705.21 319.57
GCS/GES messages sent: 15,198.13 382.27
DBWR Fusion writes: 43.44 1.09
Estd Interconnect traffic (KB) 18,263.06

Global Cache Efficiency Percentages (Target local+remote 100%)

Buffer access – local cache %: 99.10
Buffer access – remote cache %: 0.07
Buffer access – disk %: 0.83

Global Cache and Enqueue Services – Workload Characteristics

Avg global enqueue get time (ms): 19.9
Avg global cache cr block receive time (ms): 25.2
Avg global cache current block receive time (ms): 71.3
Avg global cache cr block build time (ms): 0.0
Avg global cache cr block send time (ms): 0.0
Global cache log flushes for cr blocks served %: 7.4
Avg global cache cr block flush time (ms): 2.5
Avg global cache current block pin time (ms): 1.7
Avg global cache current block send time (ms): 0.0
Global cache log flushes for current blocks served %: 6.0
Avg global cache current block flush time (ms): 2.8

Global Cache and Enqueue Services – Messaging Statistics

Avg message sent queue time (ms): 0.0
Avg message sent queue time on ksxp (ms): 111.0
Avg message received queue time (ms): 0.0
Avg GCS message process time (ms): 0.0
Avg GES message process time (ms): 0.1
% of direct sent messages: 33.27
% of indirect sent messages: 62.49
% of flow controlled messages: 4.24

私网流量18M/s,对于一体机的ib来说并不算什么,看到Avg message sent queue time on ksxp (ms)延迟非常高,awr就基本不用看了可以定位私网性能出现了严重的问题。

Avg message sent queue time on ksxp The time when the peer receives the message and returns ACK. This indicator is very important and directly reflects the network delay, generally less than 1ms

Interconnect Device Statistics

  • Throughput and errors of interconnect devices (at OS level)
  • All throughput numbers are megabytes per second
Device Name IP Address Public Source Send Mbytes/sec Send Errors Send Dropped Send Buffer Overrun Send Carrier Lost Receive Mbytes/sec Receive Errors Receive Dropped Receive Buffer Overrun Receive Frame Errors
bond0 10.xxx.xxx.x OS dependent software 16.28 0 0 0 0 12.75 0 0 37,782 0

通过awr私网设备的统计信息可以看到Receive Buffer Overrun很高,本来以为又是udp receive buffer不足的问题,正准备结案时,再仔细一看发现了一个奇怪的现象,db层面的私网设备居然是bond0,并不是我们正常的一体机的ib,ip也不是私网ip,而是public network。

这是怎么回事呢?

继续排查发现,集群层面通信使用的都是ib网卡上的haip,都是正确的,但是db实例使用的是public network,在排除db实例设置了参数cluster_interconnects之后,可以去看看alert日志,当db实例启动时会输出私网的ip和协议,从中我们发现故事得从20年的8月29日说起。。。

可以看到20年8月29日短时间内重启了两次db实例,第二次启动时,私网的ip发生了改变。继续分析

看到了一段核心错误,无法初始化gpnp从而导致从gpnp profile中获取公网私网信息失败。

在查看db的patch时,发现db打patch的时间与私网第一次发生异常的时间点非常吻合

查阅mos文档,有一篇比较吻合,11.2.0.4 RAC Database Instance Fails to Start with “No connectivity to other instances in the cluster during startup” After Applying 11.2.0.4 DB PSU (Doc ID 2471441.1)

很巧,本案例也是只打了db patch,并没有打gi的patch。

在实际的生产维护中,经常会看到gi和db psu不一致、gi和db版本不一致等等奇葩操作,强烈建议使用同版本同patch的gi和db软件,这会减少很多不必要的故障和麻烦。

此条目发表在Oracle, Oracle troubleshooting分类目录,贴了, 标签。将固定链接加入收藏夹。

发表回复

您的电子邮箱地址不会被公开。 必填项已用*标注