This case comes from a power-utility customer in the western region running a 3-node RAC on an engineered system, which was frequently hitting slow application performance.
The basic AWR information is as follows:
Host Name | Platform | CPUs | Cores | Sockets | Memory (GB) |
---|---|---|---|---|---|
xxxxx | Linux x86 64-bit | 192 | 96 | 8 | 504.63 |
 | Snap Id | Snap Time | Sessions | Cursors/Session | Instances |
---|---|---|---|---|---|
Begin Snap: | 62201 | 02-Apr-22 08:00:08 | 1824 | 5.6 | 3 |
End Snap: | 62205 | 02-Apr-22 10:00:07 | 2963 | 6.5 | 3 |
Elapsed: | | 119.99 (mins) | | | |
DB Time: | | 45,761.81 (mins) | | | |
Top 10 Foreground Events by Total Wait Time
Event | Waits | Total Wait Time (sec) | Wait Avg(ms) | % DB time | Wait Class |
---|---|---|---|---|---|
gc buffer busy acquire | 13,897,498 | 589.2K | 42 | 21.5 | Cluster |
enq: US - contention | 175,549 | 425K | 2421 | 15.5 | Other |
gc cr grant 2-way | 14,038,191 | 337.2K | 24 | 12.3 | Cluster |
enq: TX - index contention | 87,093 | 172.4K | 1979 | 6.3 | Concurrency |
row cache lock | 473,144 | 157K | 332 | 5.7 | Concurrency |
gc cr multi block request | 1,855,834 | 138.6K | 75 | 5.0 | Cluster |
DB CPU | | 133.2K | | 4.9 | |
enq: TA - contention | 968 | 111.9K | 115571 | 4.1 | Other |
buffer busy waits | 259,371 | 103.3K | 398 | 3.8 | Concurrency |
gc buffer busy release | 210,472 | 100.3K | 476 | 3.7 | Cluster |
A large number of gc-related wait events is visible, and their latencies are high. When analyzing gc problems, my habit is to go look at the RAC Statistics sections of the AWR report.
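As a quick live-system complement to the AWR Top 10 list above, a query along the following lines pulls the same Cluster-class waits straight from the instances (a minimal sketch, assuming OS authentication as SYSDBA on one of the RAC nodes):

```bash
sqlplus -s / as sysdba <<'EOF'
SET LINESIZE 200 PAGESIZE 100
-- Cluster-class wait events across all instances, worst first
SELECT inst_id, event, total_waits,
       ROUND(time_waited_micro / 1e6)                              AS time_waited_s,
       ROUND(time_waited_micro / NULLIF(total_waits, 0) / 1000, 1) AS avg_ms
  FROM gv$system_event
 WHERE wait_class = 'Cluster'
 ORDER BY time_waited_micro DESC;
EOF
```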
Global Cache Load Profile
 | Per Second | Per Transaction |
---|---|---|
Global Cache blocks received: | 706.43 | 17.77 |
Global Cache blocks served: | 895.22 | 22.52 |
GCS/GES messages received: | 12,705.21 | 319.57 |
GCS/GES messages sent: | 15,198.13 | 382.27 |
DBWR Fusion writes: | 43.44 | 1.09 |
Estd Interconnect traffic (KB) | 18,263.06 |
Global Cache Efficiency Percentages (Target local+remote 100%)
Buffer access - local cache %: | 99.10 |
Buffer access - remote cache %: | 0.07 |
Buffer access - disk %: | 0.83 |
Global Cache and Enqueue Services - Workload Characteristics
Avg global enqueue get time (ms): | 19.9 |
Avg global cache cr block receive time (ms): | 25.2 |
Avg global cache current block receive time (ms): | 71.3 |
Avg global cache cr block build time (ms): | 0.0 |
Avg global cache cr block send time (ms): | 0.0 |
Global cache log flushes for cr blocks served %: | 7.4 |
Avg global cache cr block flush time (ms): | 2.5 |
Avg global cache current block pin time (ms): | 1.7 |
Avg global cache current block send time (ms): | 0.0 |
Global cache log flushes for current blocks served %: | 6.0 |
Avg global cache current block flush time (ms): | 2.8 |
Global Cache and Enqueue Services - Messaging Statistics
Avg message sent queue time (ms): | 0.0 |
Avg message sent queue time on ksxp (ms): | 111.0 |
Avg message received queue time (ms): | 0.0 |
Avg GCS message process time (ms): | 0.0 |
Avg GES message process time (ms): | 0.1 |
% of direct sent messages: | 33.27 |
% of indirect sent messages: | 62.49 |
% of flow controlled messages: | 4.24 |
Interconnect traffic is about 18 MB/s, which is nothing for an engineered system's InfiniBand. But Avg message sent queue time on ksxp (ms) is extremely high; at that point there is little need to read further into the AWR, since we can already place a severe performance problem on the private network.
Avg message sent queue time on ksxp is the time from sending a message until the peer receives it and returns an ACK. This indicator is very important: it directly reflects network latency, and it should generally be below 1 ms.
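Because this metric is essentially a network round trip, a crude OS-level cross-check needs nothing from Oracle (a sketch: the interface name and the peer-address placeholder must be replaced with this environment's actual values; a healthy interconnect should answer in well under 1 ms):

```bash
# 8 KB payload roughly matches one default-sized block transfer
ping -c 10 -I bond0 -s 8192 <peer-interconnect-ip>
```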
Interconnect Device Statistics
- Throughput and errors of interconnect devices (at OS level)
- All throughput numbers are megabytes per second
Device Name | IP Address | Public | Source | Send Mbytes/sec | Send Errors | Send Dropped | Send Buffer Overrun | Send Carrier Lost | Receive Mbytes/sec | Receive Errors | Receive Dropped | Receive Buffer Overrun | Receive Frame Errors |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
bond0 | 10.xxx.xxx.x | | OS dependent software | 16.28 | 0 | 0 | 0 | 0 | 12.75 | 0 | 0 | 37,782 | 0 |
The AWR interconnect device statistics show a very high Receive Buffer Overrun. I first took this for the usual insufficient-UDP-receive-buffer problem (the quick OS-level check for that is sketched below) and was about to close the case, but a second look revealed something odd: at the DB level the interconnect device is actually bond0, not the engineered system's usual IB interface, and the IP is not a private address but one on the public network.
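For the record, the quick OS-level check behind that first theory is standard Linux tooling, nothing site-specific (a sketch):

```bash
# Kernel UDP receive buffer limits (compare against Oracle's documented minimums)
sysctl net.core.rmem_default net.core.rmem_max
# OS-wide UDP counters; steadily growing error/drop counts would back the theory
netstat -su | grep -i error
```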
How did that happen?
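Which interfaces each layer believes it owns can be listed directly; the findings in the next paragraph come from checks along these lines (a sketch, assuming SYSDBA access and the GI owner's environment):

```bash
# What each DB instance is actually using for its interconnect
sqlplus -s / as sysdba <<'EOF'
SELECT inst_id, name, ip_address, is_public, source
  FROM gv$cluster_interconnects;
-- Rule out an explicit instance-level override
SELECT inst_id, value
  FROM gv$parameter
 WHERE name = 'cluster_interconnects';
EOF
# What the clusterware has registered (public vs cluster_interconnect)
oifcfg getif
```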
Further investigation found that cluster-level communication uses the HAIPs on the IB interfaces, all of which are correct; it is only the DB instances that use the public network. After ruling out the cluster_interconnects parameter being set on the DB instances, the next stop is the alert log: when a DB instance starts, it prints the interconnect IP(s) and protocol it will use. There we discovered that the story begins on August 29, 2020...
```
Sat Aug 29 00:42:55 2020
Cluster communication is configured to use the following interface(s) for this instance
  169.254.104.231
  169.254.209.215
cluster interconnect IPC version:Oracle UDP/IP (generic)
Sat Aug 29 01:06:19 2020
Cluster communication is configured to use the following interface(s) for this instance
  10.185.241.176
cluster interconnect IPC version:Oracle UDP/IP (generic)
```
So the instance was restarted twice within a short window on August 29, 2020, and on the second startup the interconnect IP changed. Continuing the analysis:
```
2020-08-29 01:06:12.798:
[USER(74999)]CRS-2316:Fatal error: cannot initialize GPnP, CLSGPNP_ERR (Generic GPnP error).
kggpnpInit: failed to init gpnp
WARNING: No cluster interconnect has been specified. Depending on
the communication driver configured Oracle cluster traffic
may be directed to the public interface of this machine.
Oracle recommends that RAC clustered databases be configured
with a private interconnect for enhanced security and
performance.
```
Here is the core error: GPnP could not be initialized, so reading the public/private network information from the GPnP profile failed, and (as the warning spells out) cluster traffic fell back to the public interface.
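The profile the instance failed to read can be dumped from the GI side for comparison (a sketch, run as the GI owner; $GRID_HOME stands in for the real Grid home path):

```bash
# Dump the current GPnP profile XML
$GRID_HOME/bin/gpnptool get
# In the output, the <gpnp:Network> elements show which adapter is tagged
# Use="public" and which Use="cluster_interconnect"
```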
While reviewing the patches in the DB home, the patch application time turned out to match the moment the interconnect first went wrong almost exactly:
```
Interim patches (2) :

Patch  31103343     : applied on Sat Aug 29 01:05:03 CST 2020
Unique Patch ID:  23687560
Patch description:  "Database Patch Set Update : 11.2.0.4.200714 (31103343)"
```
A search of MOS turns up a note that fits well: 11.2.0.4 RAC Database Instance Fails to Start with "No connectivity to other instances in the cluster during startup" After Applying 11.2.0.4 DB PSU (Doc ID 2471441.1)
```
CHANGES

DB PSU only was applied to 11.2.0.4 db home

CAUSE

For a database instance running in a RAC cluster, both DB PSU component
and OCW PSU component of the GI PSU must be applied.
In this case, only the DB PSU component was applied to the DB home
without applying the OCW PSU component of the GI PSU.
Verify using "opatch lsinventory" command on the database home.
```
Sure enough, this case too had only the DB patch applied, with no matching GI patch.
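Catching this kind of drift only takes comparing the two homes' inventories (a sketch; $GRID_HOME and $ORACLE_HOME stand in for the real paths):

```bash
# Patch levels of the GI home vs. the DB home, side by side
$GRID_HOME/OPatch/opatch lsinventory -oh $GRID_HOME     | grep -i 'patch description'
$ORACLE_HOME/OPatch/opatch lsinventory -oh $ORACLE_HOME | grep -i 'patch description'
```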
In real-world production maintenance, mismatches such as different PSU levels on GI and DB, or even different GI and DB versions, come up all the time. I strongly recommend keeping GI and DB software at the same version and patch level; it prevents a lot of unnecessary faults and trouble.