This case comes from a power-utility customer in the western region running a 3-node RAC on an engineered system, which was frequently hitting slow application performance.
The basic AWR information is as follows:
Host Name | Platform | CPUs | Cores | Sockets | Memory (GB) |
---|---|---|---|---|---|
xxxxx | Linux x86 64-bit | 192 | 96 | 8 | 504.63 |
 | Snap Id | Snap Time | Sessions | Cursors/Session | Instances |
---|---|---|---|---|---|
Begin Snap: | 62201 | 02-Apr-22 08:00:08 | 1824 | 5.6 | 3 |
End Snap: | 62205 | 02-Apr-22 10:00:07 | 2963 | 6.5 | 3 |
Elapsed: | | 119.99 (mins) | | | |
DB Time: | | 45,761.81 (mins) | | | |
Top 10 Foreground Events by Total Wait Time
Event | Waits | Total Wait Time (sec) | Wait Avg(ms) | % DB time | Wait Class |
---|---|---|---|---|---|
gc buffer busy acquire | 13,897,498 | 589.2K | 42 | 21.5 | Cluster |
enq: US - contention | 175,549 | 425K | 2421 | 15.5 | Other |
gc cr grant 2-way | 14,038,191 | 337.2K | 24 | 12.3 | Cluster |
enq: TX - index contention | 87,093 | 172.4K | 1979 | 6.3 | Concurrency |
row cache lock | 473,144 | 157K | 332 | 5.7 | Concurrency |
gc cr multi block request | 1,855,834 | 138.6K | 75 | 5.0 | Cluster |
DB CPU | | 133.2K | | 4.9 | |
enq: TA - contention | 968 | 111.9K | 115571 | 4.1 | Other |
buffer busy waits | 259,371 | 103.3K | 398 | 3.8 | Concurrency |
gc buffer busy release | 210,472 | 100.3K | 476 | 3.7 | Cluster |
A large number of gc-related wait events is visible, and their latencies are high. When analyzing gc problems, my habit is to go look at the RAC Statistics sections of the AWR report.
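As a quick live-system complement to the AWR Top 10 list above, a query along the following lines pulls the same Cluster-class waits straight from the instances (a minimal sketch, assuming OS authentication as SYSDBA on one of the RAC nodes):

```bash
sqlplus -s / as sysdba <<'EOF'
SET LINESIZE 200 PAGESIZE 100
-- Cluster-class wait events across all instances, worst first
SELECT inst_id, event, total_waits,
       ROUND(time_waited_micro / 1e6)                              AS time_waited_s,
       ROUND(time_waited_micro / NULLIF(total_waits, 0) / 1000, 1) AS avg_ms
  FROM gv$system_event
 WHERE wait_class = 'Cluster'
 ORDER BY time_waited_micro DESC;
EOF
```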
Global Cache Load Profile
 | Per Second | Per Transaction |
---|---|---|
Global Cache blocks received: | 706.43 | 17.77 |
Global Cache blocks served: | 895.22 | 22.52 |
GCS/GES messages received: | 12,705.21 | 319.57 |
GCS/GES messages sent: | 15,198.13 | 382.27 |
DBWR Fusion writes: | 43.44 | 1.09 |
Estd Interconnect traffic (KB) | 18,263.06 |
Global Cache Efficiency Percentages (Target local+remote 100%)
Buffer access - local cache %: | 99.10 |
Buffer access - remote cache %: | 0.07 |
Buffer access - disk %: | 0.83 |
Global Cache and Enqueue Services - Workload Characteristics
Avg global enqueue get time (ms): | 19.9 |
Avg global cache cr block receive time (ms): | 25.2 |
Avg global cache current block receive time (ms): | 71.3 |
Avg global cache cr block build time (ms): | 0.0 |
Avg global cache cr block send time (ms): | 0.0 |
Global cache log flushes for cr blocks served %: | 7.4 |
Avg global cache cr block flush time (ms): | 2.5 |
Avg global cache current block pin time (ms): | 1.7 |
Avg global cache current block send time (ms): | 0.0 |
Global cache log flushes for current blocks served %: | 6.0 |
Avg global cache current block flush time (ms): | 2.8 |
Global Cache and Enqueue Services - Messaging Statistics
Avg message sent queue time (ms): | 0.0 |
Avg message sent queue time on ksxp (ms): | 111.0 |
Avg message received queue time (ms): | 0.0 |
Avg GCS message process time (ms): | 0.0 |
Avg GES message process time (ms): | 0.1 |
% of direct sent messages: | 33.27 |
% of indirect sent messages: | 62.49 |
% of flow controlled messages: | 4.24 |
Interconnect traffic is about 18 MB/s, which is nothing for an engineered system's InfiniBand. But Avg message sent queue time on ksxp (ms) is extremely high; at that point there is little need to read further into the AWR, since we can already place a severe performance problem on the private network.
Avg message sent queue time on ksxp is the time from sending a message until the peer receives it and returns an ACK. This indicator is very important: it directly reflects network latency, and it should generally be below 1 ms.
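Because this metric is essentially a network round trip, a crude OS-level cross-check needs nothing from Oracle (a sketch: the interface name and the peer-address placeholder must be replaced with this environment's actual values; a healthy interconnect should answer in well under 1 ms):

```bash
# 8 KB payload roughly matches one default-sized block transfer
ping -c 10 -I bond0 -s 8192 <peer-interconnect-ip>
```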
Interconnect Device Statistics
- Throughput and errors of interconnect devices (at OS level)
- All throughput numbers are megabytes per second
Device Name | IP Address | Public | Source | Send Mbytes/sec | Send Errors | Send Dropped | Send Buffer Overrun | Send Carrier Lost | Receive Mbytes/sec | Receive Errors | Receive Dropped | Receive Buffer Overrun | Receive Frame Errors |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
bond0 | 10.xxx.xxx.x | | OS dependent software | 16.28 | 0 | 0 | 0 | 0 | 12.75 | 0 | 0 | 37,782 | 0 |
The AWR interconnect device statistics show a very high Receive Buffer Overrun. I first took this for the usual insufficient-UDP-receive-buffer problem (the quick OS-level check for that is sketched below) and was about to close the case, but a second look revealed something odd: at the DB level the interconnect device is actually bond0, not the engineered system's usual IB interface, and the IP is not a private address but one on the public network.
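For the record, the quick OS-level check behind that first theory is standard Linux tooling, nothing site-specific (a sketch):

```bash
# Kernel UDP receive buffer limits (compare against Oracle's documented minimums)
sysctl net.core.rmem_default net.core.rmem_max
# OS-wide UDP counters; steadily growing error/drop counts would back the theory
netstat -su | grep -i error
```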
How did that happen?
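Which interfaces each layer believes it owns can be listed directly; the findings in the next paragraph come from checks along these lines (a sketch, assuming SYSDBA access and the GI owner's environment):

```bash
# What each DB instance is actually using for its interconnect
sqlplus -s / as sysdba <<'EOF'
SELECT inst_id, name, ip_address, is_public, source
  FROM gv$cluster_interconnects;
-- Rule out an explicit instance-level override
SELECT inst_id, value
  FROM gv$parameter
 WHERE name = 'cluster_interconnects';
EOF
# What the clusterware has registered (public vs cluster_interconnect)
oifcfg getif
```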
Further investigation found that cluster-level communication uses the HAIPs on the IB interfaces, all of which are correct; it is only the DB instances that use the public network. After ruling out the cluster_interconnects parameter being set on the DB instances, the next stop is the alert log: when a DB instance starts, it prints the interconnect IP(s) and protocol it will use. There we discovered that the story begins on August 29, 2020...
```
Sat Aug 29 00:42:55 2020
Cluster communication is configured to use the following interface(s) for this instance
  169.254.104.231
  169.254.209.215
cluster interconnect IPC version:Oracle UDP/IP (generic)
Sat Aug 29 01:06:19 2020
Cluster communication is configured to use the following interface(s) for this instance
  10.185.241.176
cluster interconnect IPC version:Oracle UDP/IP (generic)
```
So the instance was restarted twice within a short window on August 29, 2020, and on the second startup the interconnect IP changed. Continuing the analysis:
```
2020-08-29 01:06:12.798:
[USER(74999)]CRS-2316:Fatal error: cannot initialize GPnP, CLSGPNP_ERR (Generic GPnP error).
kggpnpInit: failed to init gpnp
WARNING: No cluster interconnect has been specified. Depending on
the communication driver configured Oracle cluster traffic
may be directed to the public interface of this machine.
Oracle recommends that RAC clustered databases be configured
with a private interconnect for enhanced security and
performance.
```
Here is the core error: GPnP could not be initialized, so reading the public/private network information from the GPnP profile failed, and (as the warning spells out) cluster traffic fell back to the public interface.
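The profile the instance failed to read can be dumped from the GI side for comparison (a sketch, run as the GI owner; $GRID_HOME stands in for the real Grid home path):

```bash
# Dump the current GPnP profile XML
$GRID_HOME/bin/gpnptool get
# In the output, the <gpnp:Network> elements show which adapter is tagged
# Use="public" and which Use="cluster_interconnect"
```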
While reviewing the patches in the DB home, the patch application time turned out to match the moment the interconnect first went wrong almost exactly:
```
Interim patches (2) :

Patch  31103343     : applied on Sat Aug 29 01:05:03 CST 2020
Unique Patch ID:  23687560
Patch description:  "Database Patch Set Update : 11.2.0.4.200714 (31103343)"
```
A search of MOS turns up a note that fits well: 11.2.0.4 RAC Database Instance Fails to Start with "No connectivity to other instances in the cluster during startup" After Applying 11.2.0.4 DB PSU (Doc ID 2471441.1)
```
CHANGES

DB PSU only was applied to 11.2.0.4 db home

CAUSE

For a database instance running in a RAC cluster, both DB PSU component
and OCW PSU component of the GI PSU must be applied.
In this case, only the DB PSU component was applied to the DB home
without applying the OCW PSU component of the GI PSU.
Verify using "opatch lsinventory" command on the database home.
```
Sure enough, this case too had only the DB patch applied, with no matching GI patch.
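Catching this kind of drift only takes comparing the two homes' inventories (a sketch; $GRID_HOME and $ORACLE_HOME stand in for the real paths):

```bash
# Patch levels of the GI home vs. the DB home, side by side
$GRID_HOME/OPatch/opatch lsinventory -oh $GRID_HOME     | grep -i 'patch description'
$ORACLE_HOME/OPatch/opatch lsinventory -oh $ORACLE_HOME | grep -i 'patch description'
```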
In real-world production maintenance, mismatches such as different PSU levels on GI and DB, or even different GI and DB versions, come up all the time. I strongly recommend keeping GI and DB software at the same version and patch level; it prevents a lot of unnecessary faults and trouble.