The Impact of the RAID Battery on I/O Performance on Exadata X5

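Some time ago an Oracle Exadata X5 environment ran into a severe I/O problem: the AWR top events showed very high I/O latency. I/O performance had not been great before the problem either, but this time it was suddenly cut roughly in half. The Wait Avg (ms) of cell multiblock physical read, direct path write and cell smart table scan more than doubled, in some cases exceeding 100 ms, which is unacceptable for an Oracle environment. Analysis traced the problem to the hardware layer, and performance recovered after the RAID controller battery was replaced. More than ten years ago I ran into a similar problem, where a dead RAID controller battery made the RAID cache unusable and I/O performance degraded; back then I often saw colleagues heading to the Zhongguancun data center to swap RAID controller batteries. This post records the symptoms of the problem.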

Before the problem

Top 10 Foreground Events by Total Wait Time

Event	Waits	Total Wait Time (sec)	Wait Avg(ms)	% DB time	Wait Class
direct path read temp	336,705	23.6K	70	29.6	User I/O
DB CPU		23K		28.8
cell multiblock physical read	529,506	11.1K	21	14.0	User I/O
SQL*Net more data from dblink	1,311,660	5324.8	4	6.7	Network
cell smart table scan	261,849	2991.9	11	3.8	User I/O
SQL*Net message from dblink	29,177	2806.1	96	3.5	Network
direct path write	23,339	1395.6	60	1.8	User I/O
read by other session	569,076	1180.5	2	1.5	User I/O
cell single block physical read	225,337	812.3	4	1.0	User I/O
log file sync	18,809	733.5	39	.9	Commit

IO Profile

	Read+Write Per Second	Read per Second	Write Per Second
Total Requests:	3,892.6	473.8	3,418.8
Database Requests:	3,450.9	452.4	2,998.5
Optimized Requests:	434.5	362.7	71.8
Redo Requests:	419.2	0.0	419.2
Total (MB):	218.3	113.2	105.1
Database (MB):	182.3	112.9	69.4
Optimized Total (MB):	76.8	71.4	5.4
Redo (MB):	35.7	0.0	35.7
Database (blocks):	23,331.6	14,448.3	8,883.3
Via Buffer Cache (blocks):	12,331.3	7,136.4	5,195.0
Direct (blocks):	11,000.3	7,312.0	3,688.3

After the problem

Top 10 Foreground Events by Total Wait Time

Event	Waits	Total Wait Time (sec)	Wait Avg(ms)	% DB time	Wait Class
log file switch (checkpoint incomplete)	805	62.5K	77639	23.5	Configuration
cell multiblock physical read	495,927	55K	111	20.7	User I/O
direct path write	49,496	21.2K	429	8.0	User I/O
direct path read temp	147,062	21.2K	144	8.0	User I/O
direct path write temp	75,682	20.9K	277	7.9	User I/O
DB CPU		18.4K		6.9
local write wait	1,260	11.9K	9430	4.5	User I/O
enq: RO - fast object reuse	2,498	10.7K	4276	4.0	Application
cell smart table scan	267,287	10.4K	39	3.9	User I/O
buffer busy waits	1,609,792	6932.4	4	2.6	Concurrency

IO Profile

	Read+Write Per Second	Read per Second	Write Per Second
Total Requests:	1,939.1	399.1	1,540.1
Database Requests:	1,742.8	320.6	1,422.2
Optimized Requests:	293.2	263.5	29.7
Redo Requests:	115.0	0.0	115.0
Total (MB):	156.0	82.4	73.6
Database (MB):	119.9	81.2	38.8
Optimized Total (MB):	40.2	37.6	2.6
Redo (MB):	34.8	0.0	34.8
Database (blocks):	15,351.0	10,391.6	4,959.5
Via Buffer Cache (blocks):	8,674.3	5,513.3	3,161.1
Direct (blocks):	6,676.7	4,878.3	1,798.4
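
To make the degradation in the two Top 10 tables easier to see, the Wait Avg (ms) of the events that appear in both snapshots can be turned into a slowdown factor. The following is only a small sketch: the numbers are copied by hand from the tables above, and the helper name slowdown is made up for this illustration.

```python
# Wait Avg (ms) copied from the two "Top 10 Foreground Events" tables above.
before_ms = {
    "direct path read temp": 70,
    "cell multiblock physical read": 21,
    "cell smart table scan": 11,
    "direct path write": 60,
}
after_ms = {
    "direct path read temp": 144,
    "cell multiblock physical read": 111,
    "cell smart table scan": 39,
    "direct path write": 429,
}


def slowdown(before, after):
    """Return {event: after/before ratio} for events present in both snapshots."""
    return {ev: round(after[ev] / before[ev], 1) for ev in before if ev in after}


# Print the events with the largest slowdown first.
ratios = slowdown(before_ms, after_ms)
for event, ratio in sorted(ratios.items(), key=lambda kv: -kv[1]):
    print(f"{event:32s} {before_ms[event]:>4d} ms -> {after_ms[event]:>4d} ms  ({ratio}x)")
```

Only events present in both tables are compared; events such as log file switch (checkpoint incomplete) appear only after the problem and have no baseline here.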

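Data file I/O latency

Before the problem

After the problem

The AWR report is enough to tell that the database has an I/O problem, but on Exadata the investigation is somewhat more involved.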

Log collection
1. Operating system logs
2. exachk report

You need to be logged in as the oracle (RDBMS) software owner to run exachk:
[oracle@xxxx01dbm01 stg]$ ./exachk -a
CRS stack is running and CRS_HOME is not set. Do you want to set CRS_HOME to /u01/app/11.2.0.3/
grid?[y/n][y]y
Checking ssh user equivalency settings on all nodes in cluster
Node cm01dbm02 is configured for ssh user equivalency for oracle user
Searching for running databases . . . . .
. . . . .
List of running databases registered in OCR
1. dwprd
2. visx
3. visy
4. All of above
5. None of above
Select databases from list for checking best practices. For multiple databases, select 4 for All or
comma separated number like 1,2 etc. [1-5][4].
... Output omitted
When launched, exachk will supply several self-explanatory prompts, including prompts for the root password
on the storage servers, compute servers, InfiniBand switches, and so forth. When exachk is running, you’ll see output
from the script that resembles the following output:
... Output omitted
Collecting - Verify Hardware and Firmware on Database and Storage Servers (CheckHWnFWProfile)
[Database Server]
Collecting - Verify InfiniBand Address Resolution Protocol (ARP) Configuration on Database Servers
Collecting - Verify InfiniBand Fabric Topology (verify-topology)
Collecting - Verify InfiniBand subnet manager is running on an InfiniBand switch
Collecting - Verify Master (Rack) Serial Number is Set [Database Server]
Collecting - Verify RAID Controller Battery Condition [Database Server]
Collecting - Verify RAID Controller Battery Temperature [Database Server]
Collecting - Verify database server disk controllers use writeback cache
Collecting - root time zone check
...

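3. Check the flashcachemode
Check the flashcachemode of the Exadata Storage Servers. In WriteBack mode a write is buffered in the cache and acknowledged from there, whereas in WriteThrough mode it passes through to disk before the controller returns, so WriteBack gives better write performance. For the same reason a write-back cache generally relies on a battery or on RDMA technology to make sure the cached data is not lost.

Write Back: the controller signals completion to the host as soon as its cache has received all of the transferred data
Write Through: the controller signals completion to the host only after the disk subsystem has received all of the transferred data

For write-heavy workloads, Write-Back Flash Cache can improve I/O performance.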

[root@anbob01cel01 ~]# cellcli
CellCLI: Release 12.1.2.3.5 - Production on Wed Jan 17 10:09:51 GMT 2018
Copyright (c) 2007, 2016, Oracle. All rights reserved.
CellCLI> list cell attributes flashcachemode
WriteThrough
[root@anbob01cel01 ~]# dcli -g /opt/oracle.SupportTools/onecommand/cell_group -l root cellcli -e "list cell attributes flashcachemode"
xxxex2celadm01: WriteThrough
xxxex2celadm02: WriteThrough
xxxex2celadm03: WriteThrough
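
With many cells, the dcli output can be scanned automatically instead of by eye. The sketch below assumes the output has the "<cell>: <mode>" shape shown above; the function name cells_not_in_writeback is hypothetical and is not part of any Oracle tool.

```python
def cells_not_in_writeback(dcli_output):
    """Return the cell names whose reported flashcachemode is not WriteBack.

    Expects dcli-style lines of the form "<cell>: <mode>".
    """
    flagged = []
    for line in dcli_output.splitlines():
        if ":" not in line:
            continue  # skip blank or unexpected lines
        cell, mode = (part.strip() for part in line.split(":", 1))
        if mode.lower() != "writeback":
            flagged.append(cell)
    return flagged


# Sample taken from the dcli output above.
sample = """\
xxxex2celadm01: WriteThrough
xxxex2celadm02: WriteThrough
xxxex2celadm03: WriteThrough
"""
print(cells_not_in_writeback(sample))
```

In practice the text would come from saving the output of the dcli command to a file or capturing it from a subprocess; that part is left out here on purpose.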

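You can also check this with the cellcli command "list cell detail". The mode can be switched online.

4. Oracle Exadata is a tightly integrated hardware and software system with a number of dedicated optimizations, such as Exadata Smart Flash Log Write-Back, but that only applies to X7 and later.

5. Check the RAID controller with MegaCLI
RAID arrays built from a server's hard disks have poor read/write I/O performance on their own and need the controller cache enabled to improve it. To make sure that data sitting in the cache is not lost after an unexpected power failure, RAID controllers are usually fitted with a lithium battery or a supercapacitor to preserve the cached data, or rely on RDMA technology. Server RAID controllers carry a rechargeable battery, and as a safeguard the battery is discharged and recharged at regular intervals.

RAID cache policy for reads and writes:
Read:
For HDDs (mechanical disks), the Read Ahead read policy performs better
For SSDs (solid-state disks), the No Read Ahead read policy performs better
Write:
For HDDs (mechanical disks), the Write Back write policy performs better
For SSDs (solid-state disks), the Write Through write policy performs better

Use the MegaCli command to view basic information about the RAID controller.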

[root@anbob01cel01 ~]# ls /opt/MegaRAID/MegaCli/MegaCli64
/opt/MegaRAID/MegaCli/MegaCli64
[root@anbob01cel01 ~]# /opt/MegaRAID/MegaCli/MegaCli64 -v
MegaCLI SAS RAID Management Tool Ver 8.00.23 May 17, 2010
(c)Copyright 2010, LSI Corporation, All Rights Reserved.
Exit Code: 0x00
[root@anbob01cel01 ~]# MegaCli -AdpAllInfo -aALL
[root@anbob01cel01 ~]# ./MegaCli -LDGetProp -Cache -LAll -aAll
-- Displays information about the status of your battery-backed disk cache (BBU)
[root@anbob01cel01 ~]# MegaCli -AdpBbuCmd -GetBbuStatus -aALL

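By default, when the charge of the RAID controller battery drops below a threshold, the controller firmware treats the battery as unavailable and, to protect the data, disables the RAID cache. This default is reasonable in itself, but once the cache is disabled the I/O capability of the RAID drops sharply. A learn cycle (discharge, then recharge) typically lasts several hours, and for I/O-intensive applications the resulting slowdown can be fatal: I/O latency rises, queues build up, and the whole system slows down or may even be brought down.

Check the battery learn cycle schedule:

# MegaCli -AdpBbuCmd -getBbuProperties -aALL|egrep 'Period|Next'

Force a learn cycle manually:

# MegaCli -AdpBbuCmd -BbuLearn -a0

Check the current status of the HBA and its battery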

# /opt/MegaRAID/MegaCli/MegaCli64 -adpbbucmd -getbbustatus -a0|grep Battery
BatteryType: iBBU08
Battery State : Operational
Battery Pack Missing : No
Battery Replacement required : No
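
The same check can be scripted. The sketch below parses "Key : Value" lines like the ones above and reports any field that differs from the healthy sample. The expected values are an assumption taken from that sample only; other firmware or MegaCli versions may report different field names or states, and the function names are hypothetical.

```python
def parse_bbu_status(text):
    """Turn 'Key : Value' lines into a dict with stripped keys and values."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def bbu_problems(fields):
    """Return human-readable findings; an empty list means nothing was flagged."""
    # Expected values come from the healthy sample output above (assumption).
    expected = {
        "Battery State": "Operational",
        "Battery Pack Missing": "No",
        "Battery Replacement required": "No",
    }
    return [
        f"{name}: {fields.get(name, 'not reported')}"
        for name, good in expected.items()
        if fields.get(name) != good
    ]


# Healthy sample copied from the output above.
healthy = """\
BatteryType: iBBU08
Battery State : Operational
Battery Pack Missing : No
Battery Replacement required : No
"""
print(bbu_problems(parse_bbu_status(healthy)))
```

A missing field is reported as "not reported" rather than silently passing, so an unexpected output format shows up as a finding.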

For systems with a remotely mounted BBU

>image >=11.2.3.3.0
1) Log in as root.
2) Verify that the disk controller BBU battery is present and seen by the RAID controller. It may take a few minutes for a new BBU battery to be detected.
# /opt/MegaRAID/MegaCli/MegaCli64 -AdpAllInfo -a0 | grep BBU
BBU : Present
BBU : Yes
Cache When BBU Bad : Disabled
4) Verify that the disk controller BBU battery status is healthy.
>image >=12.1.2.1.0:
DBMCLI> LIST DBSERVER ATTRIBUTES bbustatus
>image <12.1.2.1.0
# /opt/oracle.cellos/compmon/exadata_mon_hw_asr.pl -list_bbu_status
BBU status: present

Verify that the current cache policy of the logical drives uses write-back mode (WriteBack)

# MegaCli -LDGetProp -Cache -LAll -aAll
Adapter 0-VD 0(target id: 0): Cache Policy:WriteBack, ReadAhead, Cached, No Write Cache if bad BBU
Adapter 0-VD 1(target id: 1): Cache Policy:WriteBack, ReadAhead, Cached, No Write Cache if bad BBU
Adapter 0-VD 2(target id: 2): Cache Policy:WriteBack, ReadAhead, Cached, No Write Cache if bad BBU
...
-- or --
# /opt/MegaRAID/MegaCli/MegaCli64 -ldpdinfo -a0 | grep -i bbu
Default Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU
...
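
Because the policy includes "No Write Cache if Bad BBU", a battery problem would be expected to show up as a Current Cache Policy that no longer matches the Default Cache Policy. The sketch below compares the two lines as printed by the second command above. The degraded sample is hypothetical and only illustrates the idea, and the function name is made up for this example.

```python
def cache_policy_downgrades(ldinfo_output):
    """Pair each 'Default Cache Policy' line with the next
    'Current Cache Policy' line and return the pairs whose write mode differs."""
    downgrades = []
    default_mode = None
    for line in ldinfo_output.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue
        key = key.strip()
        mode = value.split(",")[0].strip()  # first item, e.g. WriteBack
        if key == "Default Cache Policy":
            default_mode = mode
        elif key == "Current Cache Policy" and default_mode is not None:
            if mode != default_mode:
                downgrades.append((default_mode, mode))
            default_mode = None
    return downgrades


# Healthy sample copied from the output above.
healthy = """\
Default Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU
"""
# Hypothetical degraded output, written by hand for illustration only.
degraded = """\
Default Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteThrough, ReadAheadNone, Direct, No Write Cache if Bad BBU
"""
print(cache_policy_downgrades(healthy))
print(cache_policy_downgrades(degraded))
```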

6. Check alerts with cellcli
Use the CellCLI list alerthistory command to monitor alerts on the Exadata storage servers.

-- Log in to CellCLI from an Exadata storage cell and run the following command to display your storage cell’s alert history:
CellCLI> list alerthistory
CellCLI> list alerthistory 2_1 detail
-- To report on your critical storage cell alerts, use the following dcli/cellcli command:
[oracle@anbob01cel01 ~]$ dcli -g ./cell_group cellcli -e list alerthistory where severity='critical'
...
severity: critical
alertAction: "Battery is either in a learn cycle or it needs replacement. Please contact Oracle Support"
...

For replacing the battery, refer to the relevant procedures on MOS (My Oracle Support).
