近日,某客户的一套生产库19c CDB无法启动,该CDB有17个PDB,可见影响面还是挺大的,数据库open的时候报错如下:
|
1 2 3 4 5 |
ORA-00600: internal error code, arguments: [kpdbIsObjectStorePdb], [0], [], [], [], [], [], [], [], [], [], [] <error barrier> at 0x7ffea370d018 placed dbsdrv.c@4959 ORA-00600: internal error code, arguments: [kpdbIsObjectStorePdb], [0], [], [], [], [], [], [], [], [], [], [] 2026-04-27T22:22:54.577794+08:00 Incident 2099797 created, dump file: /u01/app/oracle/diag/rdbms/prod/prod2/incident/incdir_2099797/prod2_ora_2819_i2099797.trc |
600错误的kpdbIsObjectStorePdb,我是第一次遇到,并且在google和mos搜索了一圈,没有一篇文档有关于这个错误的信息。不过多年的oracle深入学习,早就造就了一身分析未知问题的身手,例如之前的一个案例http://www.minniebaby.tech/2022/06/02/%e6%a1%88%e4%be%8b%ef%bc%9atroubleshooting-ora-00600-internal-error-code-arguments-25015-when-drop-tablespace/
看到kpdbIsObjectStorePdb函数,首先大概猜测一下这个函数到底啥意思
|
1 |
kpdb - kernel programmatic interface pluggable database |
看着大概是判断是否是PDB容器的意思,具体分析方法还是老办法,oracle的利器10046
报错源头定位:
|
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 |
2026-04-27T22:22:53.492508+08:00 Incident 2099796 created, dump file: /u01/app/oracle/diag/rdbms/prod/prod2/incident/incdir_2099796/prod2_ora_2819_i2099796.trc ORA-00600: internal error code, arguments: [kpdbIsObjectStorePdb], [0], [], [], [], [], [], [], [], [], [], [] PARSING IN CURSOR #139805230331240 len=81 dep=1 uid=0 oct=3 lid=0 tim=6573478107484 hv=2407721919 ad='37fea1420' sqlid='d5jw1dy7s5uxz' select c.obj#, c.con_id#, o.name from container$ c, obj$ o where c.obj# = o.obj# END OF STMT PARSE #139805230331240:c=2794,e=3779,p=0,cr=6,cu=0,mis=1,r=0,dep=1,og=4,plh=1415307720,tim=6573478107484 EXEC #139805230331240:c=44,e=44,p=0,cr=0,cu=0,mis=0,r=0,dep=1,og=4,plh=1415307720,tim=6573478107574 FETCH #139805230331240:c=50,e=50,p=0,cr=5,cu=0,mis=0,r=1,dep=1,og=4,plh=1415307720,tim=6573478107643 STAT #139805230331240 id=1 cnt=18 pid=0 pos=1 obj=0 op='HASH JOIN (cr=73 pr=9 pw=0 str=1 time=57 us cost=36 size=561 card=17)' STAT #139805230331240 id=2 cnt=18 pid=1 pos=1 obj=0 op='NESTED LOOPS (cr=73 pr=9 pw=0 str=1 time=288 us cost=36 size=561 card=17)' STAT #139805230331240 id=3 cnt=18 pid=2 pos=1 obj=0 op='NESTED LOOPS (cr=55 pr=0 pw=0 str=1 time=232 us cost=36 size=561 card=17)' STAT #139805230331240 id=4 cnt=18 pid=3 pos=1 obj=0 op='STATISTICS COLLECTOR (cr=2 pr=0 pw=0 str=1 time=55 us)' STAT #139805230331240 id=5 cnt=18 pid=4 pos=1 obj=162 op='TABLE ACCESS FULL CONTAINER$ (cr=2 pr=0 pw=0 str=1 time=19 us cost=2 size=136 card=17)' STAT #139805230331240 id=6 cnt=18 pid=3 pos=2 obj=36 op='INDEX RANGE SCAN I_OBJ1 (cr=53 pr=0 pw=0 str=18 time=83 us cost=1 size=0 card=1)' STAT #139805230331240 id=7 cnt=18 pid=2 pos=2 obj=18 op='TABLE ACCESS BY INDEX ROWID OBJ$ (cr=18 pr=9 pw=0 str=18 time=2575 us cost=2 size=25 card=1)' STAT #139805230331240 id=8 cnt=0 pid=1 pos=2 obj=40 op='INDEX FAST FULL SCAN I_OBJ5 (cr=0 pr=0 pw=0 str=0 time=0 us cost=1 size=25 card=1)' |
原来kpdbIsObjectStorePdb函数是通过select c.obj#, c.con_id#, o.name from container$ c, obj$ o where c.obj# = o.obj#去判断的,熟悉10046的人应该很容易看出结果返回了18行,不是17个pdb吗???
继续分析10046,分析了两条重要线索
|
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 |
===================== PARSING IN CURSOR #139805230068816 len=332 dep=1 uid=0 oct=3 lid=0 tim=6573476161617 hv=2698389488 ad='37ff95cf0' sqlid='acmvv4fhdc9zh' select obj#,type#,ctime,mtime,stime, status, dataobj#, flags, oid$, spare1, spare2, spare3, signature, spare7, spare8, spare9, nvl(dflcollid, 16382), creappid, creverid, modappid, modverid, crepatchid, modpatchid from obj$ where owner#=:1 and name=:2 and namespace=:3 and remoteowner is null and linkname is null and subname is null BINDS #139805230068816: Bind#0 oacdty=02 mxl=22(22) mxlc=00 mal=00 scl=00 pre=00 oacflg=00 fl2=1000001 frm=00 csi=00 siz=80 off=0 kxsbbbfp=7f26efc82f68 bln=22 avl=02 flg=05 value=1 Bind#1 oacdty=01 mxl=32(07) mxlc=00 mal=00 scl=00 pre=00 oacflg=10 fl2=0001 frm=01 csi=873 siz=0 off=24 kxsbbbfp=7f26efc82f80 bln=32 avl=07 flg=01 value="NON$CDB" Bind#2 oacdty=02 mxl=22(22) mxlc=00 mal=00 scl=00 pre=00 oacflg=00 fl2=1000001 frm=00 csi=00 siz=0 off=56 kxsbbbfp=7f26efc82fa0 bln=22 avl=02 flg=01 value=88 EXEC #139805230068816:c=0,e=70,p=0,cr=0,cu=0,mis=0,r=0,dep=1,og=4,plh=813480514,tim=6573478121775 WAIT #139805230068816: nam='db file sequential read' ela= 143 file#=1 block#=23482 blocks=1 obj#=37 tim=6573478121947 FETCH #139805230068816:c=0,e=196,p=1,cr=3,cu=0,mis=0,r=1,dep=1,og=4,plh=813480514,tim=6573478121978 |
存在一个叫NON$CDB的PDB,因为fetch的r=1,说明这个sql用这些bind值返回了1行数据
|
1 2 3 4 5 6 7 8 9 10 11 12 13 |
===================== PARSING IN CURSOR #139805208596896 len=376 dep=1 uid=0 oct=3 lid=0 tim=6573478113412 hv=2241375609 ad='39fdb7f30' sqlid='673rrdf2tjbbt' select con_id#, dbid, con_uid, status, create_scnwrp, create_scnbas, clnscnwrp, clnscnbas, rdba, flags, vsn, fed_root_con_id#, undoscn, srcpdb, linkname, upgrade_priority, rafn#, containers_port, containers_host, srcpdbuid, lastrcvscn, f_cdb_dbid, uscn, f_con_id#, undots, refreshint, postplugscn, postplugtime, tenant_id, snapint, credname from container$ where obj#=:1 BINDS #139805208596896: Bind#0 oacdty=02 mxl=22(22) mxlc=00 mal=00 scl=00 pre=00 oacflg=00 fl2=1000001 frm=00 csi=00 siz=24 off=0 kxsbbbfp=7f26efc81e20 bln=22 avl=04 flg=05 value=385936 EXEC #139805208596896:c=0,e=39,p=0,cr=0,cu=0,mis=0,r=0,dep=1,og=4,plh=2644323136,tim=6573478122093 FETCH #139805208596896:c=0,e=8,p=0,cr=2,cu=0,mis=0,r=1,dep=1,og=4,plh=2644323136,tim=6573478122112 |
这个NON$CDB的OBJ#很可能就是385936
这里可以根据10046读取的block,bbed看block,也可以mount下面dump block分析,我采用的是600错误自己生成errorstack里面的blockdump
|
1 2 3 4 5 6 7 8 9 10 |
row#86[4037] flag: -------, lock: 0, len=36, data:(6): 00 40 e8 83 00 17 col 0; len 2; (2): c1 02 col 1; len 7; (7): 4e 4f 4e 24 43 44 42 ---NON$CDB col 2; len 2; (2): c1 59 col 3; NULL col 4; NULL col 5; NULL col 6; len 3; (3): c2 02 0c col 7; len 1; (1): 80 col 8; len 4; (4): c3 27 3c 25 ---385936 |
这个PDB在OBJ$的I_OBJ5 索引上确实存在,下面是container$的blockdump,从命名来看确实多了一条
|
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 |
tab 0, row 17, @0x1270 tl: 64 fb: --H-FL-- lb: 0x0 cc: 33 col 0: [ 4] c3 27 3c 25 ---385936 col 1: [ 1] 80 col 2: [ 6] c5 05 5d 22 0b 4c col 3: [ 1] 80 col 4: [ 2] c1 03 col 5: [ 1] 80 col 6: [ 1] 80 col 7: [ 1] 80 col 8: [ 1] 80 col 9: [ 5] c4 05 14 31 19 col 10: [ 1] 80 col 11: [ 1] 80 col 12: *NULL* col 13: *NULL* col 14: *NULL* col 15: *NULL* col 16: *NULL* col 17: *NULL* col 18: *NULL* col 19: *NULL* col 20: *NULL* col 21: [ 1] 80 col 22: *NULL* col 23: *NULL* col 24: *NULL* col 25: *NULL* col 26: *NULL* col 27: *NULL* col 28: *NULL* col 29: *NULL* col 30: [ 1] 80 col 31: *NULL* col 32: [ 1] 80 |
这不是CDB吗?怎么会有一条NON$CDB??? 可能是ORACLE用于判断是否是CDB的。找了一个正常的CDB数据库10046 open跟踪
|
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 |
BINDS #140717075966024: Bind#0 oacdty=02 mxl=22(22) mxlc=00 mal=00 scl=00 pre=00 oacflg=00 fl2=1000001 frm=00 csi=00 siz=176 off=0 kxsbbbfp=7ffb3d9ab630 bln=22 avl=02 flg=05 value=1 Bind#1 oacdty=01 mxl=128(07) mxlc=00 mal=00 scl=00 pre=00 oacflg=10 fl2=0001 frm=01 csi=873 siz=0 off=24 kxsbbbfp=7ffb3d9ab648 bln=128 avl=07 flg=01 value="NON$CDB" Bind#2 oacdty=02 mxl=22(22) mxlc=00 mal=00 scl=00 pre=00 oacflg=00 fl2=1000001 frm=00 csi=00 siz=0 off=152 kxsbbbfp=7ffb3d9ab6c8 bln=22 avl=02 flg=01 value=88 EXEC #140717075966024:c=140,e=140,p=0,cr=0,cu=0,mis=0,r=0,dep=1,og=4,plh=813480514,tim=18347345278418 FETCH #140717075966024:c=12,e=11,p=0,cr=3,cu=0,mis=0,r=0,dep=1,og=4,plh=813480514,tim=18347345278443 |
确实都会做这一步检查,但是正常的CDB环境返回是0(r=0),那么问题就很清楚了,解决方法就是删除container$和obj$的obj#=385936的记录
删除方法有很多,可以bbed去改,这里需要改index又要改表,还不止一个表和索引,很麻烦,这里想到了之前分析oracle open的过程阶段性的重要函数
oracle open大致核心过程:
- kcfopd:open begin
- kcvcrv:oracle open一致性检查,如果需要则进行crash recovery中的SMON: enabling cache recovery
- kqlblfc:bootstrap以及字典完整性检查,这部分任何一个地方有报错,则会报ORA-00704 bootstrap process failure
- ktuini:undo初始化以及修改undo$的undo段状态
- kturRecoverUndoSegment:扫描undo段是否有活动事务
ktusmiut_init_ut->ktusmout_online_ut->ktusmous_online_undoseg->ktusmOnlineUndoSegments->ktusmOnlineOneSegment->ktuOnlineUndoSegment online undo segment - ktcCommitTxn:开始crash recovery中的SMON: enabling tx recovery
- kokiasg:数据字典校验
这个case明显已经到了数据字典校验阶段了,那么在报错函数kpdbIsObjectStorePdb打上断点,就可以进库里面去删除了
|
1 2 3 4 5 |
(gdb) b kpdbIsObjectStorePdb Breakpoint 1 at 0x2b18f10 (gdb) c Continuing. Breakpoint 1, 0x0000000002b18f10 in kpdbIsObjectStorePdb () |
|
1 2 3 |
delete from obj$ where obj#=385936; delete from container$ where obj#=385936; commit; |
然后再次重启数据库,绕过了这个错误,后续虽然还有报错都是非常简单的一些报错了,比如常见的ORA-01209
最后回归问题本身,为什么会多出一条NON$CDB的记录呢?翻看之前的alert,发现有人重建了控制文件,并且只写了root cdb的数据文件,很可能是oracle判断是NON-CDB然后自己写入的(猜测),真是够乱来的。
很久没有碰oracle了,目前也只有恢复数据库能让我碰一下了,这个错误ORA-00600 kpdbIsObjectStorePdb没有遇到过也没搜到过,也算全网首发了,记录一下
over!!!