本案例来自北区某客户,数据库版本信息为AIX 7.1 RAC 11.2.0.4,在删除一个2T的空间时,报错ORA-00600。
1 2 3 4 5 6 |
SQL> DROP TABLESPACE "PROD" INCLUDING CONTENTS AND DATAFILES CASCADE CONSTRAINTS; DROP TABLESPACE "PROD" INCLUDING CONTENTS AND DATAFILES CASCADE CONSTRAINTS * ERROR at line 1: ORA-00600: internal error code, arguments: [25015], [13], [9], [30], [], [], [], [], [], [], [], [] |
看到ORA-00600,通常先去逛一下mos,发现25015没有查到任何相关文档,google也找了一圈也没有找到,没办法只能靠自己了。
通常ORA-00600的第一个Argument都有特殊的含义,要么是函数,要么是数字。通过第一个参数都可以缩小排查范围,例如本例的25015。
1 |
[250XX] kft.c Kernel File management Tablespace component. |
这属于一个表空间相关的报错,本身就要drop tablespace,这次这个信息有点鸡肋。
那么后面的Argument是什么意思呢?需要猜测了,第一反应就是肯定有一个是TS#。
1 2 3 4 5 |
SQL> select ts# from ts$ where name='PROD'; TS# ---------- 13 |
运气不错,第二个参数一查就是要删除的表空间号。9和30暂时猜不出是啥意思,这个时候需要用到10046了,10046是分析此类问题的最大利器。
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 |
===================== PARSING IN CURSOR #4577923312 len=182 dep=1 uid=0 oct=6 lid=0 tim=64067949074704 hv=810151256 ad='700012556b7b4e0' sqlid='7xrhv80s4mvas' update file$ set blocks=:2,ts#=DECODE(:3,-1,NULL,:3),status$=:4, relfile#=DECODE(:5,0,NULL,:5),maxextend=:6,inc=:7,crscnwrp=:8,crscnbas=:9,spare1=DECODE(:10,0,NULL,:10)where file#=:1 END OF STMT ... ... ... ===================== PARSING IN CURSOR #4577978904 len=338 dep=1 uid=0 oct=6 lid=0 tim=64067949338948 hv=3318318274 ad='70001259ef40a90' sqlid='cpjw5cr2wm162' update ts$ set name=:2,online$=:3,contents$=:4,undofile#=:5,undoblock#=:6, blocksize=:7,dflmaxext=:8,dflinit=:9,dflincr=:10,dflextpct=:11,dflminext=:12,dflminlen=:13,owner#=:14,scnwrp=:15,scnbas=:16,pitrscnwrp=:17,pitrscnbas=:18,dflogging=:19,bitmapped=:20,inc#=:21,flags=:22,plugged=:23,spare1=:24,spare2=:25,affstrength=:26 where ts#=:1 END OF STMT ... ... ... WAIT #4576511768: nam='control file parallel write' ela= 283 files=2 block#=116 requests=2 obj#=-1 tim=64067949675947 WAIT #4576511768: nam='control file sequential read' ela= 271 file#=0 block#=53 blocks=1 obj#=-1 tim=64067949676236 WAIT #4576511768: nam='control file parallel write' ela= 303 files=2 block#=52 requests=2 obj#=-1 tim=64067949676595 WAIT #4576511768: nam='control file sequential read' ela= 175 file#=0 block#=116 blocks=1 obj#=-1 tim=64067949676787 WAIT #4576511768: nam='control file sequential read' ela= 273 file#=0 block#=268 blocks=1 obj#=-1 tim=64067949677095 WAIT #4576511768: nam='control file parallel write' ela= 229 files=2 block#=116 requests=2 obj#=-1 tim=64067949677378 WAIT #4576511768: nam='control file sequential read' ela= 215 file#=0 block#=52 blocks=1 obj#=-1 tim=64067949677609 WAIT #4576511768: nam='control file parallel write' ela= 303 files=2 block#=267 requests=2 obj#=-1 tim=64067949678230 WAIT #4576511768: nam='control file sequential read' ela= 213 file#=0 block#=53 blocks=1 obj#=-1 tim=64067949678461 WAIT #4576511768: nam='control file parallel write' ela= 210 files=2 block#=52 requests=2 obj#=-1 tim=64067949678735 WAIT #4576511768: nam='control file sequential read' ela= 228 file#=0 block#=267 blocks=1 obj#=-1 tim=64067949678980 Incident 200841 created, dump file: /u01/app/oracle/diag/rdbms/hxlis/hxlis2/incident/incdir_200841/hxlis2_ora_9633892_i200841.trc ORA-00600: internal error code, arguments: [25015], [13], [9], [30], [], [], [], [], [], [], [], [] |
可以看到在连续的两个update file$和ts$后,开始读写控制文件之后报出了ORA-00600: internal error code, arguments: [25015], [13], [9], [30], [], [], [], [], [], [], [], []。这里猜测会不会是控制文件记录和数据字典记录不一致了呢?
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 |
SQL> select count(*) from file$ where ts#=13; COUNT(*) ---------- 73 SQL> select count(*) from v$datafile where ts#=13; COUNT(*) ---------- 74 SQL> select file# from v$datafile where ts#=13 minus select file# from file$ where ts#=13; FILE# ---------- 30 |
发现果然不一致,正好file$少了一条记录并且file#是30,那么该600错误的第四个参数30的含义应该就是文件号,应该是有人去手动delete了file$删除了file$的记录。
知道原因的话处理起来就舒服多了,一般思路有四种。
- 闪回查询:通过闪回查询把删除的记录找回来,很不幸ORA-01555出现了。
- logminer:通过logminer找回delete file$的数据,但是并不知道是啥时候删除的。
- bbed:通过修改行头flag标识撤销delete操作,dump了block发现delete的行已经被覆盖了,因为客户又新建了100多个数据文件。
- 构造file#=30的记录插入file$,看来也只能用这种方法了。
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 |
create table file$ /* file table */ ( file# number not null, /* file identifier number */ status$ number not null, /* status (see KTS.H): */ /* 1 = INVALID, 2 = AVAILABLE */ blocks number not null, /* size of file in blocks */ /* zero for bitmapped tablespaces */ ts# number, /* tablespace that owns file */ relfile# number, /* relative file number */ maxextend number, /* maximum file size */ inc number, /* increment amount */ crscnwrp number, /* creation SCN wrap */ crscnbas number, /* creation SCN base */ ownerinstance varchar("M_IDEN"), /* Owner instance name */ spare1 number, /* tablespace-relative DBA of space file header */ /* NULL for dictionary-mapped tablespaces */ spare2 number, spare3 varchar2(1000), spare4 date ) |
看了一下file$的定义,构造一条记录太简单了,找一个同表空间的数据文件记录,修改一下file#、relfile#、crscnwrp、crscnbas即可,create scn在控制文件中也有记录,所以可以轻松的构造一条file#=30的记录插入到file$中。
操作步骤大致如下:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 |
SYS@hxlis1 >create table t as select * from file$ where file#=29; Table created. SQL> select file#,creation_change# from v$datafile where file#=30; FILE# CREATION_CHANGE# -------------------- -------------------- 30 11816753253752 SYS@hxlis1 >select to_number('abf','xxxxxxxxxxxxxxxxxxxx'),to_number('4d614d78','xxxxxxxxxxxxxxxxxxxxxx') from dual; TO_NUMBER('ABF','XXXXXXXXXXXXXXXXXXXX') --------------------------------------- TO_NUMBER('4D614D78','XXXXXXXXXXXXXXXXXXXXXX') ---------------------------------------------- 2751 1298222456 SQL> update t set file#=30,relfile#=30,CRSCNWRP=2751,CRSCNBAS=1298220064; 1 row updated. SQL> commit; Commit complete. SQL> insert into file$ select * from t; 1 row created. SQL> commit; Commit complete. |
修改完成之后,为了保险起见,手动刷新两个节点的shared pool和db cache。
1 2 3 |
SYS@hxlis1 >DROP TABLESPACE "PROD" INCLUDING CONTENTS AND DATAFILES CASCADE CONSTRAINTS; Tablespace dropped. |
再次尝试删除成功。
强烈建议在不熟悉的情况下,不要轻易修改oracle的字典基表。本案例就是由于人为delete了file$记录引发的。