本案例来自南区某客户,环境为exadata,数据库版本rac 11.2.0.4。在并行创建索引时,大概执行5分钟之后就会报错ORA-12805,据现场人员描述,无论并行度指定为多少,都会报错。
mos中有一篇文档How to Diagnose ORA-12805 Parallel Query Server Died Unexpectedly Error (Doc ID 1348002.1) ,详细记录了ORA-12805的诊断方法。
1 2 3 4 |
alter session set MAX_DUMP_FILE_SIZE= unlimited; alter session set TRACEFILE_IDENTIFIER = 'ORA12805' alter session set "_px_trace"=medium, execution, medium, messaging, time; alter system set events = '12805 trace name ERRORSTACK level 3'; |
设置px trace和12805的errorstack之后,复现错误,分析报错的trace文件。
可以找到报错的语句为:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 |
*** 2023-01-21 22:30:04.803 dbkedDefDump(): Starting a non-incident diagnostic dump (flags=0x0, level=3, mask=0x0) ----- Error Stack Dump ----- ORA-12805: parallel query server died unexpectedly ----- Current SQL Statement for this session (sql_id=dm52tctzjbpak) ----- create unique index mesmgr.CWIPASYINV_PKP on MESMGR.CWIPASYINV (FACTORY, LOT_ID, INV_LOT_ID, BOARD_SIDES) tablespace SMT_TS_IDX parallel 8 ----- Call Stack Trace ----- calling call entry argument values in hex location type point (? means dubious value) -------------------- -------- -------------------- ---------------------------- skdstdst()+41 call kgdsdst() 000000000 ? 000000000 ? 7FFDD43ED2E0 ? 7FFDD43ED3B8 ? 7FFDD43F1E60 ? 000000002 ? ksedst1()+103 call skdstdst() 000000000 ? 000000000 ? 7FFDD43ED2E0 ? 7FFDD43ED3B8 ? 7FFDD43F1E60 ? 000000002 ? ksedst()+39 call ksedst1() 000000000 ? 000000001 ? 7FFDD43ED2E0 ? 7FFDD43ED3B8 ? 7FFDD43F1E60 ? 000000002 ? dbkedDefDump()+2746 call ksedst() 000000000 ? 000000001 ? 7FFDD43ED2E0 ? 7FFDD43ED3B8 ? 7FFDD43F1E60 ? 000000002 ? ksedmp()+41 call dbkedDefDump() 000000003 ? 000000000 ? 7FFDD43ED2E0 ? 7FFDD43ED3B8 ? 7FFDD43F1E60 ? 000000002 ? dbkdaKsdActDriver() call ksedmp() 000000003 ? 000000000 ? +1960 7FFDD43ED2E0 ? 7FFDD43ED3B8 ? 7FFDD43F1E60 ? 000000002 ? dbgdaExecuteAction( call dbkdaKsdActDriver() 7F351CF04730 ? 7FFDD43F4560 ? )+1065 7FFDD43ED2E0 ? 7FFDD43ED3B8 ? 7FFDD43F1E60 ? 000000002 ? dbgdaRunAction()+81 call dbgdaExecuteAction( 7F351CF04730 ? 00A80CE00 ? 5 ) 0020C0003 ? 7FFDD43F4560 ? 000000001 ? 000000002 ? dbgdRunActions()+64 call dbgdaRunAction() 7F351CF04730 ? 000000005 ? 0020C0003 ? 7FFDD43F4560 ? 000000001 ? 000000002 ? dbgdProcessEventAct call dbgdRunActions() 7F351CF04730 ? 000000005 ? ions()+656 0020C0003 ? 7FFDD43F4560 ? 000000001 ? 000000002 ? dbgdChkEventKgErr() call dbgdProcessEventAct 7F351CF04730 ? 00C35D8C0 ? +1658 ions() 7F3519B005A0 ? 7FFDD43F4560 ? 000000001 ? 000000002 ? dbkdChkEventRdbmsEr call dbgdChkEventKgErr() 7F351CF04730 ? 00C35D8C0 ? r()+61 7F3519B005A0 ? 7FFDD43F4560 ? 000000001 ? 000000002 ? ksfpec()+66 call dbkdChkEventRdbmsEr 7F351CF04730 ? 00C35D8C0 ? r() 7F3519B005A0 ? 7FFDD43F4560 ? 000000001 ? 000000002 ? dbgePostErrorKGE()+ call ksfpec() 7F351CF04730 ? 00C35D8C0 ? 1137 7F3519B005A0 ? 7FFDD43F4560 ? 000000001 ? 000000002 ? dbkePostKGE_kgsf()+ call dbgePostErrorKGE() 00C35D8C0 ? 7F351CF0D9F0 ? 71 000003205 ? 7FFDD43F4560 ? 000000001 ? 000000002 ? kgeselv()+276 call dbkePostKGE_kgsf() 00C35D8C0 ? 7F351CF0D9F0 ? 000003205 ? 7FFDD43F4560 ? 000000001 ? 000000002 ? ksesecl0()+162 call kgeselv() 00C35D8C0 ? 7F351CF0D9F0 ? 000003205 ? 7FFDD43F4560 ? 000000001 ? 000000000 ? ksucin()+147 call ksesecl0() 00C35D8C0 ? 7F351CF0D9F0 ? 000003205 ? 000000000 ? 0000000B0 ? 000000C51 ? kxfpqidqr()+14184 call ksucin() 00C35D8C0 ? 7F351CF0D9F0 ? 000003205 ? 000000000 ? 0000000B0 ? 000000C51 ? kxfpqdqr()+367 call kxfpqidqr() 00C35D8C0 ? 7F351CF0D9F0 ? 000003205 ? 000000000 ? 0000000B0 ? 000000C51 ? kxfxgs()+82 call kxfpqdqr() 138FFD8120 ? 000000000 ? 00000001B ? 000000000 ? 0000000B0 ? 000000C51 ? kxfxcw()+281 call kxfxgs() 000000000 ? 000000000 ? 7FFDD43F6060 ? 000000000 ? 0000000B0 ? 000000C51 ? qerpxFetch()+2368 call kxfxcw() 7F3519B12B20 ? 7FFDD43F6060 ? 7FFDD43F6058 ? 000000000 ? 000000000 ? 7FFD00000001 ? kdicrws()+2005 call qerpxFetch() 7F3519B12B20 ? 7FFDD43F6060 ? 7FFDD43F6058 ? 000000000 ? 000000000 ? 7FFD00000001 ? kdicdrv()+492 call kdicrws() 7F3519B12B20 ? 7FFDD43F6060 ? 7FFDD43F6058 ? 000000000 ? 000000000 ? 7FFD00000001 ? opiexe()+22672 call kdicdrv() 12CCEF1088 ? 7FFDD43F6060 ? 7FFDD43F84A0 ? 000000000 ? 000000000 ? 7FFD00000001 ? opiosq0()+4310 call opiexe() 000000004 ? 000000000 ? 7FFDD43F84A0 ? 000000000 ? 000000000 ? 7FFD00000001 ? kpooprx()+274 call opiosq0() 000000003 ? 00000000E ? 7FFDD43F8EF0 ? 0000000A4 ? 000000000 ? 7FFD00000001 ? kpoal8()+842 call kpooprx() 7FFDD43FC6D4 ? 7FFDD43FA690 ? 00000008A ? 000000001 ? 000000000 ? 7FFD00000001 ? opiodr()+915 call kpoal8() 00000005E ? 7FFDD43FA690 ? 00000008A ? 000000001 ? 000000000 ? 7FFD00000001 ? ttcpip()+2773 call opiodr() 00000005E ? 00000001C ? 7FFDD43FC6D0 ? 000000001 ? 000000000 ? 7FFD00000001 ? opitsk()+1705 call ttcpip() 00C37B510 ? 009B99F90 ? 7FFDD43FC6D0 ? 000000000 ? 7FFDD43FC128 ? 7FFDD43FC6CC ? opiino()+969 call opitsk() 00C37B518 ? 000000000 ? 7FFDD43FC6D0 ? 000000000 ? 7FFDD43FC128 ? 7FFDD43FC6CC ? opiodr()+915 call opiino() 00000003C ? 000000004 ? 7FFDD43FDEC8 ? 000000000 ? 7FFDD43FC128 ? 7FFDD43FC6CC ? opidrv()+570 call opiodr() 00000003C ? 000000004 ? 7FFDD43FDEC8 ? 000000000 ? 7FFDD43FC128 ? 7FFDD43FC6CC ? sou2o()+103 call opidrv() 00000003C ? 000000004 ? 7FFDD43FDEC8 ? 000000000 ? 7FFDD43FC128 ? 7FFDD43FC6CC ? opimai_real()+133 call sou2o() 7FFDD43FDEA0 ? 00000003C ? 000000004 ? 7FFDD43FDEC8 ? 7FFDD43FC128 ? 7FFDD43FC6CC ? ssthrdmain()+265 call opimai_real() 000000002 ? 7FFDD43FE090 ? 000000004 ? 7FFDD43FDEC8 ? 7FFDD43FC128 ? 7FFDD43FC6CC ? main()+201 call ssthrdmain() 000000002 ? 7FFDD43FE090 ? 000000001 ? 000000000 ? 7FFDD43FC128 ? 7FFDD43FC6CC ? __libc_start_main() call main() 000000002 ? 7FFDD43FE238 ? +245 000000001 ? 000000000 ? 7FFDD43FC128 ? 7FFDD43FC6CC ? _start()+41 call __libc_start_main() 000A9F990 ? 000000002 ? 7FFDD43FE228 ? 000000000 ? 7FFDD43FC128 ? 7FFDD43FC6CC ? |
继续分析qc进程的trace文件还可以找到比较关键的信息:
1 2 3 4 |
2023-01-21 22:30:06.043945 :PX_Messaging:kxfp.c@3324:kxfpqsod(begin): q=0x138ffd8120 qser=857089 action=0x2 flg=0x4402 2023-01-21 22:30:06.043965 :PX_Messaging:kxfp.c@2921:kxfpqsod_qc_sod(): q=0x138ffd8120 qser=857089 qinc=0 action=0x2 flg=0x4402 Process received 12805 signal due to P019, instance dm03dbadm01.adc.com:SMTDB1 (1) |
qc进程报错ORA-12805的原因是因为px slave进程P019报错了。
继续分析P019进程的trace,发现报错为28
1 2 3 4 5 6 7 8 9 10 |
*** 2023-01-21 22:30:04.733 2023-01-21 22:30:04.733287 :PX_Control:kxfx.c@7901:kxfxmai(): In signal-handler after catching 28 2023-01-21 22:30:04.733329 :PX_Control:kxfx.c@4404:kxfxsuf(): kxfx slave finish kxfxsSmtClose [ 317370/317280] stmt:0x7fc04bc51908 UGA: InUse:959264KB Alloc:959264KB PGA: InUse:2190KB Alloc:2375KB Call 0x7fc04bc51258 popped 2023-01-21 22:30:04.803193 :PX_Messaging:kxfp.c@12411:kxfprdp(): Exiting parallel client parallel query execution(4) on error=28 |
1 2 3 4 5 |
[oracle@rac1 ~]$ oerr ora 28 00028, 00000, "your session has been killed" // *Cause: A privileged user has killed your session and you are no longer // logged on to the database. // *Action: Login again if you wish to continue working. |
28的意思是session被kill了,难道px slave被kill了?
疑似Bug 23717927 : ORA-12805: PARALLEL QUERY SERVER DIED UNEXPECTEDLY
1 2 |
Receive a 12805 signal from a slave process. eg. Slave get err=28 and then send 12805 to QC |