该案例来自北区某客户,6节点rac,数据库版本为11.2.0.4,客户的rac环境并没有配置NTP server,而是使用RAC自带的时间同步组件CTSS来做的集群节点间的时钟同步。
据客户与一线同事描述,现象为节点1和节点2时钟同步正常,但是3-6节点无法时钟同步。这个问题已经困扰了客户很长一段时间了。CTSS的时钟同步,会选择一个参考节点,其他节点的时钟都会依据参考节点的时间进行微调。
先来看看异常的节点3的gi alert日志:
1 2 3 4 |
2023-03-15 16:54:49.297: [ctssd(119933)]CRS-2401:The Cluster Time Synchronization Service started on host core-js3. 2023-03-15 16:54:49.297: [ctssd(119933)]CRS-2407:The new Cluster Time Synchronization Service reference node is host core-js1. |
从gi alert可以看到此时rac的ctss参考节点为节点1。
3节点ctss日志:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 |
Msg NOT meant for this member. This member id (3, 1718544606). Dest in Msg (3, 4294802950). Message dropped. 2023-03-16 10:50:32.856: [ CRSCCL][3769472768] **** Connection NOT meant for this member. Dest in Msg (3, 4294802950) 2023-03-16 10:50:32.856: [GIPCXCPT][3769472768] gipcInternalSend: cannot send empty buffer buf 0x7fa9c800c140, len 0, ret gipcretFail (1) 2023-03-16 10:50:32.856: [GIPCXCPT][3769472768] gipcSendF [clsCclGipcSend : clsCclCommHandler.c : 3756]: EXCEPTION[ ret gipcretFail (1) ] failed to send on endp 0x7fa9cc57c0e0 [00000000000538de] { gipcEndpoint : localAddr 'gipcha://core-js3:CTSSGROUP_3/af57-c2f1-bdb7-8cc', remoteAddr 'gipcha://core-js1:6208-4b90-6a5f-02e', numPend 1, numReady 0, numDone 0, numDead 0, numTransfer 0, objFlags 0x0, pidPeer 81272, readyRef (nil), ready 0, wobj 0x7fa9cc59cb40, sendp (nil)flags 0x138606, usrFlags 0x20 }, addr 0000000000000000, buf 0x7fa9c800c140, len 0, cookie (nil), flags 0x0 2023-03-16 10:50:32.856: [ CRSCCL][3769472768]gipcSend failed. rc= 1. 2023-03-16 10:50:32.856: [ CRSCCL][3769472768]Failed to send new connection deny msg. 2023-03-16 10:50:33.857: [GIPCHAUP][3776157440] gipchaUpperDisconnect: initiated discconnect umsg 0x7fa9cc523880 { msg 0x7fa9cc58ef68, ret gipcretRequestPending (15), flags 0x2 }, msg 0x7fa9cc58ef68 { type gipchaMsgTypeDisconnect (5), srcCid 00000000-000538d3, dstCid 00000000-0ebca523 }, endp 0x7fa9cc58e010 [00000000000538d3] { gipchaEndpoint : port 'CTSSGROUP_3/af57-c2f1-bdb7-8cc2', peer 'core-js1:6208-4b90-6a5f-02ec', srcCid 00000000-000538d3, dstCid 00000000-0ebca523, numSend 0, maxSend 100, groupListType 2, hagroup 0x1079f90, usrFlags 0x4000, flags 0x21c } 2023-03-16 10:50:33.858: [GIPCHAUP][3776157440] gipchaUpperCallbackDisconnect: completed DISCONNECT ret gipcretSuccess (0), umsg 0x7fa9cc523880 { msg 0x7fa9cc58ef68, ret gipcretSuccess (0), flags 0x2 }, msg 0x7fa9cc58ef68 { type gipchaMsgTypeDisconnect (5), srcCid 00000000-000538d3, dstCid 00000000-0ebca523 }, hendp 0x7fa9cc58e010 [00000000000538d3] { gipchaEndpoint : port 'CTSSGROUP_3/af57-c2f1-bdb7-8cc2', peer 'core-js1:6208-4b90-6a5f-02ec', srcCid 00000000-000538d3, dstCid 00000000-0ebca523, numSend 0, maxSend 100, groupListType 2, hagroup 0x1079f90, usrFlags 0x4000, flags 0x21c } 2023-03-16 10:50:39.519: [GIPCXCPT][3763169024] gipchaInternalResolve: failed to resolve ret gipcretKeyNotFound (36), host 'core-js3', port '8588-94d7-4477-b30a', hctx 0x104ea40 [0000000000000010] { gipchaContext : host 'core-js3', name '57d6-c579-303e-038b', luid '4b08e6b2-00000000', numNode 1, numInf 2, usrFlags 0x0, flags 0x5 }, ret gipcretKeyNotFound (36) 2023-03-16 10:50:39.519: [GIPCHGEN][3763169024] gipchaResolveF [gipcmodGipcResolve : gipcmodGipc.c : 815]: EXCEPTION[ ret gipcretKeyNotFound (36) ] failed to resolve ctx 0x104ea40 [0000000000000010] { gipchaContext : host 'core-js3', name '57d6-c579-303e-038b', luid '4b08e6b2-00000000', numNode 1, numInf 2, usrFlags 0x0, flags 0x5 }, host 'core-js3', port '8588-94d7-4477-b30a', flags 0x0 2023-03-16 10:50:39.519: [GIPCHAUP][3776157440] gipchaUpperDisconnect: initiated discconnect umsg 0x7fa9cc5045c0 { msg 0x7fa9cc57a788, ret gipcretRequestPending (15), flags 0x2 }, msg 0x7fa9cc57a788 { type gipchaMsgTypeDisconnect (5), srcCid 00000000-00053886, dstCid 00000000-0ebca3e8 }, endp 0x7fa9c0014af0 [0000000000053886] { gipchaEndpoint : port 'd707-f99e-f6ed-759a', peer 'core-js1:CTSSGROUP_1/ca26-4aaf-b16c-efcf', srcCid 00000000-00053886, dstCid 00000000-0ebca3e8, numSend 0, maxSend 100, groupListType 2, hagroup 0x1079f90, usrFlags 0x4000, flags 0x21c } 2023-03-16 10:50:39.519: [ CRSCCL][3769472768]GW: No record of this endp= 0x53871 2023-03-16 10:50:39.519: [ GIPCLIB][3769472768] gipclibMapSearch: gipcMapSearch() -> gipcMapGetNodeAddr() failed: ret:gipcretKeyNotFound (36), ht:0x1016010, idxPtr:0x1020720, key:0x7fa9e0ad8550, flags:0x0 2023-03-16 10:50:39.519: [GIPCXCPT][3769472768] gipcObjectLookupF [gipcDissociateF : gipc.c : 2230]: search found no matching oid 0000000000053871, ret gipcretKeyNotFound (36), ret gipcretInvalidObject (3) 2023-03-16 10:50:39.519: [GIPCXCPT][3769472768] gipcDissociateF [clsCclGipcDisconnect : clsCclCommHandler.c : 4031]: EXCEPTION[ ret gipcretInvalidObject (3) ] failed to dissociate obj 0000000000053871, flags 0x0 2023-03-16 10:50:39.519: [ CRSCCL][3769472768]clsCclGipcDisconnect: gipcDissociate() failed. rc= 3. 2023-03-16 10:50:39.519: [ GIPCLIB][3769472768] gipclibMapSearch: gipcMapSearch() -> gipcMapGetNodeAddr() failed: ret:gipcretKeyNotFound (36), ht:0x1016010, idxPtr:0x1020720, key:0x7fa9e0ad8550, flags:0x0 2023-03-16 10:50:39.519: [GIPCXCPT][3769472768] gipcObjectLookupF [gipcDisconnectF : gipc.c : 1572]: search found no matching oid 0000000000053871, ret gipcretKeyNotFound (36), ret gipcretInvalidObject (3) 2023-03-16 10:50:39.519: [GIPCTRAC][3769472768] gipcDisconnectF [clsCclGipcDisconnect : clsCclCommHandler.c : 4037]: EXCEPTION[ ret gipcretInvalidObject (3) ] failed disconnect endp 0000000000053871, flags 0x0 2023-03-16 10:50:39.520: [ GIPCLIB][3769472768] gipclibMapSearch: gipcMapSearch() -> gipcMapGetNodeAddr() failed: ret:gipcretKeyNotFound (36), ht:0x1016010, idxPtr:0x1020720, key:0x7fa9e0ad8560, flags:0x0 2023-03-16 10:50:39.520: [GIPCXCPT][3769472768] gipcObjectLookupF [gipcDestroyF : gipc.c : 2982]: search found no matching oid 0000000000053871, ret gipcretKeyNotFound (36), ret gipcretInvalidObject (3) 2023-03-16 10:50:39.520: [GIPCXCPT][3769472768] gipcDestroyF [clsCclGipcDisconnect : clsCclCommHandler.c : 4046]: EXCEPTION[ ret gipcretInvalidObject (3) ] failure to destroy obj 0000000000053871, flags 0x0 2023-03-16 10:50:39.520: [GIPCHAUP][3776157440] gipchaUpperCallbackDisconnect: completed DISCONNECT ret gipcretSuccess (0), umsg 0x7fa9cc5045c0 { msg 0x7fa9cc57a788, ret gipcretSuccess (0), flags 0x2 }, msg 0x7fa9cc57a788 { type gipchaMsgTypeDisconnect (5), srcCid 00000000-00053886, dstCid 00000000-0ebca3e8 }, hendp 0x7fa9c0014af0 [0000000000053886] { gipchaEndpoint : port 'd707-f99e-f6ed-759a', peer 'core-js1:CTSSGROUP_1/ca26-4aaf-b16c-efcf', srcCid 00000000-00053886, dstCid 00000000-0ebca3e8, numSend 0, maxSend 100, groupListType 2, hagroup 0x1079f90, usrFlags 0x4000, flags 0x21c } 2023-03-16 10:50:50.371: [ CTSS][3784607488]ctss_checkcb: clsdm requested check alive. checkcb_data{mode[0x84], offset[0 ms]}, length=[8]. 2023-03-16 10:51:03.861: [GIPCXCPT][3776157440] gipchaInternalResolve: failed to resolve ret gipcretKeyNotFound (36), host 'core-js3', port 'b5b6-c3d5-47e6-d678', hctx 0x104ea40 [0000000000000010] { gipchaContext : host 'core-js3', name '57d6-c579-303e-038b', luid '4b08e6b2-00000000', numNode 1, numInf 2, usrFlags 0x0, flags 0x5 }, ret gipcretKeyNotFound (36) 2023-03-16 10:51:03.861: [GIPCHGEN][3776157440] gipchaResolveF [gipcmodGipcResolve : gipcmodGipc.c : 815]: EXCEPTION[ ret gipcretKeyNotFound (36) ] failed to resolve ctx 0x104ea40 [0000000000000010] { gipchaContext : host 'core-js3', name '57d6-c579-303e-038b', luid '4b08e6b2-00000000', numNode 1, numInf 2, usrFlags 0x0, flags 0x5 }, host 'core-js3', port 'b5b6-c3d5-47e6-d678', flags 0x0 2023-03-16 10:51:03.862: [GIPCXCPT][3776157440] gipchaInternalResolve: failed to resolve ret gipcretKeyNotFound (36), host 'core-js3', port '09ad-25a6-7338-b805', hctx 0x104ea40 [0000000000000010] { gipchaContext : host 'core-js3', name '57d6-c579-303e-038b', luid '4b08e6b2-00000000', numNode 1, numInf 2, usrFlags 0x0, flags 0x5 }, ret gipcretKeyNotFound (36) 2023-03-16 10:51:03.862: [GIPCHGEN][3776157440] gipchaResolveF [gipcmodGipcResolve : gipcmodGipc.c : 815]: EXCEPTION[ ret gipcretKeyNotFound (36) ] failed to resolve ctx 0x104ea40 [0000000000000010] { gipchaContext : host 'core-js3', name '57d6-c579-303e-038b', luid '4b08e6b2-00000000', numNode 1, numInf 2, usrFlags 0x0, flags 0x5 }, host 'core-js3', port '09ad-25a6-7338-b805', flags 0x0 2023-03-16 10:51:03.862: [ CRSCCL][3769472768]clsCclNewConn: added new conn to tempConList: newPeerCon = c800af60 2023-03-16 10:51:03.862: [ CRSCCL][3769472768] **** |
从ctss日志可以看出,ctss服务是依赖gipc进行通信的,并且可以看到大量的send报错,remote地址为节点1。同时也有一些drop、disconnect关键字。初步怀疑是防火墙问题。
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 |
# For RAC HAIP -A INPUT -s 169.254.0.0/16 -j ACCEPT # For RAC and Storage -A INPUT -m iprange --src-range 10.xxx.xxx.xxx-10.xxx.xxx.xxx -j ACCEPT -A INPUT -m iprange --src-range 10.xxx.xxx.xxx-10.xxx.xxx.xxx -j ACCEPT -A INPUT -m iprange --src-range 10.xxx.xxx.xxx-10.xxx.xxx.xxx -j ACCEPT -A INPUT -m iprange --src-range 172.xxx.xxx.xxx-172.xxx.xxx.xxx -j ACCEPT -A INPUT -m iprange --src-range 172.xxx.xxx.xxx-172.xxx.xxx.xxx -j ACCEPT -A INPUT -m iprange --src-range 172.xxx.xxx.xxx-172.xxx.xxx.xxx -j ACCEPT -A INPUT -m iprange --src-range 172.xxx.xxx.xxx-172.xxx.xxx.xxx -j ACCEPT -A INPUT -m iprange --src-range 172.xxx.xxx.xxx-172.xxx.xxx.xxx -j ACCEPT -A INPUT -m iprange --src-range 172.xxx.xxx.xxx-172.xxx.xxx.xxx -j ACCEPT xxx.xxx -A INPUT -m iprange --src-range 172.xxx.xxx.xxx-172.xxx.xxx.xxx -j ACCEPT -A INPUT -m iprange --src-range 172.xxx.xxx.xxx-172.xxx.xxx.xxx -j ACCEPT -A INPUT -m iprange --src-range 172.xxx.xxx.xxx-172.xxx.xxx.xxx -j ACCEPT -A INPUT -m iprange --src-range 172.xxx.xxx.xxx-172.xxx.xxx.xxx -j ACCEPT -A INPUT -m iprange --src-range 172.xxx.xxx.xxx-172.xxx.xxx.xxx -j ACCEPT -A INPUT -m iprange --src-range 172.xxx.xxx.xxx-172.xxx.xxx.xxx -j ACCEPT -A INPUT -m iprange --src-range 172.xxx.xxx.xxx-172.xxx.xxx.xxx -j ACCEPT -A INPUT -m iprange --src-range 172.xxx.xxx.xxx-172.xxx.xxx.xxx -j ACCEPT |
检查防火墙配置规则,可以看到私网ip以及haip都是放开的。排除防护墙问题。
3节点gipc日志:
1 2 3 4 5 6 7 8 |
2023-03-16 15:39:55.874: [GIPCDCLT][3716376320] gipcdClientThread: req from local client of type gipcdmsgtypeLookupName, endp 0000000000000323 2023-03-16 15:39:55.874: [GIPCDCLT][3716376320] gipcdClientLookupName: Received type(gipcdmsgtypeLookupName), endp(0000000000000323), len(1032), buf(0x7f6bd417b4f8):[hostname(core-js3), portstr: (afd8-f8f3-4b7d-938d), haname(), retStatus(gipcretSuccess)] 2023-03-16 15:39:55.874: [ GIPCLIB][3716376320] gipclibMapSearch: gipcMapSearch() -> gipcMapGetNodeAddr() failed: ret:gipcretKeyNotFound (36), ht:0x7f6bd402eb50, idxPtr:0x7f6bd4048ef0, key:0x7f6bdd834b90, flags:0x0 2023-03-16 15:39:55.874: [GIPCDCLT][3716376320] gipcdClientThread: Sent req to local client on endp 0000000000000323:(type(gipcdmsgtypeLookupNameAck), endp(0000000000000323), len(1032), buf(0x7f6bd4224ac0):[hostname(core-js3), portstr: (afd8-f8f3-4b7d-938d), haname(), retStatus(gipcretKeyNotFound)]) 2023-03-16 15:39:55.874: [GIPCDCLT][3716376320] gipcdClientThread: req from local client of type gipcdmsgtypeLookupName, endp 0000000000000323 2023-03-16 15:39:55.874: [GIPCDCLT][3716376320] gipcdClientLookupName: Received type(gipcdmsgtypeLookupName), endp(0000000000000323), len(1032), buf(0x7f6bd4224b48):[hostname(core-js3), portstr: (ae83-aa31-1cf0-f32e), haname(), retStatus(gipcretSuccess)] 2023-03-16 15:39:55.874: [ GIPCLIB][3716376320] gipclibMapSearch: gipcMapSearch() -> gipcMapGetNodeAddr() failed: ret:gipcretKeyNotFound (36), ht:0x7f6bd402eb50, idxPtr:0x7f6bd4040d90, key:0x7f6bdd834b90, flags:0x0 2023-03-16 15:39:55.874: [GIPCDCLT][3716376320] gipcdClientThread: Sent req to local client on endp 0000000000000323:(type(gipcdmsgtypeLookupNameAck), endp(0000000000000323), len(1032), buf(0x7f6bd417b470):[hostname(core-js3), portstr: (ae83-aa31-1cf0-f32e), haname(), retStatus(gipcretKeyNotFound)]) |
gipc异常,尝试kill 3节点gipc进程没有作用。并且通过watch+netstat -s观察网络统计的异常指标,并未发现有增长。
继续分析1节点ctss日志,只截取关键信息:
1 |
2023-03-16 14:29:04.873: [ CTSS][2797025024]ctsscomm_msg_hndlr: Received from slave ( mode [0xc4] nodenum [2] hostname [core-js2] ) |
只成功接收到2节点的信息。
1 2 3 4 5 6 |
2023-03-16 14:29:13.365: [GIPCHAUP][2805810944] gipchaUpperProcessDisconnect: processing DISCONNECT for hendp 0x7fb6880d5d10 [000000000ebfc82c] { gipchaEndpoint : port 'fa0a-4588-b3f1-3d14', peer 'core-js3:CTSSGROUP_3/ea24-d1e7-19ff-c645', srcCid 00000000-0ebfc82c, dstCid 00000000-00064b04, numSend 0, maxSend 100, groupListType 1, hagroup 0x1654f90, usrFlags 0x4000, flags 0x204 } 2023-03-16 14:29:13.716: [GIPCHAUP][2805810944] gipchaUpperProcessDisconnect: processing DISCONNECT for hendp 0x7fb623b3e150 [000000000ebfc6ab] { gipchaEndpoint : port 'CTSSGROUP_1/7879-176e-8975-a49d', peer 'core-js5:3c81-0971-ff00-3b4c', srcCid 00000000-0ebfc6ab, dstCid 00000000-00f82fbc, numSend 0, maxSend 100, groupListType 1, hagroup 0x1654f90, usrFlags 0x4000, flags 0x204 } 2023-03-16 14:29:22.945: [GIPCHAUP][2805810944] gipchaUpperProcessDisconnect: processing DISCONNECT for hendp 0x7fb623b08b10 [000000000ebfc7a1] { gipchaEndpoint : port 'CTSSGROUP_1/073a-7810-bae1-426a', peer 'core-js4:388e-ae87-6944-0202', srcCid 00000000-0ebfc7a1, dstCid 00000000-02848bb3, numSend 0, maxSend 100, groupListType 1, hagroup 0x1654f90, usrFlags 0x4000, flags 0x204 } 2023-03-16 14:29:18.616: [ CRSCCL][2799126272]PNC: Waiting for peer join from grpstat for peer (3,1718544606). 2023-03-16 14:29:20.327: [GIPCHAUP][2805810944] gipchaUpperProcessDisconnect: processing DISCONNECT for hendp 0x7fb623b5b3a0 [000000000ebfc755] { gipchaEndpoint : port 'CTSSGROUP_1/4e5d-8fc4-b709-d718', peer 'core-js6:d1a4-904c-bd8f-3cb1', srcCid 00000000-0ebfc755, dstCid 00000000-00f83f34, numSend 0, maxSend 100, groupListType 1, hagroup 0x1654f90, usrFlags 0x4000, flags 0x204 } 2023-03-16 14:29:20.328: [ CRSCCL][2799126272]clsCclGipcWait: GW: Marking Disconnected on temp con list. |
但是3-6节点就很不幸了,全部disconect了。这时判断是节点1的gipc出现了问题。kill 1节点gipc之后,CTSS参考节点变成了2节点,此时所有节点时钟同步恢复正常
但是这里有一个疑问,就是为何唯独2节点是同步正常的呢?据后来询问客户,之前是2节点的rac,只有节点1和节点2,后续通过加节点的方式扩展到了6节点的rac,但是3-6节点时钟同步一直不正常。
猜测是,使用CTSS方式同步时钟,如果新增节点,新增的节点的CTSS组件通过gipc与CTSS参考节点通信时会出现异常,导致时钟同步无法正常工作。