System environment:
Virtualization platform: Huawei FusionSphere
Operating system: Red Hat
Storage: EMC Unity 400
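Symptom: the application team reported that the database could not be reached, with an error indicating a listener failure.
Analysis:
1. Log in to the database server and check the cluster resource status: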
crsctl status res -t
The output shows the cluster resources in an abnormal state, with the CRS resources offline.
2. lsblk hangs and produces no output.
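A hung lsblk may mean that processes are blocked in uninterruptible I/O wait (state D in ps). The following is a minimal Python sketch, not part of the original investigation, that lists such processes by reading /proc/<pid>/stat on Linux. The function names parse_stat and uninterruptible_processes are hypothetical helpers introduced for illustration.

```python
import os


def parse_stat(stat_text):
    """Return (pid, comm, state) from the contents of /proc/<pid>/stat."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the LAST ')' instead of splitting on whitespace.
    head, _, tail = stat_text.rpartition(")")
    pid_str, _, comm = head.partition("(")
    return int(pid_str), comm, tail.split()[0]


def uninterruptible_processes(proc_root="/proc"):
    """List (pid, comm) for every process currently in state D."""
    found = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as "self" or "meminfo"
        try:
            with open(os.path.join(proc_root, entry, "stat")) as fh:
                pid, comm, state = parse_stat(fh.read())
        except (OSError, ValueError, IndexError):
            continue  # the process exited or its stat file is unreadable
        if state == "D":
            found.append((pid, comm))
    return sorted(found)
```

Calling uninterruptible_processes() on the affected node would return the blocked processes, if any. Reading /proc directly avoids spawning another tool that could itself hang on the same device.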
3. Check crsd.log; the errors are as follows:
2020-06-30 02:17:18.112: [ CRSD][2161772320] Logging level for Module: OCRASM 1
2020-06-30 02:17:18.112: [ CRSMAIN][2161772320] Checking the OCR device
2020-06-30 02:17:18.112: [ CRSMAIN][2161772320] Sync-up with OCR
2020-06-30 02:17:18.112: [ CRSMAIN][2161772320] Connecting to the CSS Daemon
2020-06-30 02:17:18.123: [ CSSCLNT][2155321088]clssnsquerymode: not connected to CSSD
2020-06-30 02:17:48.185: [ CSSCLNT][2155321088]clssnsquerymode: not connected to CSSD
2020-06-30 02:18:18.190: [ CSSCLNT][2155321088]clssnsquerymode: not connected to CSSD
2020-06-30 02:18:48.194: [ CSSCLNT][2155321088]clssnsquerymode: not connected to CSSD
2020-06-30 02:19:18.197: [ CSSCLNT][2155321088]clssnsquerymode: not connected to CSSD
2020-06-30 02:19:18.214: [ CSSCLNT][2161772320]clssscConnect: gipcWait failed with 16 (0x1a)
2020-06-30 02:19:18.214: [ CSSCLNT][2161772320]clsssInitNative: connect to (ADDRESS=(PROTOCOL=ipc)(KEY=OCSSD_LL_oracle-rac01_)) failed, rc 16
2020-06-30 02:19:18.218: [ CRSRTI][2161772320] CSS is not ready. Received status 3
2020-06-30 02:19:18.218: [ CRSD][2161772320] Created alert : (:CRSD00109:) : Could not init the CSS context, error: 3
2020-06-30 02:19:18.218: [ CRSD][2161772320][PANIC] CRSD exiting: Could not init the CSS context, error: 3
2020-06-30 02:19:18.218: [ CRSD][2161772320] Done.
2020-06-30 02:21:18.504: [ CSSCLNT][387819296]clssscConnect: gipcWait failed with 16 (0x1a)
2020-06-30 02:21:18.504: [ CSSCLNT][387819296]clsssInitNative: connect to (ADDRESS=(PROTOCOL=ipc)(KEY=OCSSD_LL_oracle-rac01_)) failed, rc 16
2020-06-30 02:21:18.508: [ CRSRTI][387819296] CSS is not ready. Received status 3
2020-06-30 02:21:18.508: [ CRSMAIN][387819296] First attempt: init CSS context failed. Error = 3
[ clsdmt][381368064]Listening to (ADDRESS=(PROTOCOL=ipc)(KEY=oracle-rac01DBG_CRSD))
2020-06-30 02:21:18.511: [ clsdmt][381368064]PID for the Process [23043], connkey 1
2020-06-30 02:21:18.512: [ clsdmt][381368064]Creating PID [23043] file for home /u01/11.2.0/grid host oracle-rac01 bin crs to /u01/11.2.0/grid/crs/init/
2020-06-30 02:21:18.512: [ clsdmt][381368064]Writing PID [23043] to the file [/u01/11.2.0/grid/crs/init/oracle-rac01.pid]
crsd.log shows that CRSD cannot connect to the CSS daemon, which is unavailable, so CRSD exits.
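To quantify how long CRSD kept retrying before it gave up, the log can be parsed mechanically. This is a minimal Python sketch that assumes the line layout seen in the excerpt above (timestamp, component in brackets, thread id in brackets, message). The names parse_line and summarize are hypothetical, and SAMPLE is a subset of the lines quoted above.

```python
import re
from collections import Counter
from datetime import datetime

# Layout assumed from the excerpt: "<timestamp>: [ COMPONENT][thread-id]message"
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}): "
    r"\[\s*(?P<component>\w+)\]\[(?P<tid>\d+)\]\s*(?P<msg>.*)$"
)

SAMPLE = [
    "2020-06-30 02:17:18.123: [ CSSCLNT][2155321088]clssnsquerymode: not connected to CSSD",
    "2020-06-30 02:17:48.185: [ CSSCLNT][2155321088]clssnsquerymode: not connected to CSSD",
    "2020-06-30 02:18:18.190: [ CSSCLNT][2155321088]clssnsquerymode: not connected to CSSD",
    "2020-06-30 02:18:48.194: [ CSSCLNT][2155321088]clssnsquerymode: not connected to CSSD",
    "2020-06-30 02:19:18.197: [ CSSCLNT][2155321088]clssnsquerymode: not connected to CSSD",
    "2020-06-30 02:19:18.218: [ CRSD][2161772320][PANIC] CRSD exiting: Could not init the CSS context, error: 3",
]


def parse_line(line):
    """Split one log line into its fields; return None for lines without a timestamp."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    return {
        "ts": datetime.strptime(m.group("ts"), "%Y-%m-%d %H:%M:%S.%f"),
        "component": m.group("component"),
        "tid": int(m.group("tid")),
        "msg": m.group("msg"),
    }


def summarize(lines):
    """Count CSSD connection retries, their time span, and any PANIC messages."""
    records = [r for r in map(parse_line, lines) if r is not None]
    retries = [r for r in records if "not connected to CSSD" in r["msg"]]
    span = (retries[-1]["ts"] - retries[0]["ts"]).total_seconds() if retries else 0.0
    return {
        "by_component": Counter(r["component"] for r in records),
        "cssd_retries": len(retries),
        "retry_span_s": span,
        "panic": [r["msg"] for r in records if "[PANIC]" in r["msg"]],
    }
```

On the sample lines, summarize(SAMPLE) reports five retries roughly 30 seconds apart, followed by the PANIC exit.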
4. Check the CSS log; the errors are as follows:
2020-06-30 02:04:04.814: [ CSSD][2121881344]clssscMonitorThreads clssnmvWorkerThread not scheduled for 196870 msecs
2020-06-30 02:04:04.814: [ CSSD][2121881344]clssscMonitorThreads clssnmvDiskPingThread not scheduled for 811354450 msecs
2020-06-30 02:04:04.814: [ CSSD][2121881344]clssscMonitorThreads clssnmvWorkerThread not scheduled for 811354540 msecs
2020-06-30 02:04:04.942: [ CSSD][2095437568]clssnmSendingThread: sending status msg to all nodes
2020-06-30 02:04:04.942: [ CSSD][2095437568]clssnmSendingThread: sent 5 status msgs to all nodes
2020-06-30 02:04:07.813: [ CSSD][2101761792](:CSSNM00058:)clssnmvDiskCheck: No I/O completions for 200460 ms for voting file /dev/asm-diskf)
2020-06-30 02:04:07.813: [ CSSD][2101761792]clssnmCompleteGMReq: Completed request type 17 with status 1
2020-06-30 02:04:07.813: [ CSSD][2101761792]clssgmDoneQEle: re-queueing req 0x7fba7653d510 status 1
2020-06-30 02:04:07.813: [ CSSD][2101761792]clssnmvDiskAvailabilityChange: voting file /dev/asm-diskf now offline
2020-06-30 02:04:07.813: [ CSSD][2101761792](:CSSNM00018:)clssnmvDiskCheck: Aborting, 1 of 3 configured voting disks available, need 2
2020-06-30 02:04:07.813: [ CSSD][2101761792]###################################
2020-06-30 02:04:07.813: [ CSSD][2101761792]clssscExit: CSSD aborting from thread clssnmvDiskPingMonitorThread
2020-06-30 02:04:07.813: [ CSSD][2101761792]###################################
2020-06-30 02:04:07.813: [ CSSD][2101761792](:CSSSC00012:)clssscExit: A fatal error occurred and the CSS daemon is terminating abnormally
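The CSSD log shows that I/O to voting file /dev/asm-diskf timed out and the disk was taken offline. That left only 1 of the 3 configured voting disks available (2 are required), so the CSS daemon aborted.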
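The abort condition in the log ("1 of 3 configured voting disks available, need 2") is consistent with a strict-majority rule. The sketch below, in Python, extracts the voting-file events from CSSD log lines and checks that arithmetic. The regular expressions are assumptions based only on the message text quoted above, and majority_needed and scan are hypothetical helper names.

```python
import re

# Patterns assumed from the message text in the excerpt above.
NO_IO_RE = re.compile(r"No I/O completions for (\d+) ms for voting file ([^\s)]+)")
OFFLINE_RE = re.compile(r"voting file (\S+) now offline")
ABORT_RE = re.compile(r"Aborting, (\d+) of (\d+) configured voting disks available, need (\d+)")

SAMPLE = [
    "2020-06-30 02:04:07.813: [ CSSD][2101761792](:CSSNM00058:)clssnmvDiskCheck: No I/O completions for 200460 ms for voting file /dev/asm-diskf)",
    "2020-06-30 02:04:07.813: [ CSSD][2101761792]clssnmvDiskAvailabilityChange: voting file /dev/asm-diskf now offline",
    "2020-06-30 02:04:07.813: [ CSSD][2101761792](:CSSNM00018:)clssnmvDiskCheck: Aborting, 1 of 3 configured voting disks available, need 2",
]


def majority_needed(configured):
    """Smallest number of voting disks that forms a strict majority."""
    return configured // 2 + 1


def scan(lines):
    """Collect stalled voting files, offline events, and the abort summary."""
    result = {"stalled": {}, "offline": [], "abort": None}
    for line in lines:
        m = NO_IO_RE.search(line)
        if m:
            # Keep the longest stall seen for each voting file, in milliseconds.
            path, ms = m.group(2), int(m.group(1))
            result["stalled"][path] = max(ms, result["stalled"].get(path, 0))
        m = OFFLINE_RE.search(line)
        if m:
            result["offline"].append(m.group(1))
        m = ABORT_RE.search(line)
        if m:
            available, configured, needed = map(int, m.groups())
            result["abort"] = {
                "available": available,
                "configured": configured,
                "needed": needed,
            }
    return result
```

For three configured voting disks, majority_needed(3) is 2, which matches the "need 2" in the log. Note that the excerpt names only /dev/asm-diskf, while the abort line implies a second voting file was also unavailable at that point.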
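5. Recovery steps
5.1 Checked the disks on the virtualization and storage side. No errors were logged, so the voting disks appeared to be healthy.
5.2 Tried to restart the cluster resources. The ocssd process could not be stopped and the CRS restart failed, which suggested that system management processes could no longer communicate and that operating system processes were in an abnormal state.
5.3 Rebooted the operating system. After the reboot the cluster resources returned to normal and service was restored.
Open question: the root cause of the voting disks going offline still needs to be analyzed.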