Voting disk goes offline, causing the cssd process to abort and leaving the node's cluster resources unavailable

System environment:

Virtualization platform: Huawei FusionSphere

Operating system: Red Hat

Storage: EMC unit-400

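Symptom: the application team reported that the database could not be reached, with an error pointing to a listener failure.

Fault analysis:

1. Log in to the database server and check the cluster resource status

crsctl status res -t

The command reported that the cluster resources were abnormal and the crs resources were offline.

2. lsblk hung and produced no output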
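A hung lsblk usually blocks the terminal it was run from. The following is a minimal sketch, not part of the original troubleshooting, of probing a command with a timeout so that a hang is reported instead of blocking the session. The function name probe and the 10-second default are assumptions chosen for illustration.

```python
import subprocess
import sys

def probe(cmd, timeout_s=10.0):
    """Run cmd and return 'ok', 'failed', or 'hung' without blocking forever."""
    proc = subprocess.Popen(
        cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
    )
    try:
        rc = proc.wait(timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # Request a kill but do not wait for the child: a process blocked in
        # uninterruptible I/O may not exit until the I/O itself returns.
        proc.kill()
        return "hung"
    return "ok" if rc == 0 else "failed"

# Example (assumes lsblk is installed on the host):
#   probe(["lsblk"], timeout_s=10)
```

The sketch deliberately avoids subprocess.run with a timeout, because that call waits for the child to exit after killing it and could itself block if the child is stuck on disk I/O.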

3. Check crsd.log; the relevant errors are shown below:

2020-06-30 02:17:18.112: [    CRSD][2161772320] Logging level for Module: OCRASM  1

2020-06-30 02:17:18.112: [ CRSMAIN][2161772320] Checking the OCR device

2020-06-30 02:17:18.112: [ CRSMAIN][2161772320] Sync-up with OCR

2020-06-30 02:17:18.112: [ CRSMAIN][2161772320] Connecting to the CSS Daemon

2020-06-30 02:17:18.123: [ CSSCLNT][2155321088]clssnsquerymode: not connected to CSSD

2020-06-30 02:17:48.185: [ CSSCLNT][2155321088]clssnsquerymode: not connected to CSSD

2020-06-30 02:18:18.190: [ CSSCLNT][2155321088]clssnsquerymode: not connected to CSSD

2020-06-30 02:18:48.194: [ CSSCLNT][2155321088]clssnsquerymode: not connected to CSSD

2020-06-30 02:19:18.197: [ CSSCLNT][2155321088]clssnsquerymode: not connected to CSSD

2020-06-30 02:19:18.214: [ CSSCLNT][2161772320]clssscConnect: gipcWait failed with 16 (0x1a)

2020-06-30 02:19:18.214: [ CSSCLNT][2161772320]clsssInitNative: connect to (ADDRESS=(PROTOCOL=ipc)(KEY=OCSSD_LL_oracle-rac01_)) failed, rc 16

2020-06-30 02:19:18.218: [  CRSRTI][2161772320] CSS is not ready. Received status 3

2020-06-30 02:19:18.218: [    CRSD][2161772320] Created alert : (:CRSD00109:) :  Could not init the CSS context, error: 3

2020-06-30 02:19:18.218: [    CRSD][2161772320][PANIC] CRSD exiting: Could not init the CSS context, error: 3

2020-06-30 02:19:18.218: [    CRSD][2161772320] Done.

2020-06-30 02:21:18.504: [ CSSCLNT][387819296]clssscConnect: gipcWait failed with 16 (0x1a)

2020-06-30 02:21:18.504: [ CSSCLNT][387819296]clsssInitNative: connect to (ADDRESS=(PROTOCOL=ipc)(KEY=OCSSD_LL_oracle-rac01_)) failed, rc 16

2020-06-30 02:21:18.508: [  CRSRTI][387819296] CSS is not ready. Received status 3

2020-06-30 02:21:18.508: [ CRSMAIN][387819296] First attempt: init CSS context failed. Error = 3

[  clsdmt][381368064]Listening to (ADDRESS=(PROTOCOL=ipc)(KEY=oracle-rac01DBG_CRSD))

2020-06-30 02:21:18.511: [  clsdmt][381368064]PID for the Process [23043], connkey 1

2020-06-30 02:21:18.512: [  clsdmt][381368064]Creating PID [23043] file for home /u01/11.2.0/grid host oracle-rac01 bin crs to /u01/11.2.0/grid/crs/init/

2020-06-30 02:21:18.512: [  clsdmt][381368064]Writing PID [23043] to the file [/u01/11.2.0/grid/crs/init/oracle-rac01.pid]

crsd.log shows that CRSD could not connect to the CSS daemon, which was unavailable, so CRSD exited.
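The retry pattern in the excerpt above can be pulled out of the log programmatically. The following is a minimal sketch, assuming the line layout seen in this excerpt (timestamp, component tag, thread id, message); the regular expression and the function names are illustrative and are not part of any Oracle tool.

```python
import re
from datetime import datetime

# Timestamp, component tag, thread id, and message, as laid out in the excerpt.
LINE_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}): \[\s*(\w+)\]\[\d+\]\s*(.*)$"
)

def css_connect_failures(lines):
    """Return the timestamps of 'not connected to CSSD' messages."""
    stamps = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m and m.group(2) == "CSSCLNT" and "not connected to CSSD" in m.group(3):
            stamps.append(datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S.%f"))
    return stamps

def retry_gaps(stamps):
    """Return the seconds elapsed between consecutive failure timestamps."""
    return [(b - a).total_seconds() for a, b in zip(stamps, stamps[1:])]

sample = [
    "2020-06-30 02:17:18.123: [ CSSCLNT][2155321088]clssnsquerymode: not connected to CSSD",
    "2020-06-30 02:17:48.185: [ CSSCLNT][2155321088]clssnsquerymode: not connected to CSSD",
    "2020-06-30 02:19:18.218: [    CRSD][2161772320][PANIC] CRSD exiting: Could not init the CSS context, error: 3",
]

stamps = css_connect_failures(sample)
gaps = retry_gaps(stamps)
print(len(stamps), gaps)
```

On the sample lines the gap between the two failures is roughly 30 seconds, matching the spacing of the retries in the excerpt.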

4. Check the CSS log; the relevant errors are shown below:

2020-06-30 02:04:04.814: [    CSSD][2121881344]clssscMonitorThreads clssnmvWorkerThread not scheduled for 196870 msecs

2020-06-30 02:04:04.814: [    CSSD][2121881344]clssscMonitorThreads clssnmvDiskPingThread not scheduled for 811354450 msecs

2020-06-30 02:04:04.814: [    CSSD][2121881344]clssscMonitorThreads clssnmvWorkerThread not scheduled for 811354540 msecs

2020-06-30 02:04:04.942: [    CSSD][2095437568]clssnmSendingThread: sending status msg to all nodes

2020-06-30 02:04:04.942: [    CSSD][2095437568]clssnmSendingThread: sent 5 status msgs to all nodes

2020-06-30 02:04:07.813: [    CSSD][2101761792](:CSSNM00058:)clssnmvDiskCheck: No I/O completions for 200460 ms for voting file /dev/asm-diskf)

2020-06-30 02:04:07.813: [    CSSD][2101761792]clssnmCompleteGMReq: Completed request type 17 with status 1

2020-06-30 02:04:07.813: [    CSSD][2101761792]clssgmDoneQEle: re-queueing req 0x7fba7653d510 status 1

2020-06-30 02:04:07.813: [    CSSD][2101761792]clssnmvDiskAvailabilityChange: voting file /dev/asm-diskf now offline

2020-06-30 02:04:07.813: [    CSSD][2101761792](:CSSNM00018:)clssnmvDiskCheck: Aborting, 1 of 3 configured voting disks available, need 2

2020-06-30 02:04:07.813: [    CSSD][2101761792]###################################

2020-06-30 02:04:07.813: [    CSSD][2101761792]clssscExit: CSSD aborting from thread clssnmvDiskPingMonitorThread

2020-06-30 02:04:07.813: [    CSSD][2101761792]###################################

2020-06-30 02:04:07.813: [    CSSD][2101761792](:CSSSC00012:)clssscExit: A fatal error occurred and the CSS daemon is terminating abnormally

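cssd.log shows that I/O to the voting file /dev/asm-diskf timed out and the file was taken offline. That left only 1 of the 3 configured voting disks available where 2 are required, so the CSS daemon aborted.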
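The abort condition reported by CSSD is a majority check: more than half of the configured voting disks must be reachable. The following is a minimal sketch of that arithmetic together with extraction of the two key messages from the excerpt. The regular expressions and the function names are assumptions written for this excerpt's message format, not an Oracle interface.

```python
import re

NO_IO_RE = re.compile(r"No I/O completions for (\d+) ms for voting file (\S+?)\)?$")
ABORT_RE = re.compile(
    r"Aborting, (\d+) of (\d+) configured voting disks available, need (\d+)"
)

def required_voting_disks(configured):
    """Strict majority: more than half of the configured voting disks."""
    return configured // 2 + 1

def summarize(lines):
    """Collect per-disk I/O stall times and the abort counts, if present."""
    timeouts, abort = {}, None
    for line in lines:
        line = line.strip()
        m = NO_IO_RE.search(line)
        if m:
            timeouts[m.group(2)] = int(m.group(1))
        m = ABORT_RE.search(line)
        if m:
            # (available, configured, needed) as reported by CSSD
            abort = tuple(int(g) for g in m.groups())
    return timeouts, abort

sample = [
    "2020-06-30 02:04:07.813: [    CSSD][2101761792](:CSSNM00058:)clssnmvDiskCheck: No I/O completions for 200460 ms for voting file /dev/asm-diskf)",
    "2020-06-30 02:04:07.813: [    CSSD][2101761792](:CSSNM00018:)clssnmvDiskCheck: Aborting, 1 of 3 configured voting disks available, need 2",
]

timeouts, abort = summarize(sample)
available, configured, needed = abort
print(timeouts, available < required_voting_disks(configured))
```

With 3 configured disks the majority is 2, which agrees with the "need 2" value in the log; 1 available disk is below that threshold.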

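5. Troubleshooting steps

5.1 Checked the related disks on the virtualization platform and the storage array. No error logs were found, so the voting disk appeared to be healthy.

5.2 Tried to restart the cluster resources. The ocssd process could not be stopped and the crs restart failed. This suggested that the system management processes could no longer communicate and that operating system processes were in an abnormal state.

5.3 Rebooted the operating system. After the reboot the cluster resources returned to normal and service was restored.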

Open question: the root cause of the voting disk going offline still needs to be analyzed.
