System environment:
Virtualization platform: Huawei FusionSphere
Operating system: Red Hat
Storage: EMC Unity 400
Symptom: The application team reported that the database could not be reached; clients received a listener error.
Fault analysis:
1. Log in to the database server and check the cluster resource status:
crsctl status res -t
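The output showed the cluster resources in an abnormal state, with the CRS resources offline.
2. Running lsblk hung and produced no output.
3. Check crsd.log. The relevant error messages are as follows: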
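A hang like the one in step 2 can be detected without blocking the terminal by running the command under a timeout. The following is a minimal Python sketch, not part of the original procedure; the helper name probe_command and the 10-second default are assumptions chosen for illustration. It uses Popen directly rather than subprocess.run, because on POSIX subprocess.run waits for the child after killing it, and that wait could itself block on a process stuck in uninterruptible I/O.

```python
import subprocess


def probe_command(cmd, timeout_s=10.0):
    """Run cmd and classify the result as 'ok', 'failed', or 'hung'."""
    proc = subprocess.Popen(
        cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
    )
    try:
        rc = proc.wait(timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # Best effort only: a process blocked in uninterruptible I/O
        # (state D) does not act on SIGKILL until the I/O returns,
        # so we send the signal and deliberately do not wait for it.
        proc.kill()
        return "hung"
    return "ok" if rc == 0 else "failed"
```

On a node in the state described here, probe_command(["lsblk"]) would be expected to return "hung" once the timeout expires.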
2020-06-30 02:17:18.112: [ CRSD][2161772320] Logging level for Module: OCRASM 1
2020-06-30 02:17:18.112: [ CRSMAIN][2161772320] Checking the OCR device
2020-06-30 02:17:18.112: [ CRSMAIN][2161772320] Sync-up with OCR
2020-06-30 02:17:18.112: [ CRSMAIN][2161772320] Connecting to the CSS Daemon
2020-06-30 02:17:18.123: [ CSSCLNT][2155321088]clssnsquerymode: not connected to CSSD
2020-06-30 02:17:48.185: [ CSSCLNT][2155321088]clssnsquerymode: not connected to CSSD
2020-06-30 02:18:18.190: [ CSSCLNT][2155321088]clssnsquerymode: not connected to CSSD
2020-06-30 02:18:48.194: [ CSSCLNT][2155321088]clssnsquerymode: not connected to CSSD
2020-06-30 02:19:18.197: [ CSSCLNT][2155321088]clssnsquerymode: not connected to CSSD
2020-06-30 02:19:18.214: [ CSSCLNT][2161772320]clssscConnect: gipcWait failed with 16 (0x1a)
2020-06-30 02:19:18.214: [ CSSCLNT][2161772320]clsssInitNative: connect to (ADDRESS=(PROTOCOL=ipc)(KEY=OCSSD_LL_oracle-rac01_)) failed, rc 16
2020-06-30 02:19:18.218: [ CRSRTI][2161772320] CSS is not ready. Received status 3
2020-06-30 02:19:18.218: [ CRSD][2161772320] Created alert : (:CRSD00109:) : Could not init the CSS context, error: 3
2020-06-30 02:19:18.218: [ CRSD][2161772320][PANIC] CRSD exiting: Could not init the CSS context, error: 3
2020-06-30 02:19:18.218: [ CRSD][2161772320] Done.
2020-06-30 02:21:18.504: [ CSSCLNT][387819296]clssscConnect: gipcWait failed with 16 (0x1a)
2020-06-30 02:21:18.504: [ CSSCLNT][387819296]clsssInitNative: connect to (ADDRESS=(PROTOCOL=ipc)(KEY=OCSSD_LL_oracle-rac01_)) failed, rc 16
2020-06-30 02:21:18.508: [ CRSRTI][387819296] CSS is not ready. Received status 3
2020-06-30 02:21:18.508: [ CRSMAIN][387819296] First attempt: init CSS context failed. Error = 3
[ clsdmt][381368064]Listening to (ADDRESS=(PROTOCOL=ipc)(KEY=oracle-rac01DBG_CRSD))
2020-06-30 02:21:18.511: [ clsdmt][381368064]PID for the Process [23043], connkey 1
2020-06-30 02:21:18.512: [ clsdmt][381368064]Creating PID [23043] file for home /u01/11.2.0/grid host oracle-rac01 bin crs to /u01/11.2.0/grid/crs/init/
2020-06-30 02:21:18.512: [ clsdmt][381368064]Writing PID [23043] to the file [/u01/11.2.0/grid/crs/init/oracle-rac01.pid]
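crsd.log shows that CRSD could not connect to the CSS daemon (CSSD), which was unavailable, so CRSD exited.
4. Check the CSS log. The relevant error messages are as follows: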
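When crsd.log is long, the same conclusion can be reached by counting the failed CSSD connection attempts and pulling out the PANIC lines. The sketch below is a minimal Python illustration; the function name summarize_crsd_log is hypothetical, and the regular expression assumes the line layout exactly as it appears in the excerpt above (timestamp, module in brackets, thread ID in brackets, message).

```python
import re

# Matches lines such as:
# 2020-06-30 02:17:18.123: [ CSSCLNT][2155321088]clssnsquerymode: not connected to CSSD
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}): "
    r"\[\s*(?P<module>\w+)\]\[(?P<tid>\d+)\](?P<rest>.*)$"
)


def summarize_crsd_log(lines):
    """Count failed CSSD connection attempts and collect PANIC messages."""
    not_connected = 0
    panics = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m is None:
            continue  # skip lines that do not start with a timestamp
        rest = m.group("rest")
        if "not connected to CSSD" in rest:
            not_connected += 1
        if rest.startswith("[PANIC]"):
            # Keep the timestamp and the message text after the tag.
            panics.append((m.group("ts"), rest[len("[PANIC]"):].strip()))
    return not_connected, panics
```

Run over the excerpt above, this would report five "not connected to CSSD" attempts between 02:17:18 and 02:19:18, followed by one PANIC at 02:19:18.218.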
2020-06-30 02:04:04.814: [ CSSD][2121881344]clssscMonitorThreads clssnmvWorkerThread not scheduled for 196870 msecs
2020-06-30 02:04:04.814: [ CSSD][2121881344]clssscMonitorThreads clssnmvDiskPingThread not scheduled for 811354450 msecs
2020-06-30 02:04:04.814: [ CSSD][2121881344]clssscMonitorThreads clssnmvWorkerThread not scheduled for 811354540 msecs
2020-06-30 02:04:04.942: [ CSSD][2095437568]clssnmSendingThread: sending status msg to all nodes
2020-06-30 02:04:04.942: [ CSSD][2095437568]clssnmSendingThread: sent 5 status msgs to all nodes
2020-06-30 02:04:07.813: [ CSSD][2101761792](:CSSNM00058:)clssnmvDiskCheck: No I/O completions for 200460 ms for voting file /dev/asm-diskf)
2020-06-30 02:04:07.813: [ CSSD][2101761792]clssnmCompleteGMReq: Completed request type 17 with status 1
2020-06-30 02:04:07.813: [ CSSD][2101761792]clssgmDoneQEle: re-queueing req 0x7fba7653d510 status 1
2020-06-30 02:04:07.813: [ CSSD][2101761792]clssnmvDiskAvailabilityChange: voting file /dev/asm-diskf now offline
2020-06-30 02:04:07.813: [ CSSD][2101761792](:CSSNM00018:)clssnmvDiskCheck: Aborting, 1 of 3 configured voting disks available, need 2
2020-06-30 02:04:07.813: [ CSSD][2101761792]###################################
2020-06-30 02:04:07.813: [ CSSD][2101761792]clssscExit: CSSD aborting from thread clssnmvDiskPingMonitorThread
2020-06-30 02:04:07.813: [ CSSD][2101761792]###################################
2020-06-30 02:04:07.813: [ CSSD][2101761792](:CSSSC00012:)clssscExit: A fatal error occurred and the CSS daemon is terminating abnormally
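cssd.log shows that I/O to the voting file /dev/asm-diskf timed out (no I/O completions for 200460 ms). CSSD took /dev/asm-diskf offline, which left only 1 of the 3 configured voting disks available when 2 are required, so CSSD terminated abnormally.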
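The two key log lines (CSSNM00058 and CSSNM00018) carry enough information to estimate when I/O to the voting file stopped and to confirm that quorum was lost. The sketch below is a minimal Python illustration under the assumption that the message wording matches the excerpt above; the function names stall_start and quorum_lost are hypothetical.

```python
import re
from datetime import datetime, timedelta

STALL_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}): .*"
    r"No I/O completions for (?P<ms>\d+) ms for voting file (?P<disk>\S+?)\)?$"
)
QUORUM_RE = re.compile(
    r"(?P<avail>\d+) of (?P<total>\d+) configured voting disks available, "
    r"need (?P<need>\d+)"
)


def stall_start(line):
    """Return (disk, estimated time of the last completed I/O) or None."""
    m = STALL_RE.match(line.strip())
    if m is None:
        return None
    logged_at = datetime.strptime(m.group("ts"), "%Y-%m-%d %H:%M:%S.%f")
    # The message reports how long no I/O has completed, so subtract it.
    stalled_for = timedelta(milliseconds=int(m.group("ms")))
    return m.group("disk"), logged_at - stalled_for


def quorum_lost(line):
    """Return True or False for a CSSNM00018-style line, None if no match."""
    m = QUORUM_RE.search(line)
    if m is None:
        return None
    return int(m.group("avail")) < int(m.group("need"))
```

Applied to the CSSNM00058 line above, the estimate is that the last I/O to /dev/asm-diskf completed at about 02:00:47.353, a little over three minutes before CSSD aborted. That time window is a reasonable place to start when correlating with storage and hypervisor logs.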
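5. Troubleshooting steps
5.1 Checked the disks concerned on both the virtualization platform and the storage array. Neither side had logged any errors, so the voting disks appeared to be healthy.
5.2 Tried to restart the cluster resources. The ocssd process could not be stopped and the CRS restart failed. This suggested that system management processes were unable to communicate and that processes at the operating system level were in an abnormal state.
5.3 Rebooted the operating system. After the reboot, the cluster resources returned to normal and service was restored.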
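A process that cannot be stopped, as in step 5.2, together with the lsblk hang, is consistent with processes blocked in uninterruptible I/O (state D on Linux). The original investigation does not say whether this was checked, so the following is only a sketch of how it could be verified before rebooting. It reads the state field from /proc/<pid>/stat; the function names are hypothetical.

```python
import os


def parse_stat(stat_text):
    """Extract (pid, comm, state) from the contents of /proc/<pid>/stat."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the last ')' instead of splitting on whitespace.
    head, _, tail = stat_text.rpartition(")")
    pid_text, _, comm = head.partition("(")
    return int(pid_text), comm, tail.split()[0]


def uninterruptible_processes(proc_root="/proc"):
    """List (pid, comm) for processes currently in state D."""
    found = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a process directory
        stat_path = os.path.join(proc_root, entry, "stat")
        try:
            with open(stat_path, errors="replace") as fh:
                pid, comm, state = parse_stat(fh.read())
        except OSError:
            continue  # the process exited while we were scanning
        if state == "D":
            found.append((pid, comm))
    return found
```

If ocssd.bin or lsblk showed up in this list and stayed there across repeated scans, that would point to a stalled I/O path rather than to a fault inside Clusterware.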
Open question: the root cause of the voting disk going offline still needs further analysis.