Oracle Grid Infrastructure: Top 5 Startup Issues

Applies to:

Oracle Database - Enterprise Edition - Version 11.2.0.1 and later
Information in this document applies to any platform.

Purpose

This document summarizes the top 5 issues that can prevent Grid Infrastructure (GI) from starting successfully.

Scope

This document applies only to 11gR2 Grid Infrastructure.

To determine the status of GI, run the following commands:

1. $GRID_HOME/bin/crsctl check crs
2. $GRID_HOME/bin/crsctl stat res -t -init
3. $GRID_HOME/bin/crsctl stat res -t
4. ps -ef | egrep 'init|d.bin'
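
The output of the first command is enough for a rough triage. The sketch below maps the CRS error codes quoted in the symptom lists of this note to the issue they belong to. The function name `classify_crs_check` and its messages are illustrative assumptions, not part of any Oracle tooling, and the result is only a hint about which section to read first.

```shell
# Read the output of "crsctl check crs" on stdin and print which issue
# in this note it most resembles. The function name and messages are
# hypothetical; only the CRS codes come from the note.
classify_crs_check() {
    out=$(cat)
    case $out in
        *CRS-4639*) echo "Issue 1: ohasd not reachable (CRS-4639)" ;;
        # Test CRS-4530 before CRS-4535: the Issue 2 output contains both.
        *CRS-4530*) echo "Issue 2: ocssd not reachable (CRS-4530)" ;;
        *CRS-4535*) echo "Issue 3: crsd not reachable (CRS-4535)" ;;
        *)          echo "No CRS-4639/4530/4535 in output" ;;
    esac
}
```

A typical call would be: $GRID_HOME/bin/crsctl check crs 2>&1 | classify_crs_check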

Details

Issue 1: CRS-4639: Could not contact Oracle High Availability Services; ohasd.bin is not running, or ohasd.bin is running but init.ohasd or other processes are missing

Symptoms:

1. The command "$GRID_HOME/bin/crsctl check crs" returns the error:
     CRS-4639: Could not contact Oracle High Availability Services
2. The command "ps -ef | grep init" does not show a line similar to:
     root 4878 1 0 Sep12 ? 00:00:02 /bin/sh /etc/init.d/init.ohasd run
3. The command "ps -ef | grep d.bin" does not show a line similar to:
     root 21350 1 6 22:24 ? 00:00:01 /u01/app/11.2.0/grid/bin/ohasd.bin reboot
   or it shows only the "ohasd.bin reboot" process and no other processes
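
Symptoms 2 and 3 can be checked in one pass over the process list. This is a minimal sketch that assumes the command paths look like the sample lines above; the function name `check_ohasd_procs` is hypothetical.

```shell
# Read "ps -ef" output on stdin and report whether the two processes
# named in the symptoms are present. Exit status is 0 only if both are.
check_ohasd_procs() {
    awk '
        /\/etc\/init\.d\/init\.ohasd run/ { init = 1 }
        /\/ohasd\.bin/                     { ohasd = 1 }
        END {
            print "init.ohasd: " (init ? "running" : "missing")
            print "ohasd.bin: " (ohasd ? "running" : "missing")
            exit !(init && ohasd)
        }'
}
```

Run it as: ps -ef | check_ohasd_procs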

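Possible causes:

1. The file "/etc/inittab" does not contain the line
     h1:35:respawn:/etc/init.d/init.ohasd run >/dev/null 2>&1
2. Runlevel 3 has not been reached; some rc3 scripts are hanging
3. The init process (pid 1) did not spawn the process defined in /etc/inittab (h1), or an improper entry before init.ohasd, such as xx:wait:, blocks init.ohasd from starting
4. CRS autostart is disabled
5. The Oracle Local Registry ($GRID_HOME/cdata/.olr) is missing or corrupted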
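
Cause 1 is easy to rule in or out with a pattern match on /etc/inittab. The following sketch assumes the entry is written exactly as quoted in this note; the function name `has_ohasd_inittab` is hypothetical.

```shell
# Succeed if the given inittab file (default: /etc/inittab) contains an
# active, uncommented h1 entry for init.ohasd.
has_ohasd_inittab() {
    grep -q '^h1:35:respawn:/etc/init\.d/init\.ohasd run' "${1:-/etc/inittab}"
}
```

For example: has_ohasd_inittab || echo "h1 entry for init.ohasd is missing"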

Solutions:

1. Add the following line to /etc/inittab
     h1:35:respawn:/etc/init.d/init.ohasd run >/dev/null 2>&1
   and run "init q" as the root user.
2. Run the command "ps -ef | grep rc" and kill any rc3 scripts that appear to be stuck.
3. Remove the improper entry before init.ohasd. If "init q" does not spawn the "init.ohasd run" process, consult the OS vendor.
4. Enable CRS autostart:
   # crsctl enable crs
   # crsctl start crs
5. As the root user, restore the OLR (Oracle Local Registry) from a backup:
   # touch $GRID_HOME/cdata/.olr
   # chown root:oinstall $GRID_HOME/cdata/.olr
   # ocrconfig -local -restore $GRID_HOME/cdata//backup__.olr
   # crsctl start crs

If for some reason no OLR backup exists, rebuilding the OLR requires running deconfig and then rerunning root.sh as the root user:
   # $GRID_HOME/crs/install/rootcrs.pl -deconfig -force
   # $GRID_HOME/root.sh

Issue 2: CRS-4530: Communications failure contacting Cluster Synchronization Services daemon; ocssd.bin is not running

Symptoms:

1. The command "$GRID_HOME/bin/crsctl check crs" returns the errors:
    CRS-4638: Oracle High Availability Services is online
    CRS-4535: Cannot communicate with Cluster Ready Services
    CRS-4530: Communications failure contacting Cluster Synchronization Services daemon
    CRS-4534: Cannot communicate with Event Manager
2. The command "ps -ef | grep d.bin" does not show a line similar to:
    oragrid 21543 1 1 22:24 ? 00:00:01 /u01/app/11.2.0/grid/bin/ocssd.bin
3. ocssd.bin is running but aborts after the message "CLSGPNP_CALL_AGAIN" appears in ocssd.log
4. ocssd.log shows the following:

2012-01-27 13:42:58.796: [ CSSD][19]clssnmvDHBValidateNCopy: node 1, racnode1, has a disk HB, but no network HB, DHB has rcfg 223132864, wrtcnt, 1112, LATS 783238209,
lastSeqNo 1111, uniqueness 1327692232, timestamp 1327693378/787089065

5. In a cluster of 3 or more nodes, 2 nodes form a cluster without problems, but a failure occurs when the 3rd node joins; ocssd.log shows the following:

   2012-02-09 11:33:53.048: [ CSSD][1120926016](:CSSNM00008:)clssnmCheckDskInfo: Aborting local node to avoid splitbrain. Cohort of 2 nodes with leader 2, racnode2, is smaller than
   cohort of 2 nodes led by node 1, racnode1, based on map type 2
   2012-02-09 11:33:53.048: [ CSSD][1120926016]###################################
   2012-02-09 11:33:53.048: [ CSSD][1120926016]clssscExit: CSSD aborting from thread clssnmRcfgMgrThread

6. ocssd.bin startup times out after 10 minutes

   2012-04-08 12:04:33.153: [ CSSD][1]clssscmain: Starting CSS daemon, version 11.2.0.3.0, in (clustered) mode with uniqueness value 1333911873
   ......
   2012-04-08 12:14:31.994: [ CSSD][5]clssgmShutDown: Received abortive shutdown request from client.
   2012-04-08 12:14:31.994: [ CSSD][5]###################################
   2012-04-08 12:14:31.994: [ CSSD][5]clssscExit: CSSD aborting from thread GMClientListener
   2012-04-08 12:14:31.994: [ CSSD][5]###################################
   2012-04-08 12:14:31.994: [ CSSD][5](:CSSSC00012:)clssscExit: A fatal error occurred and the CSS daemon is terminating abnormally
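
The log messages quoted in symptoms 3 to 6 can be counted with a short loop. This sketch only reports how often each message appears; it does not try to infer a cause. The function name `scan_ocssd_log` is hypothetical, and the log path is passed in because its location depends on the installation.

```shell
# Print, for each message quoted in the symptoms above, how many lines
# of the given ocssd.log contain it.
scan_ocssd_log() {
    log=$1
    for pattern in \
        'CLSGPNP_CALL_AGAIN' \
        'has a disk HB, but no network HB' \
        'Aborting local node to avoid splitbrain' \
        'Received abortive shutdown request from client'
    do
        # grep -c prints 0 and returns non-zero when nothing matches
        count=$(grep -cF -- "$pattern" "$log" || true)
        echo "$count $pattern"
    done
}
```

In 11.2 the log is typically found under $GRID_HOME/log/ in a directory named after the host, in the cssd subdirectory; treat that location as an assumption and check your own installation.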

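Possible causes:

1. The voting disk is missing or inaccessible
2. Multicast is not working properly (for 11.2.0.2 and later)
3. The private network is not working; ping or traceroute shows that the destination is unreachable. Or ping/traceroute work, but a firewall is enabled on the private network
4. A normal ping works on the private network, but with jumbo frames enabled (MTU: 9000+), a ping with a jumbo-sized packet (e.g. ping -s 8900 ) fails. Or some cluster nodes have jumbo frames set (MTU: 9000) while the problem node does not (MTU: 1500)
5. gpnpd does not come up and is stuck in the dispatch thread, Bug 10105195
6. Too many disks are discovered through asm_diskstring, or scanning is too slow because of Bug 13454354 (occurs only on Solaris 11.2.0.3)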

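Solutions:

1. Restore access to the voting disk by checking storage accessibility, disk permissions, etc.
   If the voting disk in the OCR ASM disk group has been lost, start CRS in exclusive mode and recreate the voting disk:
   # crsctl start crs -excl
   # crsctl replace votedisk <+OCRVOTE diskgroup>
2. Refer to Document 1212703.1 for how to test and fix multicast
3. Consult the network administrator to restore private network access or disable the firewall on the private network (on Linux, check "service iptables status" and "service ip6tables status")
4. If jumbo frames are enabled on the network interface, contact the network administrator to enable them at the switch layer as well.
5. Kill the gpnpd.bin process on the node that is working normally, refer to Document 10105195.8
   Once the above issues are resolved, restart Grid Infrastructure.
   If ping/traceroute both work on the private network but the problem occurs during an upgrade from 11.2.0.1 to 11.2.0.2, check
   Bug 13416559 for a workaround.
6. Limit the number of disks that ASM scans by providing a more specific asm_diskstring, refer to bug 13583387
   For Solaris 11.2.0.3, apply patch 13250497, see Document 1451367.1.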
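
For the jumbo frame cause (cause 4), two small helpers make the check concrete. The first computes the largest ping payload that fits in one unfragmented IPv4 packet for a given MTU; the second compares MTU values collected from the nodes. Both function names and the "node mtu" input format are assumptions made for this sketch.

```shell
# Largest ICMP payload that fits in one unfragmented IPv4 packet:
# MTU minus 20 bytes IPv4 header (no options) minus 8 bytes ICMP header.
max_ping_payload() {
    echo $(( $1 - 28 ))
}

# stdin: one "node mtu" pair per line (hypothetical format). Prints each
# node whose MTU differs from the first line; fails if any mismatch exists.
mtu_consistent() {
    awk '
        NR == 1     { first = $2 }
        $2 != first { bad = 1; print "mismatch: " $1 " has MTU " $2 ", expected " first }
        END         { exit bad + 0 }'
}
```

With an MTU of 9000 the first helper yields 8972, so the 8900-byte ping mentioned above fits within a jumbo frame. On Linux, the iputils ping option "-M do" prohibits fragmentation, which makes a failed jumbo-sized ping visible instead of silently fragmented; other platforms use different options.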

Issue 3: CRS-4535: Cannot communicate with Cluster Ready Services; crsd.bin is not running

Symptoms:

1. The command "$GRID_HOME/bin/crsctl check crs" returns the errors:
    CRS-4638: Oracle High Availability Services is online
    CRS-4535: Cannot communicate with Cluster Ready Services
    CRS-4529: Cluster Synchronization Services is online
    CRS-4534: Cannot communicate with Event Manager
2. The command "ps -ef | grep d.bin" does not show a line similar to:
    root 23017 1 1 22:34 ? 00:00:00 /u01/app/11.2.0/grid/bin/crsd.bin reboot
3. Even though the crsd.bin process exists, the command "crsctl stat res -t -init" still shows:
    ora.crsd
        1    ONLINE     INTERMEDIATE
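
Resources in the state shown in symptom 3 can be picked out of the "crsctl stat res -t -init" output with a small filter. This sketch assumes the tabular layout shown above, where a resource name line is followed by a line holding the instance number, target and state; the function name `not_online_resources` is hypothetical.

```shell
# Read "crsctl stat res -t -init" output on stdin and print every
# resource whose target is ONLINE but whose state is something else.
not_online_resources() {
    awk '
        /^ora\./ { name = $1; next }
        name != "" && $1 ~ /^[0-9]+$/ && $2 == "ONLINE" && $3 != "ONLINE" {
            print name ": target " $2 ", state " $3
        }'
}
```

Run it as: $GRID_HOME/bin/crsctl stat res -t -init | not_online_resources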

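Possible causes:

1. ocssd.bin is not running, or the resource ora.cssd is not online
2. The +ASM instance cannot start
3. The OCR is inaccessible
4. The network configuration has changed, causing a mismatch with the gpnp profile.xml
5. The $GRID_HOME/crs/init/.pid file for crsd has been manually deleted or renamed; crsd.log shows: "Error3 -2 writing PID to the file"
6. The content of ocr.loc does not match that of the other cluster nodes; crsd.log shows: "Shutdown CacheLocal. my hash ids don't match"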

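Solutions:

1. Check the solutions for Issue 2 and make sure that ocssd.bin is running and ora.cssd is online
2. For 11.2.0.2 and later, make sure the resource ora.cluster_interconnect.haip is online; refer to Document 1383737.1 for HAIP-related problems that prevent ASM from starting.
3. Make sure the OCR disk is available and accessible. If the OCR has been lost for some reason, refer to Document 1062983.1 for how to restore the OCR.
4. Restore the network configuration so that it matches the interfaces defined in $GRID_HOME/gpnp//profiles/peer/profile.xml; refer to
   Document 283684.1 for how to modify the private network configuration.
5. Use the touch command to create a file named .pid in the $GRID_HOME/crs/init directory.
   For 11.2.0.1, the file is owned by the user.
   For 11.2.0.2, the file is owned by the root user.
6. Use the ocrconfig command to correct the content of ocr.loc:
   For example, as the root user:
   # ocrconfig -repair -add +OCR2 (to add an entry)
   # ocrconfig -repair -delete +OCR2 (to delete an entry)
   The above commands require ohasd.bin to be up and running.

Once the above issues are resolved, restart GI or start crsd.bin with the following command:
   # crsctl start res ora.crsd -init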
Issue 4: An agent or mdnsd.bin, gpnpd.bin, gipcd.bin is not running

Symptoms:

1. orarootagent is not running. ohasd.log shows:
2012-12-21 02:14:05.071: [ AGFW][24] {0:0:2} Created alert : (:CRSAGF00123:) : Failed to start the agent process: /grid/11.2.0/grid_2/bin/orarootagent Category: -1 Operation: fail Loc: canexec2 OS error: 0 Other : no exe permission, file [/grid/11.2.0/grid_2/bin/orarootagent]
2. mdnsd.bin, gpnpd.bin or gipcd.bin is not running; the following is an example from the mdnsd log:
2012-12-31 21:37:27.601: [ clsdmt][1088776512]Creating PID [4526] file for home /u01/app/11.2.0/grid host lc1n1 bin mdns to /u01/app/11.2.0/grid/mdns/init/
2012-12-31 21:37:27.602: [ clsdmt][1088776512]Error3 -2 writing PID [4526] to the file []
2012-12-31 21:37:27.602: [ clsdmt][1088776512]Failed to record pid for MDNSD
or
2012-12-31 21:39:52.656: [ clsdmt][1099217216]Creating PID [4645] file for home /u01/app/11.2.0/grid host lc1n1 bin mdns to /u01/app/11.2.0/grid/mdns/init/
2012-12-31 21:39:52.656: [ clsdmt][1099217216]Writing PID [4645] to the file [/u01/app/11.2.0/grid/mdns/init/lc1n1.pid]
2012-12-31 21:39:52.656: [ clsdmt][1099217216]Failed to record pid for MDNSD
3. oraagent or appagent is not running; crsd.log shows:
2012-12-01 00:06:24.462: [ AGFW][1164069184] {0:2:27} Created alert : (:CRSAGF00130:) : Failed to start the agent /u01/app/grid/11.2.0/bin/appagent_oracle

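Possible causes:

1. orarootagent lacks execute permission
2. The .pid file for the process is missing, or its ownership/permissions are wrong
3. The ownership/permissions of GRID_HOME are wrong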

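Solutions:

1. Compare the ownership/permissions with those of a good GRID_HOME and correct them accordingly, or run the following as the root user:
   # cd /crs/install
   # ./rootcrs.pl -unlock
   # ./rootcrs.pl -patch
   This stops the clusterware, sets the ownership/permissions of the required files to the root user, and restarts the clusterware.
2. If the corresponding .pid file does not exist, create it with the touch command and give it the proper ownership/permissions; otherwise correct the ownership/permissions of the .pid file as required, then restart the clusterware.
   The following .pid files are owned by root:root with permission 644:
   ./ologgerd/init/.pid
   ./osysmond/init/.pid
   ./ctss/init/.pid
   ./ohasd/init/.pid
   ./crs/init/.pid
   Owned by :oinstall, permission 644:
   ./mdns/init/.pid
   ./evm/init/.pid
   ./gipc/init/.pid
   ./gpnp/init/.pid

3. For cause 3, refer to solution 1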
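
The ownership and permission check in solution 2 can be scripted per file. This is a minimal sketch using only POSIX tools; the function name `check_pid_file` and its output wording are hypothetical, and the caller supplies the full path, the expected owner:group and the expected mode.

```shell
# check_pid_file PATH OWNER:GROUP MODE
# Prints OK, MISSING, OWNER or MODE for the file and fails unless OK.
check_pid_file() {
    file=$1 want_owner=$2 want_mode=$3
    if [ ! -e "$file" ]; then
        echo "MISSING $file"
        return 1
    fi
    # Fields 3 and 4 of "ls -ld" are the owner and group names
    owner=$(ls -ld "$file" | awk '{ print $3 ":" $4 }')
    if [ "$owner" != "$want_owner" ]; then
        echo "OWNER $file is $owner, expected $want_owner"
        return 1
    fi
    # find -perm with a plain octal mode matches the permission bits exactly
    if [ -z "$(find "$file" -prune -perm "$want_mode")" ]; then
        echo "MODE $file is not $want_mode"
        return 1
    fi
    echo "OK $file"
}
```

Called once per file from the lists in solution 2, with root:root and 644 for the first group, this reports which files need touch, chown or chmod before the clusterware is restarted.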

Issue 5: The ASM instance does not start, ora.asm is not online

Symptoms:

1. The command "ps -ef | grep asm" shows no ASM processes
2. The command "crsctl stat res -t -init" shows:
    ora.asm
        1    ONLINE     OFFLINE
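
Symptom 1 can be made a little more precise than a plain grep by looking for the pmon background process of the ASM instance. The sketch assumes the usual naming of that process, asm_pmon_ followed by the instance name (for example asm_pmon_+ASM1); the function name `running_asm_instances` is hypothetical.

```shell
# Read "ps -ef" output on stdin and print the name of every ASM instance
# whose pmon background process is present. No output means no ASM
# instance is running on this node.
running_asm_instances() {
    awk '{
        for (i = 1; i <= NF; i++)
            if ($i ~ /^asm_pmon_/) print substr($i, 10)
    }'
}
```

Run it as: ps -ef | running_asm_instances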

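Possible causes:

1. The ASM spfile is corrupted
2. The ASM discovery string is incorrect, so the voting disk/OCR cannot be discovered
3. An ASMlib configuration problem
4. The ASM instances use different cluster_interconnect settings; HAIP being OFFLINE on the first node prevents the ASM instance on the second node from starting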

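Solutions:

1. Create a temporary pfile to start the ASM instance, then recreate the spfile; refer to Document 1095214.1 for more details.
2. Refer to Document 1077094.1 to correct the ASM discovery string.
3. Refer to Document 1050164.1 to fix the ASMlib configuration.
4. Refer to Document 1383737.1 for the solution. Refer to Document 1210883.1 for more information about HAIP.

To further debug GI startup issues, refer to Document 1050908.1 Troubleshoot Grid Infrastructure Startup Issues.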

qyq88888
