AIX 10g RAC alert log reports "LMS 0: 8069 GCS shadows traversed, 4001 replayed"

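Today node 1 of an AIX 10g RAC database reported "LMS 0: 8069 GCS shadows traversed, 4001 replayed" in its alert log. The cause was node 2 restarting.

Afterwards I looked up some material online: changing the system time can also produce the message above and cause the machine to reboot.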

 

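Reposted below is an article by kamus from itpub:

Apart from Windows and Linux, does changing the operating system time on RAC 10.2.0.2 and later always cause the operating system to reboot?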

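While testing Oracle 10.2.0.3 RAC, I found that if a node's system time is changed by more than 1.5 seconds, that node is automatically rebooted.

What a harsh way to handle it ......

For the detailed mechanism, see the Internal Only Metalink Note 308051.1.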

The OPROCD executable sets a signal handler for the SIGALRM handler and sets the interval timer based on the to-millisec parameter provided.  The alarm handler gets the current time and checks it against the time that the alarm handler was last entered.  If the difference exceeds (to-millisec + margin-millisec), it will fail; the production version will cause a node reboot. 
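The check described in the note can be modelled in a few lines of Python. This is only an illustrative sketch, not Oracle's code: the class name, the method names and the 1000 ms / 500 ms values are assumptions chosen so that the window adds up to the 1.5 seconds observed in the test. It also assumes the handler reads the wall clock, which would explain why a jump in system time looks the same as a long scheduling stall.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class WatchdogCheck:
    """Hypothetical model of the timing check quoted above (not Oracle code)."""
    timeout_ms: int            # plays the role of the to-millisec parameter
    margin_ms: int             # plays the role of the margin-millisec parameter
    last_entered_ms: int = 0   # wall-clock time at which the handler last ran

    def on_alarm(self, now_ms: int) -> bool:
        """Return True when the gap since the last alarm exceeds the window."""
        elapsed = now_ms - self.last_entered_ms
        self.last_entered_ms = now_ms
        return elapsed > self.timeout_ms + self.margin_ms


def simulate(check: WatchdogCheck, alarm_times_ms: List[int]) -> List[bool]:
    """Feed a sequence of wall-clock readings to the handler, one per alarm."""
    return [check.on_alarm(t) for t in alarm_times_ms]


if __name__ == "__main__":
    # Assumed values: 1000 ms timer + 500 ms margin = 1.5 s window.
    check = WatchdogCheck(timeout_ms=1000, margin_ms=500)
    # Two normal alarms, then the clock is set 5 s ahead before the third.
    print(simulate(check, [1000, 2000, 8000]))
```

In this model the third alarm sees an apparent gap of 6000 ms, far beyond the 1500 ms window, so the check fails even though the process was never actually starved of CPU.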

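I tried changing the OPROCD configuration in /etc/init.cssd, setting DISABLE_OPROCD to TRUE, and then restarted the system. The oprocd process no longer appeared in the process list, yet after I changed the system time the machine was still rebooted.

Elsewhere the document says that if OPROCD is started in non fatal mode, it only writes a log entry and does not reboot the machine. Note:265769.1 also describes how to switch to non fatal mode, but I did not try it.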

In fatal mode, OPROCD will reboot the node if it detects excessive wait. In Non Fatal mode, it will write an error message out to the file <hostname>.oprocd.log in one of the following directories.

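What finally worked was disabling the cssd process entirely, which avoids the reboot triggered by changing the system time.

Over this period I have found that Oracle 10g CRS really is rather heavy-handed. In the previous test, unplugging the cable from the Private IP network card made the operating system reboot, and this time changing the system time does the same. Does it really take these machines for Windows boxes? Rebooting a UNIX server is a big deal, yet CRS treats it as casually as having a meal and reboots at the slightest provocation.

The following material describes the conditions under which the three Oracle CRS processes reboot the machine.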

Oracle clusterware has the following three daemons which may be responsible for panicking the node. It is possible that some other external entity may have rebooted the node. In the context of this discussion, we will assume that the reboot/panic was done by an Oracle clusterware daemon.

* Oprocd  - Cluster fencing module
* Cssd - Cluster synchronization module which manages node membership
* Oclsomon - Cssd monitor which will monitor for cssd hangs

OPROCD: This is a daemon that only gets activated when there is no vendor clusterware present on the OS. This daemon is also not activated to run on Windows/Linux. This daemon runs a tight loop and if it is not scheduled for 1.5 seconds, will reboot the node.
CSSD: This daemon pings the other members of the cluster over the private network and Voting disk. If this does not get a response for Misscount seconds and Disktimeout seconds respectively, it will reboot the node.
Oclsomon: This daemon monitors the CSSD to ensure that CSSD is scheduled by the OS, if it detects any problems it will reboot the node.
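
The CSSD rule above can be written out as a small decision function. This is a rough sketch under stated assumptions, not how CSSD is implemented: the function name is hypothetical, the thresholds of 30 and 200 seconds are placeholders rather than documented defaults, and treating "no response for N seconds" as a greater-than-or-equal comparison is my own reading of the sentence.

```python
from typing import List


def cssd_reboot_reasons(
    network_silence_s: float,
    voting_disk_silence_s: float,
    misscount_s: float,
    disktimeout_s: float,
) -> List[str]:
    """Return the reasons (if any) for which this model would reboot the node.

    network_silence_s: seconds since the last reply over the private network
    voting_disk_silence_s: seconds since the last reply via the voting disk
    """
    reasons = []
    if network_silence_s >= misscount_s:
        reasons.append("no network response for misscount seconds")
    if voting_disk_silence_s >= disktimeout_s:
        reasons.append("no voting disk response for disktimeout seconds")
    return reasons


if __name__ == "__main__":
    # Placeholder thresholds, chosen only for illustration.
    MISSCOUNT_S = 30
    DISKTIMEOUT_S = 200
    # Private network cable unplugged for 45 s, voting disk still healthy.
    print(cssd_reboot_reasons(45, 2, MISSCOUNT_S, DISKTIMEOUT_S))
```

Under these assumed thresholds, the unplugged private network cable from the earlier test would exceed the misscount window on its own, independently of the voting disk.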

I need to find a way to disable these reboot features. A reboot does not solve the problem anyway, so why is CRS fussing over it.

 
