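In the early morning the database alert log reported the following errors, and the LMON process terminated the database instance: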

Wed Apr 2 00:41:22 2014
Errors in file /u01/oracle/admin/epqdb/bdump/epqdb1_lmon_19974.trc:
ORA-00481: LMON process terminated with error
Wed Apr 2 00:41:22 2014
LMON: terminating instance due to error 481
Wed Apr 2 00:41:22 2014
Errors in file /u01/oracle/admin/epqdb/bdump/epqdb1_lmd0_19976.trc:
ORA-00481: LMON process terminated with error
Wed Apr 2 00:41:22 2014
Errors in file /u01/oracle/admin/epqdb/bdump/epqdb1_lms1_19980.trc:
ORA-00481: LMON process terminated with error
Wed Apr 2 00:41:22 2014
Errors in file /u01/oracle/admin/epqdb/bdump/epqdb1_lms0_19978.trc:
ORA-00481: LMON process terminated with error
Wed Apr 2 00:41:22 2014
Errors in file /u01/oracle/admin/epqdb/bdump/epqdb1_pmon_19968.trc:
ORA-00481: LMON process terminated with error
Wed Apr 2 00:41:22 2014
System state dump is made for local instance
System State dumped to trace file /u01/oracle/admin/epqdb/bdump/epqdb1_diag_19970.trc
Wed Apr 2 00:41:27 2014
Instance terminated by LMON, pid = 19974
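To see at a glance which background processes logged the error, the alert log excerpt above can be scanned with a short script. This is only a sketch: the function name `processes_reporting` is made up for illustration, and the regular expression assumes trace files are named `<instance>_<process>_<pid>.trc`, as they are in this excerpt.

```python
import re

# Matches lines such as
#   "Errors in file /u01/.../epqdb1_lmon_19974.trc:"
# The space after "file" is optional because some copies of the log lose it.
# Assumption: trace files are named <instance>_<process>_<pid>.trc.
TRACE_RE = re.compile(r"Errors in file\s*\S*/(\w+)_([a-z0-9]+)_(\d+)\.trc")


def processes_reporting(alert_text, ora_code="ORA-00481"):
    """Return {process name: OS pid} for trace files followed by ora_code."""
    found = {}
    pending = None  # (process, pid) taken from the latest "Errors in file" line
    for line in alert_text.splitlines():
        match = TRACE_RE.search(line)
        if match:
            pending = (match.group(2), int(match.group(3)))
        elif pending and ora_code in line:
            found[pending[0]] = pending[1]
            pending = None
    return found
```

Run against the excerpt above, this would list lmon, lmd0, lms1, lms0 and pmon, which matches the picture of LMON failing first and the other background processes reporting the same error.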

Analysis:

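LMON is one of the most critical background processes in RAC. It is mainly responsible for health checks between cluster members and for reconfiguring the global resources managed by GES and GCS. When LMON hits an exception, it triggers Oracle-level I/O fencing.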

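LMON relies mainly on two heartbeat mechanisms to perform its health checks:

1) Network heartbeat between nodes: a ping packet is sent to each node at regular intervals to probe its state. If a reply arrives within the allowed time, the peer is considered healthy.

2) Disk heartbeat through the control file (controlfile heartbeat): the CKPT process of each instance updates one block of the control file every 3 seconds. This block is called the Checkpoint Progress Record. Because the control file is shared, the instances can check whether each other's record is being updated on time.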
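The idea behind the controlfile heartbeat can be pictured with a toy check like the one below. This is not how Oracle implements it; the function name, the `missed_allowed` threshold and the second instance name `epqdb2` are all assumptions made for illustration. Only the 3-second update interval comes from the description above.

```python
def stale_instances(last_update, now, interval=3.0, missed_allowed=10):
    """Return instances whose heartbeat record looks stale.

    last_update    -- {instance name: time of last record update, in seconds}
    now            -- current time, in the same unit
    interval       -- expected update interval (3 seconds per the text above)
    missed_allowed -- hypothetical number of missed updates that is tolerated
    """
    limit = interval * missed_allowed
    return sorted(name for name, ts in last_update.items() if now - ts > limit)


# epqdb1 updated its record 3 seconds ago, the hypothetical epqdb2 43 seconds ago.
print(stale_instances({"epqdb1": 100.0, "epqdb2": 60.0}, now=103.0))
```

With the default threshold of 30 seconds, only `epqdb2` is reported as stale.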

The trace files of lmd, lms and the other background processes all contain the following information:

*** 2014-04-02 00:41:09.821
Received ORADEBUG command 'IPC' from process Unix process pid: 19974, image:
Dump of unix-generic skgm context
……
ksxpdmp: facility 2 (SKGFAIPC) (0x1,0x0000000000000000) counts 0, 0
ksxpdmp: Dumping the osd context
SKGXPCTX: 0x80000001001f4b28 ctx
WAIT HISTORY
Time(msec)  Wait Type  Return Code
----------  ---------  ------------
        30  NORMAL     TIMEDOUT
         1  NORMAL     SUCC
         0  NORMAL     TIMEDOUT
        29  NORMAL     TIMEDOUT
        20  NORMAL     TIMEDOUT
        30  NORMAL     TIMEDOUT
        14  NORMAL     SUCC
         0  NORMAL     TIMEDOUT
        36  NORMAL     TIMEDOUT
        30  NORMAL     TIMEDOUT
        23  NORMAL     TIMEDOUT
        20  NORMAL     TIMEDOUT
        30  NORMAL     TIMEDOUT

-- 30 is the default CSS timeout value

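This information was written before the crash, which suggests that the network had a problem at that time and that the problem affected LMON.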
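How heavily the waits were timing out can be counted from the WAIT HISTORY rows. The sketch below assumes each row has the shape `<msec> <wait type> <return code>` and that only the two return codes seen in this excerpt (`TIMEDOUT` and `SUCC`) occur; `summarize_wait_history` is an illustrative name, not an Oracle tool.

```python
import re

# One WAIT HISTORY row: "<msec> <wait type> <return code>".
# Assumption: only the return codes seen in this trace are matched.
ROW_RE = re.compile(r"^\s*(\d+)\s+(\w+)\s+(TIMEDOUT|SUCC)\s*$")


def summarize_wait_history(trace_text):
    """Count timed-out and successful waits in a WAIT HISTORY dump."""
    matches = (ROW_RE.match(line) for line in trace_text.splitlines())
    codes = [m.group(3) for m in matches if m]
    timed_out = codes.count("TIMEDOUT")
    return {
        "total": len(codes),
        "timed_out": timed_out,
        "succeeded": len(codes) - timed_out,
    }
```

For the 13 rows shown above this gives 11 timed-out waits and 2 successful ones, which is consistent with an interconnect problem shortly before the crash.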

The LMON trace file also shows that a DRM operation was in progress and failed at step 31:

*** 2014-04-02 00:08:09.995
Begin DRM(1053)
*** 2014-04-02 00:10:31.161
sent syncr inc 12 lvl 4553 to 0 (12,0/31/0)
sent synca inc 12 lvl 4553 (12,0/31/0)
sent syncr inc 12 lvl 4554 to 0 (12,0/34/0)
……
*** 2014-04-02 00:41:04.026
kjfcdrmrfg: SYNC TIMEOUT (864483, 863582, 900), step 31
Submitting asynchronized dump request [28]
KJC Communication Dump:
state 0x5 flags 0x0 mode 0x0 inst 0 inc 12
nrcv 3 nsp 3 nrcvbuf 1000
reg_msg: sz 456 cur 84 (s:0 i:84) max 1526 ini 2750
big_msg: sz 8240 cur 21 (s:0 i:21) max 252 ini 1934
rsv_msg: sz 8240 cur 0 (s:0 i:0) max 0 tot 1000
rcvr: id 2 orapid 8 ospid 19980
rcvr: id 1 orapid 7 ospid 19978
rcvr: id 0 orapid 6 ospid 19976
send proxy: id 2 ndst 1 (1:2 )
send proxy: id 1 ndst 2 (1:1 1:3 )
send proxy: id 0 ndst 1 (1:0 )
GES resource limits:
ges resources: cur 0 max 0 ini 21522
ges enqueues: cur 0 max 0 ini 33158
ges cresources: cur 3193 max 4007
gcs resources: cur 433506 max 747472 ini 952310
gcs shadows: cur 733200 max 868765 ini 952310
KJCTS state: seq-check:no timeout:yes waitticks:0x3 highload no
…
error 481 detected in background process
ORA-00481: LMON process terminated with error
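The `SYNC TIMEOUT` line can be pulled apart with a small parser. The sketch below reads the three numbers in parentheses as (current tick, start tick, limit); that reading is an assumption, as is the unit, although it fits the numbers in this trace, where the difference of the first two values just exceeds the third. The function name `parse_sync_timeout` is made up for illustration.

```python
import re

SYNC_RE = re.compile(
    r"kjfcdrmrfg: SYNC TIMEOUT \((\d+),\s*(\d+),\s*(\d+)\), step (\d+)"
)


def parse_sync_timeout(line):
    """Extract elapsed ticks, limit and DRM step from a SYNC TIMEOUT line.

    Assumption: the tuple is (current tick, start tick, limit).
    Returns None if the line is not a SYNC TIMEOUT message.
    """
    match = SYNC_RE.search(line)
    if match is None:
        return None
    current, start, limit, step = (int(g) for g in match.groups())
    return {"elapsed": current - start, "limit": limit, "step": step}


print(parse_sync_timeout("kjfcdrmrfg: SYNC TIMEOUT (864483, 863582, 900), step 31"))
# {'elapsed': 901, 'limit': 900, 'step': 31}
```

Under that reading, the sync at step 31 had been waiting 901 ticks against a limit of 900 when LMON gave up.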

The trace shows that the DRM operation failed at step 31. A search on support.oracle.com for related information returns the following explanation:

Bug 6500033 LMON crash the instance with ORA-481 due to DRM sync timeout

LMON can crash the instance with ORA-481 due to DRM sync timeout.

DIAG dumping Systemstate dump may be aborted due to log file size limit while in server mode which can cause a DRM sync timeout when lmon unsuccessfully tries to freeze it.

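Conclusion:

From the information above, a DRM operation that began at 0:08 ran into a sync timeout at 0:41. The timeout caused LMON to fail, and LMON then terminated the database instance.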

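Solution:

Disable DRM.

It is also worth checking the OS for related error messages, which may help pin down the root cause more precisely.

DRM can be disabled with the following two hidden parameters: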

1. _gc_affinity_time=0

2. _gc_undo_affinity=FALSE

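However, both of these are static parameters, so the instance must be restarted before they take effect.
In practice, two other hidden parameters, which are dynamic, can be set to achieve the same goal.

Setting these two parameters to the values below does not strictly disable DRM, but it switches DRM off "in effect".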

1. _gc_affinity_limit=250

2. _gc_affinity_minimum=10485760

These two values can even be set higher. Both parameters take effect immediately, and once they are set on all nodes the system no longer performs DRM.
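The statements for both approaches can be generated and reviewed before anything is run. The helper below is a hypothetical sketch: `alter_system` is not an Oracle utility, `SID='*'` assumes the setting should apply to every instance, and `SCOPE=BOTH` assumes the instances were started from an spfile. Hidden parameters must be enclosed in double quotes in `ALTER SYSTEM`.

```python
# Assumed template; hidden (underscore) parameters need double quotes.
TEMPLATE = "ALTER SYSTEM SET \"{name}\"={value} SCOPE={scope} SID='{sid}';"


def alter_system(name, value, scope, sid="*"):
    """Render an ALTER SYSTEM statement for review; nothing is executed."""
    if scope not in ("MEMORY", "SPFILE", "BOTH"):
        raise ValueError("scope must be MEMORY, SPFILE or BOTH")
    return TEMPLATE.format(name=name, value=value, scope=scope, sid=sid)


# Static parameters: written to the spfile, effective after a restart.
static = [
    alter_system("_gc_affinity_time", 0, "SPFILE"),
    alter_system("_gc_undo_affinity", "FALSE", "SPFILE"),
]

# Dynamic parameters: assumed to be changeable on the running instances.
dynamic = [
    alter_system("_gc_affinity_limit", 250, "BOTH"),
    alter_system("_gc_affinity_minimum", 10485760, "BOTH"),
]

for statement in static + dynamic:
    print(statement)
```

The first statement printed is `ALTER SYSTEM SET "_gc_affinity_time"=0 SCOPE=SPFILE SID='*';`. Keeping the generation separate from execution leaves room to check each statement before it is run in SQL*Plus.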

Sharqueen


