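Symptom Description

In a database no-response (hang) failure, the instance typically stops answering SQL requests from clients: a client submits a SQL statement and then waits indefinitely for the instance to return a result. In the most severe cases, clients cannot connect to the database at all, and even a local sqlplus / as sysdba session on the database host cannot get in; the connection request simply waits forever.

Oracle databases usually carry core business workloads, so an unresponsive database must be detected promptly and handled as an emergency.

Databases are normally covered by monitoring, so the first sign is usually an alert from the monitoring system.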

Hang Failure Analysis Cases

For collecting diagnostic information while the database is hung, see:

How to Collect Diagnostics for Database Hanging Issues (Doc ID 452358.1)

For troubleshooting methods, see:

Troubleshooting Database Hang Issues (Doc ID 1378583.1)

How to Investigate Slow or Hanging Database Performance Issues (Doc ID 1362329.1)

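The Entire Database Instance Hangs and Needs a Restart

This scenario has a very large impact. Every application on the instance is severely affected, and the instance often has to be restarted to restore service quickly, before the root cause has been found and fixed.

Sometimes service returns to normal after the restart, but the data needed for analysis disappears with it. For example, when severe shared pool fragmentation raises ORA-4031, the shared pool state is lost on restart and the root cause can no longer be determined.

Analyzing the Cause of the Hang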

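1) Check the database alert log

Single instance: alert_<sid>.log

RAC: alert_<sid>.log, plus the clusterware alert<hostname>.log

If nothing obviously abnormal shows up, move on to the next check.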

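2) Test whether the database accepts connections

Local: sqlplus / as sysdba

Remote: sqlplus system/pass@ip:1521/service_name

3) If the connection fails, check whether the operating system itself is in trouble

Check the operating system:

Review user CPU and sys CPU in the Zabbix monitoring

top

vmstat 1 10

iostat -mx 1

Check for heavy swapping: free -g

Check the operating system logs:

AIX: errpt

Linux: less /var/log/messages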
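Steps 1) to 3) lend themselves to a small script. The following is a minimal sketch rather than a finished tool: the function names are invented for illustration, the alert log path has to be supplied by the caller, and timeout is assumed to be the GNU coreutils command found on Linux.

```shell
# Triage helpers for steps 1) to 3). All names here are hypothetical.

# Print the last N alert log lines that carry an ORA- error code,
# each prefixed with its line number in the file.
# $1 = alert log path, $2 = N (default 20)
recent_ora_errors() {
    grep -nE 'ORA-[0-9]+' "$1" | tail -n "${2:-20}"
}

# Read `free -k` output on stdin and print the used swap in KiB.
swap_used_kib() {
    awk '$1 == "Swap:" { print $3 }'
}

# Run a probe command but give up after $1 seconds, so that a hung
# instance cannot hang the check itself. Exit status 124 means the
# probe timed out (GNU coreutils `timeout` convention).
probe_with_timeout() {
    local secs=$1
    shift
    timeout "$secs" "$@"
}
```

Typical use would be free -k | swap_used_kib for the swap check, and probe_with_timeout 15 sqlplus -L -S / as sysdba </dev/null for the local connection test, where an exit status of 124 suggests the logon itself is hanging.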

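If this quick analysis turns up little, or the root cause cannot be determined quickly, or the cause is known but cannot be fixed on the spot, consider a restart to restore service urgently.

Before restarting, collect diagnostics first.

Collecting Hanganalyze and Systemstate Information

When the whole instance is hung, there is no way to log in to the database through a normal connection, which is why the commands below use a preliminary connection (sqlplus -prelim).

In a single-instance environment, collect Hanganalyze and Systemstate information as follows: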

Hanganalyze

sqlplus -prelim / as sysdba
oradebug setmypid
oradebug unlimit
oradebug hanganalyze 3
-- Wait one minute before getting the second hanganalyze
oradebug hanganalyze 3
oradebug tracefile_name
exit

Systemstate

sqlplus -prelim / as sysdba
oradebug setmypid
oradebug unlimit
oradebug dump systemstate 258
oradebug dump systemstate 258
oradebug tracefile_name
exit

In a RAC environment, collect Hanganalyze and Systemstate information as follows:

sqlplus -prelim / as sysdba
oradebug setmypid
oradebug unlimit
oradebug -g all hanganalyze 3
oradebug -g all hanganalyze 3
oradebug -g all dump systemstate 258
oradebug -g all dump systemstate 258
exit
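Typing these commands by hand during an outage is error-prone, so the sequences above can be generated by a script. The sketch below is one possible arrangement, not an Oracle-provided tool: oradebug_pass, collect_twice and the SLEEP_SECS variable are hypothetical names, and each pass runs in its own preliminary connection so that the shell can wait a minute between the two dumps.

```shell
# Print the oradebug commands for one diagnostic pass.
# $1 = scope: "single" or "rac"
# $2 = kind:  "hanganalyze" or "systemstate"
oradebug_pass() {
    local prefix=""
    # In RAC, "-g all" sends the dump request to every instance.
    [ "$1" = "rac" ] && prefix="-g all "
    echo "oradebug setmypid"
    echo "oradebug unlimit"
    case "$2" in
        hanganalyze) echo "oradebug ${prefix}hanganalyze 3" ;;
        systemstate) echo "oradebug ${prefix}dump systemstate 258" ;;
        *) echo "unknown dump kind: $2" >&2; return 1 ;;
    esac
    echo "oradebug tracefile_name"
    echo "exit"
}

# Take two passes of the same dump, SLEEP_SECS apart (default 60),
# feeding the generated commands to a preliminary connection.
collect_twice() {
    oradebug_pass "$1" "$2" | sqlplus -prelim / as sysdba
    sleep "${SLEEP_SECS:-60}"
    oradebug_pass "$1" "$2" | sqlplus -prelim / as sysdba
}
```

For example, collect_twice rac hanganalyze followed by collect_twice rac systemstate would reproduce the RAC sequence, with the trace file name printed at the end of each pass.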

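Restarting the Unresponsive Instance to Restore Service Quickly

sqlplus -prelim / as sysdba
shutdown immediate;
If necessary, kill the server processes that belong to remote client connections:
ps -ef|grep "LOCAL=NO"|grep -v grep|awk '{ print $2 }'|xargs kill -9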
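On a host that runs more than one instance, the pipeline above kills the remote sessions of every instance. A narrower variant, sketched here under the assumption that dedicated server processes are named oracle<SID> (LOCAL=NO), filters by SID first. The function name remote_server_pids and the SID ORCL used below are illustrative only.

```shell
# Read `ps -eo pid,args` output on stdin and print the PIDs of the
# dedicated server processes that serve remote clients of one instance.
# $1 = instance SID
remote_server_pids() {
    awk -v name="oracle$1" '$2 == name && $3 == "(LOCAL=NO)" { print $1 }'
}
```

Running ps -eo pid,args | remote_server_pids ORCL prints the candidate PIDs for review; appending | xargs kill -9 then acts only on that list.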

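Post-Incident Analysis of Hanganalyze and Systemstate Information

For Oracle databases, the alert log usually contains detailed messages. The Hang Analyze trace file goes further and lays out the blocking chains session by session, which is very helpful when analyzing the problem.

In my experience, analyzing a hang trace file usually ends at a bug, and the official advice is to apply a patch or upgrade. The procedures can be found in MOS documents; with Oracle's extensive knowledge base and bug database, a solution can be found in roughly 90% of cases.

RAC Database Hangs Because of an Archiving I/O Problem ('Log archive I/O'<='enq: DM - contention')

See MOS document RAC Database Hangs Trying To Archive ('Log archive I/O'<='enq: DM - contention'). (Doc ID 1565777.1)

Symptoms:

1) The RAC database hangs at regular intervals: every time a redo log is archived, the database stops responding.

2) If Hang Analyze was run against the database, the trace file shows the following: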

Chains most likely to have caused the hang:
 [a] Chain 1 Signature: 'Log archive I/O'<='enq: DM - contention'
     Chain 1 Signature Hash: 0x7abeb73c
.
.
.
        session serial #: 5    }
    which is waiting for 'Log archive I/O' with wait info:    {
                      p1: 'count'=0x1
                      p2: 'intr'=0x100
                      p3: 'timeout'=0xffffffff
            time in wait: 4.166054 sec
           timeout after: never
                 wait id: 5635
                blocking: 1 session
             current sql: ALTER DATABASE OPEN            .
.
.

              2.     event: 'Log archive I/O'
                   wait id: 5633            p1: 'count'=0x1
               time waited: 0.013025 sec    p2: 'intr'=0x100
                                            p3: 'timeout'=0xffffffff
              3.     event: 'Log archive I/O'
                   wait id: 5632            p1: 'count'=0x1
               time waited: 3.126933 sec    p2: 'intr'=0x100
                                            p3: 'timeout'=0xffffffff    }

Chain 1 Signature: 'Log archive I/O'<='enq: DM - contention'
Chain 1 Signature Hash: 0x7abeb73c
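Hanganalyze trace files can be long, and the chain signatures are usually the first thing to look at. The following sketch pulls them out and counts repeats; chain_signatures is a hypothetical helper name, and the trace file path is whatever oradebug tracefile_name reported.

```shell
# Print each distinct chain signature found in a hanganalyze trace,
# prefixed by its occurrence count and a tab, most frequent first.
# "Signature Hash" lines are skipped because the pattern requires the
# colon directly after the word "Signature".
# $1 = trace file path
chain_signatures() {
    grep -E 'Chain [0-9]+ Signature:' "$1" \
        | sed -E 's/^.*Chain [0-9]+ Signature: *//' \
        | sort | uniq -c | sort -rn \
        | awk '{ n = $1; sub(/^ *[0-9]+ +/, ""); print n "\t" $0 }'
}
```

A signature that appears in several chains or in repeated dumps rises to the top of the list, which makes it a reasonable search term for MOS.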

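Root Cause Analysis:

1) The RAC database uses two archive destinations

1.1) The first is in a shared ASM disk group (+RECO)

log_archive_dest_1 = "LOCATION=USE_DB_RECOVERY_FILE_DEST"

db_recovery_file_dest = "+RECO"

1.2) The second is on a NAS file system

log_archive_dest_2 = "LOCATION=/NAS1/arch01"

2) The archive destination in the ASM disk group has no problems

3) The secondary archive destination on the NAS/NFS file system hangs when it is accessed; for example, the operating system "df" command hangs on the NFS file system

4) In addition, the OS logs report several network problems on the network interfaces (including the switch used to reach the NAS).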
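A hung NFS mount like the one in point 3) can be detected without blocking the shell that runs the check. The sketch below assumes GNU coreutils timeout and a kernel where the blocked call can be killed; path_responds is a hypothetical name, and /NAS1/arch01 is simply the path from this case.

```shell
# Succeed only if a filesystem path answers a stat call within the
# time limit. A hung NFS mount typically blocks such calls, in which
# case `timeout` kills the probe and the function fails. A path that
# does not exist fails as well.
# $1 = path to check, $2 = timeout in seconds (default 5)
path_responds() {
    timeout -s KILL "${2:-5}" stat "$1" >/dev/null 2>&1
}
```

For example, path_responds /NAS1/arch01 10 || echo "archive destination not responding" could be run from monitoring on each RAC node.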

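Solution

1) Remove the NAS/NFS archive log destination from the affected database, keeping only the disk group archive log destination.

2) Also unmount the NAS file system from all RAC nodes until the NAS problem is resolved.

3) Once the NAS/NFS archive log destination has been removed, the database opens on both nodes and can archive redo log files again.

4) Later, fix the NAS problem and then add the NAS location back as the secondary archive destination.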
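Steps 1) and 4) both come down to changing log_archive_dest_2. The MOS note is not quoted here with exact statements, so the following is only a sketch of how the change might be expressed: it assumes the instances use an spfile (hence SCOPE=BOTH) and that the change should apply to all RAC instances (SID='*'). The helper name archive_dest_sql is hypothetical.

```shell
# Print an ALTER SYSTEM statement that clears an archive destination
# (no location given) or points it at a location.
# $1 = destination number, $2 = location path (optional)
archive_dest_sql() {
    local value=""
    [ -n "${2:-}" ] && value="LOCATION=$2"
    printf "ALTER SYSTEM SET log_archive_dest_%s='%s' SCOPE=BOTH SID='*';\n" "$1" "$value"
}
```

Here archive_dest_sql 2 prints the statement that clears the second destination, and archive_dest_sql 2 /NAS1/arch01 prints the one that restores it once the NAS is healthy. The output is meant to be reviewed and then run in SQL*Plus by a DBA.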

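ADG Standby Database Unresponsive (Many row cache enqueue lock Messages in the Log)

Reference:

Active Dataguard Standby Database Hang Due To ROW CACHE ENQUEUE LOCK (Doc ID 2586299.1)

Symptoms:

The ADG standby database hangs.
Any SQL statement executed on the ADG standby blocks and never returns.
The primary database is unaffected; SQL statements run fine there.
ROW CACHE ENQUEUE LOCK messages appear frequently in the standby's alert log.
The systemstate dump shows a deadlock detected on a row cache object under DC_OBJECTS.

Root Cause Analysis:

The Hanganalyze trace is as follows: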

Chains most likely to have caused the hang:
[a] Chain 1 Signature: 'row cache lock'<='cursor: pin S wait on X'
    Chain 1 Signature Hash: 0x6b385219
[b] Chain 2 Signature: 'row cache lock'
    Chain 2 Signature Hash: 0x95d00c11
[c] Chain 3 Signature: 'row cache lock'
    Chain 3 Signature Hash: 0x95d00c11

-------------------------------------------------------------------------------
Chain 1:
-------------------------------------------------------------------------------
   Oracle session identified by:
   {
               instance: 1 (xyz.xyz)
                  os id: 15615
             process id: 40, oracle@abc-dg (MMON)
             session id: 1121
       session serial #: 52846
   }
   is waiting for 'cursor: pin S wait on X' with wait info:
   {
...
   }
   and is blocked by
=> Oracle session identified by:
   {
               instance: 1 (xyz.xyz)
                  os id: 378537
             process id: 350, oracle@abc-dg (W001)
             session id: 397
       session serial #: 32479
   }
   which is waiting for 'row cache lock' with wait info:
   {
                     p1: 'cache id'=0x8
                     p2: 'mode'=0x0
                     p3: 'request'=0x3
           time in wait: 5247 min 32 sec (last interval)
           time in wait: 5747 min 47 sec (total)
          timeout after: never
                wait id: 707
               blocking: 1 session
            current sql:  select s.file#, s.block#, s.ts#, t.obj#, s.hwmincr, t.obj# 
            from tab$ t, seg$ s 
            where bitand(s.spare1, 4503599627370496) = 4503599627370496 
            and bitand(s.spare1, 65536) <> 65536  and s.file# = t.file# and s.ts# = t.ts# 
            and s.block# = t.block# 
            UNION  
            select s.file#, s.block#, s.ts#, t.obj#, s.hwmincr, tab.obj# from tabp

The blocking process has been waiting on "row cache lock" for a long time.
The blocker is running SQL statements against bootstrap objects such as TAB$, SEG$ and OBJ$, and is waiting for a row cache enqueue in DC_OBJECTS.

The Function Stack of Blocker:

ksedsts <- ksdxfstk <- ksdxcb <- sspuser <- __sighandler

<- semtimedop <- skgpwwait <- ksliwat <- kslwaitctx <- kqrget

<- kqrLockAndPinPo <- kqrpre1 <- kkdlSetTableVersion <- kkdlgstd <- kkmfcbloCbk

<- kkmpfcbk <- qcsprfro <- qcsprfro_tree <- qcsprfro_tree <- qcspafq

<- qcspqbDescendents <- qcspqb <- kkmdrv <- opiSem <- opiprs

<- kksParseChildCursor <- rpiswu2 <- kksLoadChild <- kxsGetRuntimeLock

This is caused by the following bug:

Bug 27716177 : ADG: ORA-04021:ORA-04024:ROW CACHE ENQUEUE AGAINST DC_OBJECTS:OBJ$

Duplicate of

Bug 28228168 : ORA-04024: SELF-DEADLOCK DETECTED WHILE TRYING TO MUTEX PIN CURSOR

Duplicate of

Unpublished Bug 28423598 : ROW CACHE ENQUEUE AGAINST DC_OBJECTS:OBJ$ on Active Data Guard

Document 28423598.8 ORA-4021: ORA-4024: ROW CACHE ENQUEUE AGAINST DC_OBJECTS:OBJ$ on Active Data Guard

Dataguard Standby database could freeze and wait on row cache enqueue whilst trying to apply a change to a bootstrap object (eg. OBJ$). Sometimes it can crash as well.

Solution:

Apply the latest RU for 12.2.0.1 (12.2.0.1.190716DBJUL2019RU) which includes fix for the Bug 28423598. Patch 29757449 for 12.2.0.1.190716 RU.

Apply One-off Patch for Bug 28423598.
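After patching, it is worth confirming that the fix is actually present in the Oracle home. One way, sketched here with a hypothetical helper name, is to search the bug list that OPatch reports; the sketch assumes the inventory listing has been produced with opatch lsinventory -bugs_fixed and is piped in on stdin.

```shell
# Read an OPatch inventory listing on stdin and succeed if the given
# bug number appears in it as a whole word.
# $1 = bug number
inventory_has_bug() {
    grep -qw "$1"
}
```

For example, opatch lsinventory -bugs_fixed | inventory_has_bug 28423598 && echo "fix present" would report whether the fix for Bug 28423598 is installed.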
