Oracle database hanganalyze

Why use hanganalyze

When an Oracle database is "truly" hung, it can be understood as a deadlock inside the database. For an ordinary DML deadlock, the Oracle server automatically detects the dependency between the sessions and rolls back one of the operations, breaking the mutual wait. When the deadlock arises from contention for kernel-level resources (such as pins or latches), however, Oracle cannot detect and resolve it automatically.
In many cases the database is not actually hung at all; it is simply a performance problem, so processing just takes a long time.
The hanganalyze utility uses internal kernel calls to determine which resource each session is waiting on and reports the blocker/waiter relationships. In addition, it dumps the state of the more "interesting" processes, depending on the hanganalyze level we choose.
hanganalyze has been available since Oracle 8i Release 2. In 9i it was enhanced to gather "cluster-wide" information in a RAC environment, meaning it reports on the sessions of every instance in the cluster.
There are currently three ways to invoke hanganalyze:

-- at the session level:
SQL>ALTER SESSION SET EVENTS 'immediate trace name HANGANALYZE level <level>';
-- at the instance level:
SQL>ORADEBUG hanganalyze <level>
-- cluster-wide (RAC):
SQL>ORADEBUG setmypid
SQL>ORADEBUG setinst all
SQL>ORADEBUG -g def hanganalyze <level>
The meaning of each level is as follows:
1-2: hanganalyze output only, no process dumps
3: Level 2 + dumps of processes in the IN_HANG state
4: Level 3 + dumps of blockers in the wait chains (states LEAF/LEAF_NW/IGN_DMP)
5: Level 4 + dumps of all processes in the wait chains (state NLEAF)
Oracle recommends not going beyond level 3; level 3 is usually enough to diagnose the problem, and higher levels put additional load on the system.
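In practice, when a hang is suspected, a typical interactive sequence looks like the sketch below (a minimal example assuming a single-instance database and SYSDBA access; oradebug unlimit only removes the trace file size limit, and taking a second snapshot a minute or so later helps distinguish a true hang from a system that is merely slow):

SQL> connect / as sysdba
SQL> oradebug setmypid
SQL> oradebug unlimit             -- remove the trace file size limit
SQL> oradebug hanganalyze 3
-- wait 60-90 seconds, then take a second snapshot for comparison
SQL> oradebug hanganalyze 3
SQL> oradebug tracefile_name      -- print the path of the trace file just written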
A hanganalyze experiment

1. Session 1 updates a row

SQL> connect scott/scott
Connected.
SQL> create table tb_hang(id number,remark varchar2(20));
Table created.
SQL> insert into tb_hang values(1,'test');
1 row created.
SQL> commit;
Commit complete.
SQL> select USERENV('sid') from dual;
USERENV('SID')
--------------
           146
SQL> update tb_hang set remark='hang' where id=1;
1 row updated.
-- do not commit at this point
2. Session 2 updates the same row that session 1 updated
SQL> select USERENV('sid') from dual;            
USERENV('SID')
--------------
           154
SQL>  update tb_hang set remark='hang' where id=1;
-- at this point session 2 hangs
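Before turning to hanganalyze, a straightforward row-lock wait like this can also be confirmed from a third session. A minimal check, assuming Oracle 10g where v$session exposes the blocking_session column:

SQL> select sid, blocking_session, event, seconds_in_wait
  2    from v$session
  3   where sid in (146, 154);

For the hung session 154 this should report blocking_session = 146 and the event 'enq: TX - row lock contention', matching what hanganalyze shows below.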
3. Session 3 runs hanganalyze to generate a trace file
SQL> connect / as sysdba
Connected.
SQL> oradebug hanganalyze 3;
Hang Analysis in /u01/app/oracle/admin/oracl/udump/oracl_ora_3941.trc
4. View the contents of the trace file
$ more /u01/app/oracle/admin/oracl/udump/oracl_ora_3941.trc
/u01/app/oracle/admin/oracl/udump/oracl_ora_3941.trc
Oracle Database 10g Enterprise Edition Release 10.2.0.1.0 - Production
With the Partitioning, OLAP and Data Mining options
ORACLE_HOME = /u01/app/oracle/product/10.2.0/db_1
System name:    Linux
Node name:      hxl
Release:        2.6.18-8.el5xen
Version:        #1 SMP Fri Jan 26 14:42:21 EST 2007
Machine:        i686
Instance name: oracl
Redo thread mounted by this instance: 1
Oracle process number: 21
Unix process pid: 3941, image: oracle@hxl (TNS V1-V3)
*** SERVICE NAME:(SYS$USERS) 2012-06-16 01:13:29.241
*** SESSION ID:(144.14) 2012-06-16 01:13:29.241
*** 2012-06-16 01:13:29.241
==============
HANG ANALYSIS:
==============
Open chains found:
Chain 1 : <cnode/sid/sess_srno/proc_ptr/ospid/wait_event> :
    <0/146/5/0x7861b254/3858/SQL*Net message from client>
 -- <0/154/5/0x7861c370/3903/enq: TX - row lock contention>
Other chains found:
Chain 2 : <cnode/sid/sess_srno/proc_ptr/ospid/wait_event> :
    <0/144/14/0x7861d48c/3941/No Wait>
Chain 3 : <cnode/sid/sess_srno/proc_ptr/ospid/wait_event> :
    <0/149/1/0x7861ced8/3806/Streams AQ: waiting for time man>
Chain 4 : <cnode/sid/sess_srno/proc_ptr/ospid/wait_event> :
    <0/151/1/0x7861c924/3804/Streams AQ: qmn coordinator idle>
Chain 5 : <cnode/sid/sess_srno/proc_ptr/ospid/wait_event> :
    <0/158/5/0x7861da40/3810/Streams AQ: qmn slave idle wait>
Extra information that will be dumped at higher levels:
[level  4] :   1 node dumps -- [REMOTE_WT] [LEAF] [LEAF_NW]
[level  5] :   4 node dumps -- [SINGLE_NODE] [SINGLE_NODE_NW] [IGN_DMP]
[level  6] :   1 node dumps -- [NLEAF]
[level 10] :  13 node dumps -- [IGN]
 
State of nodes
([nodenum]/cnode/sid/sess_srno/session/ospid/state/start/finish/[adjlist]/predecessor):
[143]/0/144/14/0x786fa3dc/3941/SINGLE_NODE_NW/1/2//none
[145]/0/146/5/0x786fc944/3858/LEAF/3/4//153
[148]/0/149/1/0x78700160/3806/SINGLE_NODE/5/6//none
[150]/0/151/1/0x787026c8/3804/SINGLE_NODE/7/8//none
[153]/0/154/5/0x78705ee4/3903/NLEAF/9/10/[145]/none
[154]/0/155/1/0x78707198/3797/IGN/11/12//none
[155]/0/156/1/0x7870844c/3799/IGN/13/14//none
[157]/0/158/5/0x7870a9b4/3810/SINGLE_NODE/15/16//none
[159]/0/160/1/0x7870cf1c/3782/IGN/17/18//none
[160]/0/161/1/0x7870e1d0/3784/IGN/19/20//none
[161]/0/162/1/0x7870f484/3788/IGN/21/22//none
[162]/0/163/1/0x78710738/3786/IGN/23/24//none
[163]/0/164/1/0x787119ec/3774/IGN/25/26//none
[164]/0/165/1/0x78712ca0/3780/IGN/27/28//none
[165]/0/166/1/0x78713f54/3778/IGN/29/30//none
[166]/0/167/1/0x78715208/3776/IGN/31/32//none
[167]/0/168/1/0x787164bc/3770/IGN/33/34//none
[168]/0/169/1/0x78717770/3772/IGN/35/36//none
[169]/0/170/1/0x78718a24/3768/IGN/37/38//none
====================
END OF HANG ANALYSIS
====================
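Each entry in a chain carries the OS pid of the server process, so the trace can be cross-checked against the live instance. A hypothetical lookup for the waiter in Chain 1 (ospid 3903):

SQL> select s.sid, s.serial#, s.username, s.event, p.spid
  2    from v$session s, v$process p
  3   where s.paddr = p.addr
  4     and p.spid = 3903;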

The sections of the trace file are interpreted as follows:

CYCLES: This section reports the process dependencies between sessions that are in a deadlock condition. Cycles are considered "true" hangs.

Cycle 1 : <sid/sess_srno/proc_ptr/ospid/wait_event> :
    <980/3887/0xe4214964/24065/latch free>
 -- <2518/352/0xe4216560/24574/latch free>
 -- <55/10/0xe41236a8/13751/latch free>
BLOCKER OF MANY SESSIONS: This section is found when a process is blocking a lot of other sessions. Usually when a process is blocking more than 10 sessions this section will appear in the trace file.
Found 21 objects waiting for <sid/sess_srno/proc_ptr/ospid/wait_event>
    <55/10/0xe41236a8/13751/latch free>
Found 12 objects waiting for <sid/sess_srno/proc_ptr/ospid/wait_event>
    <2098/2280/0xe42870d0/3022/db file scattered read>
Found 12 objects waiting for <sid/sess_srno/proc_ptr/ospid/wait_event>
    <1941/1783/0xe41ac9e0/462/No Wait>
Found 12 objects waiting for <sid/sess_srno/proc_ptr/ospid/wait_event>
    <980/3887/0xe4214964/24065/latch free>
OPEN CHAINS: This section reports sessions involved in a wait chain. A wait chain means that one session is blocking one or more other sessions.
Open chains found:
Chain 1 : <sid/sess_srno/proc_ptr/ospid/wait_event> :
    <2/1/0xe411b0f4/12280/db file parallel write>
Chain 2 : <sid/sess_srno/proc_ptr/ospid/wait_event> :
    <3/1/0xe411b410/12282/No Wait>
Chain 6 : <sid/sess_srno/proc_ptr/ospid/wait_event> :
    <18/1631/0xe4243cf8/25457/db file scattered read>
 -- <229/1568/0xe422b84c/8460/buffer busy waits>
Chain 17 : <sid/sess_srno/proc_ptr/ospid/wait_event> :
    <56/11/0xe4123ce0/13755/latch free>
 -- <2384/599/0xe41890dc/22488/latch free>
 -- <32/2703/0xe41fa284/25693/latch free>

OTHER CHAINS: It refers to chains of blockers and waiters related to other sessions identified under “open chains”, but not blocked directly by the process reported on the "open chain". 

Other chains found:

Chain 676 : <sid/sess_srno/proc_ptr/ospid/wait_event> :
    <20/93/0xe411d644/13597/latch free>
Chain 677 : <sid/sess_srno/proc_ptr/ospid/wait_event> :
    <27/1201/0xe41d3188/15809/latch free>
Chain 678 : <sid/sess_srno/proc_ptr/ospid/wait_event> :
    <36/1532/0xe428be8c/4232/latch free>
 -- <706/1216/0xe4121aac/23317/latch free>
Chain 679 : <sid/sess_srno/proc_ptr/ospid/wait_event> :
    <43/12/0xe4122d54/13745/latch free>
Chain 680 : <sid/sess_srno/proc_ptr/ospid/wait_event> :
    <80/2/0xe41290d4/13811/library cache pin>
 -- <1919/1134/0xe421fdbc/3343/enqueue>
STATE OF NODES:
nodenum: a sequence number assigned to each session, used internally in the trace file
sid: the session's SID
sess_srno: the session's SERIAL#
ospid: the OS process ID
state: the state of the node
adjlist: points to the blocker node(s)
predecessor: points to the waiter node
IN_HANG: the node is in a deadlocked state; usually one or more other nodes (its blockers) are in the same state
LEAF/LEAF_NW: such a node is usually a blocker. Whether it really is one can be judged from the entry's "predecessor" field. LEAF means the node is not waiting on any other resource, while LEAF_NW may be waiting on nothing or may be consuming CPU
NLEAF: these sessions can usually be regarded as the ones being blocked. When they appear it generally indicates a database performance problem rather than a true hang
IGN/IGN_DMP: these sessions are usually considered idle, unless their adjlist contains a node. If the session is not idle, the node in its adjlist is waiting for another node to release a resource
SINGLE_NODE/SINGLE_NODE_NW: roughly equivalent to idle sessions


1. The most important part of the trace

State of nodes
([nodenum]/cnode/sid/sess_srno/session/ospid/state/start/finish/[adjlist]/predecessor):
[145]/0/146/5/0x786fc944/3858/LEAF/9/10//153 -- LEAF: this SID (146) is the blocker; its predecessor node 153 corresponds to SID 154, the session it is blocking
[148]/0/149/1/0x78700160/3806/SINGLE_NODE/11/12//none
[150]/0/151/1/0x787026c8/3804/SINGLE_NODE/13/14//none
[153]/0/154/5/0x78705ee4/3903/NLEAF/15/16/[145]/none -- NLEAF: this SID (154) is being blocked; node [145] in its adjlist corresponds to SID 146, the blocker
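Having read the blocker and waiter off the node lines, resolving this test case only requires ending the transaction in SID 146 (commit or rollback) or killing that session. A sketch, taking serial# 5 from the sess_srno field of node [145]:

SQL> alter system kill session '146,5';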


A real-world case
A script that splits a table partition had been running for a long time with no response. After logging in to check and manually re-running the split statement, it did indeed hang:

SQL> ALTER TABLE A_PDA_SP_STAT SPLIT PARTITION P_MAX AT (20090609)
  INTO (PARTITION P_20090608 TABLESPACE TS_DATA_A, PARTITION P_MAX TABLESPACE TS_DATA_A);
-- check the session's wait event:
EVENT                          P1         P2         P3
------------------------------ ---------- ---------- ----------
row cache lock                          8          0          5
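For a 'row cache lock' wait, P1 is the dictionary cache id, so it can be translated into a cache name through v$rowcache. A minimal lookup, assuming P1 = 8 as reported above:

SQL> select distinct cache#, parameter from v$rowcache where cache# = 8;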
Some material found online attributes this to an undersized shared pool in the SGA, or to a sequence with too small a cache. After analysis, neither cause applies here: first, if the shared pool were too small, the usual symptom is a SQL statement that runs slowly but still completes, rather than hanging as it does here; second, the statement only splits a partition and involves no sequence at all.
So here we use hanganalyze for the analysis.
Let's generate and then examine the trace file:
SQL> select spid from v$session a,v$process b where a.paddr=b.addr and a.sid=295;
SPID
------------
19237
SQL> oradebug SETOSPID 19237
Oracle pid: 235, Unix process pid: 19237, image: oracle@hl_rdb01 (TNS V1-V3)
SQL> oradebug hanganalyze 3;
Cycle 1: (0/295)
Cycle 2: (0/254)--(0/239)
Hang Analysis in /oracle/app/oracle/admin/hlreport/udump/hlreport_ora_25247.trc
$ more /oracle/app/oracle/admin/hlreport/udump/hlreport_ora_25247.trc
Dump file /oracle/app/oracle/admin/hlreport/udump/hlreport_ora_25247.trc
Oracle9i Enterprise Edition Release 9.2.0.6.0 - 64bit Production
With the Partitioning, OLAP and Oracle Data Mining options
JServer Release 9.2.0.6.0 - Production
ORACLE_HOME = /oracle/app/oracle/product/9.2.0
System name:    HP-UX
Node name:      hl_rdb01
Release:        B.11.11
Version:        U
Machine:        9000/800
Instance name: hlreport
Redo thread mounted by this instance: 1
Oracle process number: 157
Unix process pid: 25247, image: oracle@hl_rdb01 (TNS V1-V3)
 
*** SESSION ID:(312.10459) 2009-05-20 16:21:58.423
*** 2009-05-20 16:21:58.423
==============
HANG ANALYSIS:
==============
Cycle 1 : <cnode/sid/sess_srno/proc_ptr/ospid/wait_event> :
    <0/329/43816/0x4d6b5638/23487/row cache lock>
 -- <0/254/19761/0x4d687438/23307/library cache lock>
Cycle 2 : <cnode/sid/sess_srno/proc_ptr/ospid/wait_event> :
    <0/295/57125/0x4d6b8978/19237/row cache lock>
Cycle 3 : <cnode/sid/sess_srno/proc_ptr/ospid/wait_event> :
    <0/295/57125/0x4d6b8978/19237/row cache lock>
Open chains found:
Other chains found:
Chain 1 : <cnode/sid/sess_srno/proc_ptr/ospid/wait_event> :
    <0/312/10459/0x4d69f9b8/25247/No Wait>
Extra information that will be dumped at higher levels:
[level  3] :   4 node dumps -- [IN_HANG]
[level  5] :   1 node dumps -- [SINGLE_NODE] [SINGLE_NODE_NW] [IGN_DMP]
[level 10] : 223 node dumps -- [IGN]
 
State of nodes
([nodenum]/cnode/sid/sess_srno/session/ospid/state/start/finish/[adjlist]/predecessor):
[0]/0/1/1/0x4d7146c0/5132/IGN/1/2//none
……………………………………………………
[238]/0/239/57618/0x4d7b18a0/13476/IN_HANG/395/402/[294][238][328][253]/none
……………………………………………………
[253]/0/254/19761/0x4d7bb710/23307/IN_HANG/397/400/[328][238][294]/294
………………………………………………………………
[294]/0/295/57125/0x4d7d6820/19237/IN_HANG/396/401/[294][238][253]/238
[328]/0/329/43816/0x4d7ecf40/23487/IN_HANG/398/399/[253]/253
………………………………………………………………
Dumping System_State and Fixed_SGA in process with ospid 13476
Dumping Process information for process with ospid 13476
Dumping Process information for process with ospid 23307
Dumping Process information for process with ospid 19237
Dumping Process information for process with ospid 23487
====================
END OF HANG ANALYSIS
====================
*** 2009-05-20 16:48:20.686
Now let's look at the key part of this trace:
Cycle 1 : <cnode/sid/sess_srno/proc_ptr/ospid/wait_event> :
    <0/329/43816/0x4d6b5638/23487/row cache lock>
 -- <0/254/19761/0x4d687438/23307/library cache lock>
Cycle 2 : <cnode/sid/sess_srno/proc_ptr/ospid/wait_event> :
    <0/295/57125/0x4d6b8978/19237/row cache lock>
Cycle 3 : <cnode/sid/sess_srno/proc_ptr/ospid/wait_event> :
    <0/295/57125/0x4d6b8978/19237/row cache lock>
A cycle represents a deadlock that Oracle has identified internally. Session 295, the one in which we are manually running the split, is part of it. Let's see what the other sessions are doing, for example 329:
SQL> select machine, status, program, sql_text from v$session a, v$sqlarea b
     where a.sql_address = b.address and a.sid = 329;
MACHINE   STATUS  PROGRAM                      SQL_TEXT
--------- ------- ---------------------------- ----------------------------------------------------------
hl_rdb01  ACTIVE  sqlplus@hl_rdb01 (TNS V1-V3) ALTER TABLE A_PDA_SP_STAT SPLIT PARTITION P_MAX AT (20090609)
                                               INTO (PARTITION P_20090608 TABLESPACE TS_DATA_A, PARTITION
                                               P_MAX TABLESPACE TS_DATA_A)

SQL> select event from v$session_wait where sid=329;
EVENT
--------------------------------------------
row cache lock
So session 329 is also executing the split statement. A colleague confirmed, however, that he had already killed the earlier failed script completely; presumably the process was left stuck inside the database and its resources were never fully released.
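To clean up the leftover session, it can be killed from a SYSDBA session. A sketch, taking serial# 43816 from the trace line for node [328] (SID 329):

SQL> alter system kill session '329,43816';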
After killing session 329, the statement still hung, so we ran hanganalyze again:
==============
HANG ANALYSIS:
==============
Cycle 1 : <cnode/sid/sess_srno/proc_ptr/ospid/wait_event> :
    <0/295/57125/0x4d6b8978/19237/row cache lock>
Cycle 2 : <cnode/sid/sess_srno/proc_ptr/ospid/wait_event> :
    <0/254/19761/0x4d687438/23307/library cache lock>
 -- <0/239/57618/0x4d6b74f8/13476/row cache lock>
We went on killing the other sessions in the chain, and finally the split in session 295 completed successfully.

SQL> ALTER TABLE A_PDA_SP_STAT SPLIT PARTITION P_MAX AT (20090609)
  INTO (PARTITION P_20090608 TABLESPACE TS_DATA_A, PARTITION P_MAX TABLESPACE TS_DATA_A);
Table altered.
Elapsed: 00:31:03.21
Splitting the next partition also completed quickly:
SQL> ALTER TABLE A_PDA_SP_STAT SPLIT PARTITION P_MAX AT (20090610)
  INTO (PARTITION P_20090609 TABLESPACE TS_DATA_A,
        PARTITION P_MAX TABLESPACE TS_DATA_A);
At this point the problem was resolved.
[238]/0/239/57618/0x4d7b18a0/13476/IN_HANG/395/402/[294][238][328][253]/none
[253]/0/254/19761/0x4d7bb710/23307/IN_HANG/397/400/[328][238][294]/294
[294]/0/295/57125/0x4d7d6820/19237/IN_HANG/396/401/[294][238][253]/238
[328]/0/329/43816/0x4d7ecf40/23487/IN_HANG/398/399/[253]/253
329 was blocking 254;
254 was blocking 295;
295 was blocking 239.
The sessions to kill were therefore 329, 254 and 295.
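After killing the blockers it is worth verifying that no session is still stuck on the dictionary or library cache. A simple check (any remaining rows would mean the chain has not been fully cleared):

SQL> select sid, event, p1, seconds_in_wait
  2    from v$session_wait
  3   where event in ('row cache lock', 'library cache lock');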

Tip: if the hang prevents sqlplus from logging in normally, you can use the -prelim option, e.g. sqlplus -prelim "/ as sysdba". A session connected this way cannot query views, but it can still shutdown abort the database.
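A sketch of that preliminary-connection route, assuming a release (such as the 9i/10g systems used above) where oradebug is still usable over a -prelim connection:

$ sqlplus -prelim "/ as sysdba"
SQL> oradebug setmypid
SQL> oradebug hanganalyze 3
-- and, if nothing else works:
SQL> shutdown abort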
