ORA-00600: internal error code, arguments: [kfrValAcd30]

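Today a customer reported that an engineering database had hit ORA-00600: internal error code, arguments: [kfrValAcd30].
This error prevents the disk group from being mounted, and if the disk group cannot be mounted the database cannot be opened either.
The customer said they had hit the same error several times in production, where the usual recovery was to extract the files with AMDU and recreate the disk group. This time the engineering database was newly built, so the customer simply rebuilt it to get rid of the error.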

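Consider a database of a terabyte or more: restoring or rebuilding it because one piece of metadata is damaged is a lot of work. Where a problem like this can be repaired quickly, it should be, so that heavyweight operations such as an AMDU extraction or a database restore are avoided and business downtime is kept short.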

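Following ROGER's approach, I reproduced the ORA-00600 [kfrValAcd30] failure myself and then repaired it. The hope is that the next time we hit a similar failure we can try this repair first and see whether it works.
First, a brief explanation of the error: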

ORA-00600: internal error code, arguments: [kfrValAcd30], [DATA], [1], [40], [8319], [42], [8319], [], [], [], [], []
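This error means that, while the disk group was being mounted, the ASM instance failed to read the checkpoint information during instance recovery. In other words, the Active Change Directory (ACD) checkpoint is corrupted.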

NOTE: starting recovery of thread=1 ckpt=40.8319 group=1 (DATA)
Errors in file /u01/app/grid/diag/asm/+asm/+ASM/trace/+ASM_ora_17253.trc  (incident=25737):
ORA-00600: internal error code, arguments: [kfrValAcd30], [DATA], [1], [40], [8319], [42], [8319], [], [], [], [], []

My guess is that the arguments mean:
[DATA], [1]   <<< DISKGROUP_NAME, DISKGROUP_NUMBER
[40], [8319]  <<< current seq, blk
[42], [8319]  <<< required seq, blk
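When the same error shows up repeatedly, it helps to pull the arguments out of the message mechanically. The sketch below is only an illustration: the field names (`current_seq`, `required_seq`, and so on) follow my guess above about what each argument means, not any documented Oracle layout, and `parse_kfrvalacd30` is a hypothetical helper name.

```python
import re
from typing import NamedTuple


class KfrValAcd30(NamedTuple):
    # Field names follow the guessed argument meanings; they are not documented by Oracle.
    group_name: str
    group_number: int
    current_seq: int
    current_blk: int
    required_seq: int
    required_blk: int


def parse_kfrvalacd30(message: str) -> KfrValAcd30:
    """Extract the first six bracketed arguments that follow [kfrValAcd30]."""
    # Collapse whitespace so a message wrapped over two lines still parses.
    flat = " ".join(message.split())
    marker = "[kfrValAcd30]"
    start = flat.find(marker)
    if start < 0:
        raise ValueError("not a kfrValAcd30 error message")
    args = re.findall(r"\[([^\]]*)\]", flat[start + len(marker):])
    if len(args) < 6:
        raise ValueError("expected at least six arguments after the marker")
    name, number, cur_seq, cur_blk, req_seq, req_blk = args[:6]
    return KfrValAcd30(name, int(number), int(cur_seq), int(cur_blk),
                       int(req_seq), int(req_blk))


msg = ("ORA-00600: internal error code, arguments: [kfrValAcd30], [DATA], [1], "
       "[40], [8319], [42], [8319], [], [], [], [], []")
info = parse_kfrvalacd30(msg)
print(info.group_name, info.current_seq, info.required_seq)
```

Comparing `current_seq` with `required_seq` then shows directly which checkpoint value the mount is asking for.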

So the error is caused by corruption of the ASM ACD (Active Change Directory) metadata. For background on the ASM ACD, COD and related structures, see ROGER's article:
http://www.killdb.com/2013/01/16/oracle-asm%E5%89%96%E6%9E%90%E7%B3%BB%E5%88%977-active-change-directory/

Environment:

DB: STAND ALONE Oracle Database 11g Enterprise Edition Release 11.2.0.4.0 - 64bit Production
OS: Red Hat Enterprise Linux Server release 6.8 (Santiago)

Reproducing the failure

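Since we know the ACD is damaged, we first need to find the blocks that hold the ACD metadata. The ACD is ASM file number 3, so the following SQL shows how its AUs are distributed.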

SQL> SELECT xnum_kffxp "Virtual extent",
            pxn_kffxp "Physical extent",
            au_kffxp "Allocation unit",
            disk_kffxp "Disk"
     FROM x$kffxp
     WHERE group_kffxp = 1      -- Diskgroup 1 (DATA)
       AND number_kffxp = 3     -- ASM file 3 (ACD)
     ORDER BY 1, 2;

Virtual extent Physical extent Allocation unit       Disk
-------------- --------------- --------------- ----------
             0               0               5          3
             0               1               4          1
             0               2      4294967294      65534
             1               3               5          1
             1               4               6          3
             1               5      4294967294      65534
             2               6               7          3
             2               7               6          1
             2               8      4294967294      65534
:::
            41             123              46          1
            41             124              46          3
            41             125      4294967294      65534
    2147483648               0              24          1
    2147483648               1              47          3
    2147483648               2      4294967294      65534

129 rows selected.

SQL> 

The listing shows that the first virtual extent of the ACD is on AU 5 of disk 3 and AU 4 of disk 1.
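Every third row of the extent map carries AU 4294967294 on disk 65534. I read these as placeholders for a mirror copy that was never allocated, which is an assumption based on the pattern in the output rather than something confirmed here. Under that assumption, a small sketch (the helper name `allocated_copies` is made up) that keeps only the real copies per virtual extent looks like this:

```python
# Placeholder values as they appear in the x$kffxp output above.
# Treating them as "no copy allocated" is an assumption.
UNALLOCATED_AU = 4294967294    # 0xFFFFFFFE
UNALLOCATED_DISK = 65534       # 0xFFFE


def allocated_copies(rows):
    """Group (virtual extent, physical extent, AU, disk) rows by virtual extent.

    Returns {virtual_extent: [(disk, au), ...]} and skips placeholder rows.
    """
    copies = {}
    for vext, _pext, au, disk in rows:
        if au == UNALLOCATED_AU and disk == UNALLOCATED_DISK:
            continue
        copies.setdefault(vext, []).append((disk, au))
    return copies


# First two virtual extents, copied from the query output.
rows = [
    (0, 0, 5, 3),
    (0, 1, 4, 1),
    (0, 2, 4294967294, 65534),
    (1, 3, 5, 1),
    (1, 4, 6, 3),
    (1, 5, 4294967294, 65534),
]
copies = allocated_copies(rows)
print(copies[0])  # [(3, 5), (1, 4)]
```

For virtual extent 0 this leaves exactly the two locations used in the rest of the walkthrough: disk 3 AU 5 and disk 1 AU 4.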

SQL> col path for a20 
SQL> select disk_number,path from v$asm_disk;

DISK_NUMBER PATH
----------- --------------------
          0 /dev/asm-diskd
          2
          1 /dev/asm-diskc         <<< path of disk 1
          3 /dev/asm-diskb         <<< path of disk 3

ASMCMD> lsdg
State    Type    Rebal  Sector  Block       AU  Total_MB  Free_MB  Req_mir_free_MB  Usable_file_MB  Offline_disks  Voting_files  Name
MOUNTED  NORMAL  Y         512   4096  1048576      4096        0                0               0              1             N  DATA/
ASMCMD> 

The disk group uses NORMAL redundancy.

First back up the ACD metadata with the command below. The metadata is in the first block (blkn=0).

brucesong:/home/grid$
brucesong:/home/grid$
brucesong:/home/grid$kfed read /dev/asm-diskc aun=4 blkn=0 text=aun4_blkn0.txt
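To take an additional raw copy of the same block with a generic tool, you need its byte offset on the device. The sketch below works it out from the AU size and block size reported by lsdg above. It assumes AUs are laid out back to back from the start of the device, which matches how I understand the aun/blkn addressing but is not verified here, and `block_offset` is a hypothetical helper.

```python
AU_SIZE = 1048576    # bytes, "AU" column of the lsdg output
BLOCK_SIZE = 4096    # bytes, "Block" column of the lsdg output


def block_offset(aun, blkn, au_size=AU_SIZE, block_size=BLOCK_SIZE):
    """Byte offset of metadata block blkn inside allocation unit aun.

    Assumes AU n starts at n * au_size from the beginning of the device.
    """
    if aun < 0:
        raise ValueError("aun must be non-negative")
    if not 0 <= blkn < au_size // block_size:
        raise ValueError("blkn is outside the allocation unit")
    return aun * au_size + blkn * block_size


print(block_offset(4, 0))  # 4194304, the copy on /dev/asm-diskc
print(block_offset(5, 0))  # 5242880, the copy on /dev/asm-diskb
```

With 1 MB AUs and 4 KB blocks, each AU holds 256 metadata blocks, so blkn must stay below 256.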

Edit kfracdc.ckpt.seq in aun4_blkn0.txt to a smaller value. It was originally 42; I changed it to 40.

brucesong:/home/grid$vi aun4_blkn0.txt 
kfbh.endian:                          1 ; 0x000: 0x01
kfbh.hard:                          130 ; 0x001: 0x82
kfbh.type:                            7 ; 0x002: KFBTYP_ACDC
kfbh.datfmt:                          1 ; 0x003: 0x01
kfbh.block.blk:                       0 ; 0x004: blk=0
kfbh.block.obj:                       3 ; 0x008: file=3
kfbh.check:                  1111673974 ; 0x00c: 0x4242cc76
kfbh.fcn.base:                        0 ; 0x010: 0x00000000
kfbh.fcn.wrap:                        0 ; 0x014: 0x00000000
kfbh.spare1:                          0 ; 0x018: 0x00000000
kfbh.spare2:                          0 ; 0x01c: 0x00000000
kfracdc.eyec[0]:                     65 ; 0x000: 0x41
kfracdc.eyec[1]:                     67 ; 0x001: 0x43
kfracdc.eyec[2]:                     68 ; 0x002: 0x44
kfracdc.eyec[3]:                     67 ; 0x003: 0x43
kfracdc.thread:                       1 ; 0x004: 0x00000001
kfracdc.lastAba.seq:         4294967295 ; 0x008: 0xffffffff
kfracdc.lastAba.blk:         4294967295 ; 0x00c: 0xffffffff
kfracdc.blk0:                         1 ; 0x010: 0x00000001
kfracdc.blks:                     10751 ; 0x014: 0x000029ff
kfracdc.ckpt.seq:                    40 ; 0x018: 0x00000028   <<< original value: kfracdc.ckpt.seq: 42 ; 0x018: 0x0000002a
kfracdc.ckpt.blk:                  8319 ; 0x01c: 0x0000207f
kfracdc.fcn.base:                 66973 ; 0x020: 0x0001059d
kfracdc.fcn.wrap:                     0 ; 0x024: 0x00000000
kfracdc.bufBlks:                    256 ; 0x028: 0x00000100
kfracdc.strt112.seq:                  2 ; 0x02c: 0x00000002
kfracdc.strt112.blk:                  0 ; 0x030: 0x00000000
~
~
~
"aun4_blkn0.txt" 27L, 1570C written    
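Editing the dump by hand makes it easy to change the decimal value and leave a stale hex value next to it. The sketch below rewrites both columns of one field together. It assumes the `name: decimal ; offset: hex` line layout shown in the dump above, and `set_kfed_field` is a made-up helper, not part of kfed. I have not checked which of the two columns kfed merge actually reads, so keeping them consistent sidesteps that question.

```python
import re


def set_kfed_field(text, field, value):
    """Return text with one field's decimal and hex values rewritten together."""
    pattern = re.compile(
        r"^(?P<name>" + re.escape(field) + r":)(?P<pad>[ \t]+)(?P<dec>\d+)"
        r"(?P<mid>[ \t]*;[ \t]*0x[0-9a-fA-F]+:[ \t]*)0x(?P<hex>[0-9a-fA-F]+)",
        re.MULTILINE,
    )

    def repl(match):
        # Keep the original column width so the file stays aligned.
        width = len(match.group("pad")) + len(match.group("dec"))
        hex_digits = len(match.group("hex"))
        return (match.group("name")
                + " " + str(value).rjust(width - 1)
                + match.group("mid")
                + "0x" + format(value, "0{}x".format(hex_digits)))

    new_text, count = pattern.subn(repl, text)
    if count != 1:
        raise ValueError("expected exactly one line for field " + field)
    return new_text


line = "kfracdc.ckpt.seq:" + " " * 20 + "42 ; 0x018: 0x0000002a"
print(set_kfed_field(line, "kfracdc.ckpt.seq", 40))
```

Reading the file, calling the helper, and writing the result back replaces the manual vi step. Lines after the changed field are left untouched.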

Merge the edited block into both mirror copies, on /dev/asm-diskc and /dev/asm-diskb:

brucesong:/home/grid$kfed merge /dev/asm-diskc aun=4 text=aun4_blkn0.txt
brucesong:/home/grid$kfed merge /dev/asm-diskb aun=5 blkn=0 text=aun4_blkn0.txt
brucesong:/home/grid$

Start the ASM instance

SQL> startup
ASM instance started

Total System Global Area 1135747072 bytes
Fixed Size                  2260728 bytes
Variable Size            1108320520 bytes
ASM Cache                  25165824 bytes
ORA-00600: internal error code, arguments: [kfrValAcd30], [DATA], [1], [40],    <<< the error has been reproduced
[8319], [42], [8319], [], [], [], [], []

ASM alert.log

NOTE: starting recovery of thread=1 ckpt=40.8319 group=1 (DATA)
Errors in file /u01/app/grid/diag/asm/+asm/+ASM/trace/+ASM_ora_17253.trc  (incident=25737):
ORA-00600: internal error code, arguments: [kfrValAcd30], [DATA], [1], [40], [8319], [42], [8319], [], [], [], [], []
Incident details in: /u01/app/grid/diag/asm/+asm/+ASM/incident/incdir_25737/+ASM_ora_17253_i25737.trc
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
Errors in file /u01/app/grid/diag/asm/+asm/+ASM/trace/+ASM_ora_17253.trc:
ORA-00600: internal error code, arguments: [kfrValAcd30], [DATA], [1], [40], [8319], [42], [8319], [], [], [], [], []
NOTE: crash recovery signalled OER-600
ERROR: ORA-600 signalled during mount of diskgroup DATA

The failure is reproduced. Now for the repair.

As the error indicates, change kfracdc.ckpt.seq in aun4_blkn0.txt to the required value, 42:

kfracdc.ckpt.seq:                    42 ; 0x018: 0x0000002a

Merge into /dev/asm-diskc and /dev/asm-diskb again:

brucesong:/home/grid$kfed merge /dev/asm-diskc aun=4 text=aun4_blkn0.txt
brucesong:/home/grid$kfed merge /dev/asm-diskb aun=5 blkn=0 text=aun4_blkn0.txt
brucesong:/home/grid$

MOUNT DISKGROUP

SQL> alter diskgroup all mount;
alter diskgroup all mount
*
ERROR at line 1:
ORA-00600: internal error code, arguments: [kfrValAcd30], [DATA], [1], [42],
[8319], [43], [8323], [], [], [], [], []

ASM alert.log

NOTE: starting recovery of thread=1 ckpt=42.8319 group=1 (DATA)
Errors in file /u01/app/grid/diag/asm/+asm/+ASM/trace/+ASM_ora_17253.trc  (incident=25738):
ORA-00600: internal error code, arguments: [kfrValAcd30], [DATA], [1], [42], [8319], [43], [8323], [], [], [], [], []
Incident details in: /u01/app/grid/diag/asm/+asm/+ASM/incident/incdir_25738/+ASM_ora_17253_i25738.trc
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
Errors in file /u01/app/grid/diag/asm/+asm/+ASM/trace/+ASM_ora_17253.trc:
ORA-00600: internal error code, arguments: [kfrValAcd30], [DATA], [1], [42], [8319], [43], [8323], [], [], [], [], []
NOTE: crash recovery signalled OER-600
ERROR: ORA-600 signalled during mount of diskgroup DATA

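It still fails, and this time the error asks for [43], [8323].

Following the error again, change kfracdc.ckpt.seq in aun4_blkn0.txt to the required value, 43:

kfracdc.ckpt.seq:                    43 ; 0x018: 0x0000002b

Merge into /dev/asm-diskc and /dev/asm-diskb once more, then mount: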

brucesong:/home/grid$kfed merge /dev/asm-diskc aun=4 text=aun4_blkn0.txt
brucesong:/home/grid$kfed merge /dev/asm-diskb aun=5 blkn=0 text=aun4_blkn0.txt
brucesong:/home/grid$

SQL>  alter diskgroup all mount;    

Diskgroup altered.    

SQL> 

The disk group is now mounted successfully.

ASM alert.log

Mon Mar 30 16:06:06 2020
NOTE: LGWR attempting to mount thread 1 for diskgroup 1 (DATA)
Process LGWR (pid 17239) is running at high priority QoS for Exadata I/O
NOTE: LGWR found thread 1 closed at ABA 43.8318
NOTE: LGWR mounted thread 1 for diskgroup 1 (DATA)
NOTE: LGWR opening thread 1 at fcn 0.66973 ABA 44.8319
NOTE: cache mounting group 1/0xFCE041E7 (DATA) succeeded
NOTE: cache ending mount (success) of group DATA number=1 incarn=0xfce041e7
Mon Mar 30 16:06:06 2020
NOTE: Instance updated compatible.asm to 11.2.0.0.0 for grp 1
SUCCESS: diskgroup DATA was mounted
SUCCESS:  alter diskgroup all mount
Mon Mar 30 16:06:06 2020
NOTE: diskgroup resource ora.DATA.dg is online
Mon Mar 30 16:07:02 2020
Starting background process ASMB
Mon Mar 30 16:07:02 2020
ASMB started with pid=21, OS id=17332 
Mon Mar 30 16:07:02 2020
NOTE: client +ASM:+ASM registered, osid 17334, mbr 0x0
Mon Mar 30 16:07:03 2020
NOTE: client orcl:orcl registered, osid 17377, mbr 0x1

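At this point the simulated failure is repaired. Real production failures, however, may well be more complicated than this simple example and will not necessarily be recoverable this way.
The ACD plays the role for ASM that redo plays for an Oracle database. Because it is redo, Oracle needs it for instance recovery after a power loss, so in a real failure it will contain actual changes that have to be recovered. In my simulated case there was nothing to recover; I will look into that when I run into a real one.
As an aside, AMDU did not flag this metadata corruption at all: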

brucesong:/home/grid/amdu_2020_03_30_16_51_07$cat report.txt 
-*-amdu-*-

******************************* AMDU Settings ********************************
ORACLE_HOME = /u01/11.2.0/grid
System name:    Linux
Node name:      brucesong
Release:        2.6.32-642.el6.x86_64
Version:        #1 SMP Wed Apr 13 00:51:26 EDT 2016
Machine:        x86_64
amdu run:       30-MAR-20 16:51:07
Endianess:      1

:::

************************** SCANNING DISKGROUP DATA ***************************
            Creation Time: 2019/10/30 04:59:48.287000
         Disks Discovered: 3
               Redundancy: 2
                  AU Size: 1048576 bytes
      Metadata Block Size: 4096 bytes
     Physical Sector Size: 512 bytes
          Metadata Stride: 113792 AU
   Duplicate Disk Numbers: 0


---------------------------- SCANNING DISK N0002 -----------------------------
Disk N0002: '/dev/asm-diskc'
           Allocated AU's: 2048
                Free AU's: 0
       AU's read for dump: 0
       Block images saved: 0
        Map lines written: 0
          Heartbeats seen: 0
  Corrupt metadata blocks: 0
        Corrupt AT blocks: 0


---------------------------- SCANNING DISK N0003 -----------------------------
Disk N0003: '/dev/asm-diskd'
           Allocated AU's: 1480
                Free AU's: 568
       AU's read for dump: 0
       Block images saved: 0
        Map lines written: 0
          Heartbeats seen: 0
  Corrupt metadata blocks: 0
        Corrupt AT blocks: 0


---------------------------- SCANNING DISK N0001 -----------------------------
Disk N0001: '/dev/asm-diskb'
           Allocated AU's: 2048
                Free AU's: 0
       AU's read for dump: 0
       Block images saved: 0
        Map lines written: 0
          Heartbeats seen: 0
  Corrupt metadata blocks: 0
        Corrupt AT blocks: 0


------------------------- SUMMARY FOR DISKGROUP DATA -------------------------
           Allocated AU's: 5576
                Free AU's: 568
       AU's read for dump: 0
       Block images saved: 0
        Map lines written: 0
          Heartbeats seen: 0
  Corrupt metadata blocks: 0
        Corrupt AT blocks: 0
:::


******************************* END OF REPORT ********************************
brucesong:/home/grid/amdu_2020_03_30_16_51_07$
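On a disk group with many disks it is tedious to read through each section of report.txt. A small sketch that collects the "Corrupt metadata blocks" count per section is shown below. It assumes the section banners and field label look exactly like those in the report above, and `corrupt_counts` is a hypothetical helper name.

```python
import re


def corrupt_counts(report):
    """Map each banner title to the 'Corrupt metadata blocks' count that follows it."""
    counts = {}
    section = None
    for raw in report.splitlines():
        line = raw.strip()
        # Banners look like "---- SCANNING DISK N0002 ----" or "**** ... ****".
        banner = re.match(r"^[-*]+ (.+?) [-*]+$", line)
        if banner:
            section = banner.group(1)
            continue
        field = re.match(r"^Corrupt metadata blocks:\s*(\d+)$", line)
        if field and section is not None:
            counts[section] = int(field.group(1))
    return counts


sample = """\
---------------------------- SCANNING DISK N0002 -----------------------------
Disk N0002: '/dev/asm-diskc'
  Corrupt metadata blocks: 0
------------------------- SUMMARY FOR DISKGROUP DATA -------------------------
  Corrupt metadata blocks: 0
"""
print(corrupt_counts(sample))
```

As this case shows, a count of zero only means AMDU flagged nothing. It does not prove that the ACD checkpoint values are consistent.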

The End..
