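Today a customer reported that one of their project databases hit ORA-00600: internal error code, arguments: [kfrValAcd30].
This error prevents the disk group from being mounted, and if the disk group cannot be mounted, the database cannot be opened.
The customer said they had hit this error several times in production before. Their usual recovery was to extract the files with AMDU and recreate the disk group. This time the project database was newly built, so they simply rebuilt it to get rid of the error.
Consider a database of a terabyte or more: restoring or rebuilding it just because one piece of metadata is damaged is a lot of work. Where a problem like this can be repaired quickly, it should be, to avoid heavyweight operations such as AMDU extraction or database recovery and to reduce business downtime.
Below, following ROGER's approach, I simulate the ORA-00600 [kfrValAcd30] failure myself and then repair it. I hope that the next time we meet a similar failure we can try this repair and see whether it works.
First, a brief explanation of the error: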
ORA-00600: internal error code, arguments: [kfrValAcd30], [DATA], [1], [40], [8319], [42], [8319], [], [], [], [], []
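This error means that while mounting the disk group, the ASM instance failed to read the checkpoint information during instance recovery. In other words, the Active Change Directory (ACD) checkpoint is damaged.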
NOTE: starting recovery of thread=1 ckpt=40.8319 group=1 (DATA)
Errors in file /u01/app/grid/diag/asm/+asm/+ASM/trace/+ASM_ora_17253.trc (incident=25737):
ORA-00600: internal error code, arguments: [kfrValAcd30], [DATA], [1], [40], [8319], [42], [8319], [], [], [], [], []
My guess at what the arguments mean:
[DATA], [1] <<< DISKGROUP_NAME, DISKGROUP_NUMBER;
[40], [8319] <<< current seq, blk;
[42], [8319] <<< required seq, blk;
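To make the guessed argument layout concrete, here is a minimal Python sketch that splits a kfrValAcd30 message into named fields. The field names (`current_seq`, `needed_seq` and so on) are my own labels for the guess above, not Oracle terminology, and the layout itself remains a guess.

```python
import re
from typing import NamedTuple


class KfrValAcd30(NamedTuple):
    # Field names are hypothetical labels for the guessed argument meaning.
    diskgroup: str
    group_number: int
    current_seq: int
    current_blk: int
    needed_seq: int
    needed_blk: int


_ARG = re.compile(r"\[([^\]]*)\]")


def parse_kfrvalacd30(message: str) -> KfrValAcd30:
    """Split an ORA-00600 [kfrValAcd30] message into the guessed fields."""
    args = _ARG.findall(message)
    if not args or args[0] != "kfrValAcd30":
        raise ValueError("not a kfrValAcd30 error")
    name, grp, cur_seq, cur_blk, need_seq, need_blk = args[1:7]
    return KfrValAcd30(name, int(grp), int(cur_seq), int(cur_blk),
                       int(need_seq), int(need_blk))


msg = ("ORA-00600: internal error code, arguments: [kfrValAcd30], [DATA], [1], "
       "[40], [8319], [42], [8319], [], [], [], [], []")
print(parse_kfrvalacd30(msg))
```

The same function works on the message as SQL*Plus prints it, as long as the wrapped lines are joined into one string first.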
So the cause is damaged ASM ACD (Active Change Directory) metadata. For background on the ASM ACD, COD and related structures, see ROGER's article:
http://www.killdb.com/2013/01/16/oracle-asm%E5%89%96%E6%9E%90%E7%B3%BB%E5%88%977-active-change-directory/
Environment:
DB: STAND ALONE Oracle Database 11g Enterprise Edition Release 11.2.0.4.0 - 64bit Production
OS: Red Hat Enterprise Linux Server release 6.8 (Santiago)
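Reproducing the fault
Since we know the ACD is what is damaged, we first need to find the blocks that hold the ACD metadata. The ACD is ASM file number 3, so the following SQL shows how its AUs are distributed.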
SQL> SELECT xnum_kffxp "Virtual extent",
            pxn_kffxp "Physical extent",
            au_kffxp "Allocation unit",
            disk_kffxp "Disk"
       FROM x$kffxp
      WHERE group_kffxp = 1   -- Diskgroup 1 (DATA)
        AND number_kffxp = 3  -- ASM file 3 (ACD)
      ORDER BY 1, 2;
Virtual extent Physical extent Allocation unit       Disk
-------------- --------------- --------------- ----------
             0               0               5          3
             0               1               4          1
             0               2      4294967294      65534
             1               3               5          1
             1               4               6          3
             1               5      4294967294      65534
             2               6               7          3
             2               7               6          1
             2               8      4294967294      65534
:::
            41             123              46          1
            41             124              46          3
            41             125      4294967294      65534
    2147483648               0              24          1
    2147483648               1              47          3
    2147483648               2      4294967294      65534
129 rows selected.
SQL>
The listing shows that the first virtual extent of the ACD is stored in AU 5 on disk 3 and AU 4 on disk 1.
SQL> col path for a20
SQL> select disk_number,path from v$asm_disk;
DISK_NUMBER PATH
----------- --------------------
          0 /dev/asm-diskd
          2
          1 /dev/asm-diskc       <<< path of disk 1
          3 /dev/asm-diskb       <<< path of disk 3
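Combining the extent map with the disk paths gives the devices and AUs that hold the ACD's first virtual extent. The sketch below does that lookup in Python and prints the matching `kfed read` command lines. It assumes that disk number 65534 marks an unused mirror slot rather than a real disk, and the helper names are made up for illustration.

```python
# Assumption: disk number 65534 marks an unallocated mirror slot, not a real disk.
UNALLOCATED_DISK = 65534

# (virtual extent, physical extent, allocation unit, disk) rows from x$kffxp
rows = [
    (0, 0, 5, 3),
    (0, 1, 4, 1),
    (0, 2, 4294967294, 65534),
]

# disk_number -> path, from v$asm_disk (disk 2 has no path in the listing)
paths = {0: "/dev/asm-diskd", 1: "/dev/asm-diskc", 3: "/dev/asm-diskb"}


def mirror_copies(rows, paths, virtual_extent=0):
    """Return (path, au) for every allocated copy of one virtual extent."""
    return [
        (paths[disk], au)
        for vx, _px, au, disk in rows
        if vx == virtual_extent and disk != UNALLOCATED_DISK
    ]


def kfed_read_commands(copies, blkn=0):
    """Build the kfed read command line for each copy (nothing is executed)."""
    return [f"kfed read {path} aun={au} blkn={blkn}" for path, au in copies]


for cmd in kfed_read_commands(mirror_copies(rows, paths)):
    print(cmd)
```

The script only builds strings. Running kfed against real disks is left as a deliberate manual step.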
ASMCMD> lsdg
State Type Rebal Sector Block AU Total_MB Free_MB Req_mir_free_MB Usable_file_MB Offline_disks Voting_files Name
MOUNTED NORMAL Y 512 4096 1048576 4096 0 0 0 1 N DATA/
ASMCMD>
The redundancy of this disk group is NORMAL.
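As a sanity check on these numbers, the size of the ACD can be worked out from the AU size, the metadata block size and the 42 virtual extents (0 to 41) in the extent listing. The short calculation below compares the result with kfracdc.blk0 = 1 and kfracdc.blks = 10751 from the kfed dump of block 0. Reading those two fields as "first change block" and "number of change blocks" is my assumption, not something the dump states.

```python
AU_SIZE = 1048576      # bytes, from lsdg
BLOCK_SIZE = 4096      # bytes, from lsdg
VIRTUAL_EXTENTS = 42   # virtual extents 0..41 in the x$kffxp listing

blocks_per_au = AU_SIZE // BLOCK_SIZE
total_blocks = VIRTUAL_EXTENTS * blocks_per_au

# Assumption: block 0 is the checkpoint block (KFBTYP_ACDC) and the
# remaining blocks, starting at kfracdc.blk0, hold change records.
KFRACDC_BLK0 = 1
KFRACDC_BLKS = 10751

print(blocks_per_au, total_blocks)
print(total_blocks == KFRACDC_BLK0 + KFRACDC_BLKS)
```

Under that reading the figures agree: 42 AUs of 256 blocks each give 10752 blocks, which is one checkpoint block plus 10751 change blocks.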
First back up the ACD metadata with the command below. The metadata is in the first block.
brucesong:/home/grid$kfed read /dev/asm-diskc aun=4 blkn=0 text=aun4_blkn0.txt
Change kfracdc.ckpt.seq in aun4_blkn0.txt to a smaller value. It was originally 42; I changed it to 40.
brucesong:/home/grid$vi aun4_blkn0.txt
kfbh.endian: 1 ; 0x000: 0x01
kfbh.hard: 130 ; 0x001: 0x82
kfbh.type: 7 ; 0x002: KFBTYP_ACDC
kfbh.datfmt: 1 ; 0x003: 0x01
kfbh.block.blk: 0 ; 0x004: blk=0
kfbh.block.obj: 3 ; 0x008: file=3
kfbh.check: 1111673974 ; 0x00c: 0x4242cc76
kfbh.fcn.base: 0 ; 0x010: 0x00000000
kfbh.fcn.wrap: 0 ; 0x014: 0x00000000
kfbh.spare1: 0 ; 0x018: 0x00000000
kfbh.spare2: 0 ; 0x01c: 0x00000000
kfracdc.eyec[0]: 65 ; 0x000: 0x41
kfracdc.eyec[1]: 67 ; 0x001: 0x43
kfracdc.eyec[2]: 68 ; 0x002: 0x44
kfracdc.eyec[3]: 67 ; 0x003: 0x43
kfracdc.thread: 1 ; 0x004: 0x00000001
kfracdc.lastAba.seq: 4294967295 ; 0x008: 0xffffffff
kfracdc.lastAba.blk: 4294967295 ; 0x00c: 0xffffffff
kfracdc.blk0: 1 ; 0x010: 0x00000001
kfracdc.blks: 10751 ; 0x014: 0x000029ff
kfracdc.ckpt.seq: 40 ; 0x018: 0x00000028 <<< original value: kfracdc.ckpt.seq: 42 ; 0x018: 0x0000002a
kfracdc.ckpt.blk: 8319 ; 0x01c: 0x0000207f
kfracdc.fcn.base: 66973 ; 0x020: 0x0001059d
kfracdc.fcn.wrap: 0 ; 0x024: 0x00000000
kfracdc.bufBlks: 256 ; 0x028: 0x00000100
kfracdc.strt112.seq: 2 ; 0x02c: 0x00000002
kfracdc.strt112.blk: 0 ; 0x030: 0x00000000
~
~
~
"aun4_blkn0.txt" 27L, 1570C written
Merge the edited block into /dev/asm-diskc and /dev/asm-diskb:
brucesong:/home/grid$kfed merge /dev/asm-diskc aun=4 blkn=0 text=aun4_blkn0.txt
brucesong:/home/grid$kfed merge /dev/asm-diskb aun=5 blkn=0 text=aun4_blkn0.txt
brucesong:/home/grid$
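The merge writes the dump taken from /dev/asm-diskc (AU 4) into both mirror copies, which presumes that the two copies hold the same block. A minimal Python sketch for checking that presumption before merging is to diff two kfed text dumps field by field. The second dump here is a hypothetical one, taken the same way from /dev/asm-diskb AU 5, and the inline sample strings stand in for the real files.

```python
def parse_kfed_dump(text: str) -> dict:
    """Map field name -> value string for each 'name: value ; comment' line."""
    fields = {}
    for line in text.splitlines():
        head, sep, _comment = line.partition(";")
        name, colon, value = head.partition(":")
        if sep and colon:  # skip lines that are not kfed field lines
            fields[name.strip()] = value.strip()
    return fields


def diff_dumps(a: str, b: str) -> dict:
    """Return {field: (value_in_a, value_in_b)} for every field that differs."""
    fa, fb = parse_kfed_dump(a), parse_kfed_dump(b)
    return {
        key: (fa.get(key), fb.get(key))
        for key in sorted(set(fa) | set(fb))
        if fa.get(key) != fb.get(key)
    }


# Stand-ins for the two dump files; the second one is hypothetical.
dump_diskc = "kfbh.type: 7 ; 0x002: KFBTYP_ACDC\nkfracdc.ckpt.seq: 42 ; 0x018: 0x0000002a\n"
dump_diskb = "kfbh.type: 7 ; 0x002: KFBTYP_ACDC\nkfracdc.ckpt.seq: 40 ; 0x018: 0x00000028\n"

print(diff_dumps(dump_diskc, dump_diskb))
```

An empty result means the two dumps agree on every field. Anything else is worth understanding before one copy is overwritten with the other.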
Start the ASM instance
SQL> startup
ASM instance started
Total System Global Area 1135747072 bytes
Fixed Size 2260728 bytes
Variable Size 1108320520 bytes
ASM Cache 25165824 bytes
ORA-00600: internal error code, arguments: [kfrValAcd30], [DATA], [1], [40], <<< the error has been reproduced
[8319], [42], [8319], [], [], [], [], []
ASM alert.log
NOTE: starting recovery of thread=1 ckpt=40.8319 group=1 (DATA)
Errors in file /u01/app/grid/diag/asm/+asm/+ASM/trace/+ASM_ora_17253.trc (incident=25737):
ORA-00600: internal error code, arguments: [kfrValAcd30], [DATA], [1], [40], [8319], [42], [8319], [], [], [], [], []
Incident details in: /u01/app/grid/diag/asm/+asm/+ASM/incident/incdir_25737/+ASM_ora_17253_i25737.trc
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
Errors in file /u01/app/grid/diag/asm/+asm/+ASM/trace/+ASM_ora_17253.trc:
ORA-00600: internal error code, arguments: [kfrValAcd30], [DATA], [1], [40], [8319], [42], [8319], [], [], [], [], []
NOTE: crash recovery signalled OER-600
ERROR: ORA-600 signalled during mount of diskgroup DATA
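The alert log supports the guess about the arguments: the checkpoint in the "starting recovery" NOTE (40.8319) equals the third and fourth arguments of the ORA-00600. A small Python sketch for pulling that checkpoint out of alert log text follows. The regular expression is written against the NOTE line exactly as it appears here, so other versions may need adjusting.

```python
import re

# Pattern for the NOTE line as it appears in this 11.2.0.4 alert log.
RECOVERY_NOTE = re.compile(
    r"starting recovery of thread=(\d+) ckpt=(\d+)\.(\d+) group=(\d+) \((\w+)\)"
)


def recovery_checkpoints(alert_log: str):
    """Yield (diskgroup, thread, seq, blk) for every recovery-start NOTE."""
    for m in RECOVERY_NOTE.finditer(alert_log):
        thread, seq, blk, _group, name = m.groups()
        yield name, int(thread), int(seq), int(blk)


sample = (
    "NOTE: starting recovery of thread=1 ckpt=40.8319 group=1 (DATA)\n"
    "ORA-00600: internal error code, arguments: [kfrValAcd30], [DATA], [1], "
    "[40], [8319], [42], [8319], [], [], [], [], []\n"
)
print(list(recovery_checkpoints(sample)))
```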
The fault has been reproduced. Now repair it.
Following the hint in the error, set kfracdc.ckpt.seq in aun4_blkn0.txt to the required value, 42:
kfracdc.ckpt.seq: 42 ; 0x018: 0x0000002a
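The edit to the dump file can also be scripted instead of done by hand in vi. The Python sketch below rewrites one field of a kfed text dump and refreshes the hexadecimal column so that the two stay consistent. Whether kfed merge reads the decimal or the hexadecimal column is not something I have verified, which is why the sketch updates both. The function name is hypothetical.

```python
import re


def set_kfed_field(text: str, field: str, value: int) -> str:
    """Return the dump with `field` set to `value`, refreshing the hex column."""
    pattern = re.compile(
        rf"^(?P<name>{re.escape(field)}:[ \t]*)\d+"
        rf"(?P<mid>[ \t]*;[ \t]*0x[0-9a-fA-F]+:[ \t]*)0x(?P<hex>[0-9a-fA-F]+)",
        re.MULTILINE,
    )

    def repl(m):
        width = len(m.group("hex"))  # keep the original zero padding
        return f"{m.group('name')}{value}{m.group('mid')}0x{value:0{width}x}"

    new_text, count = pattern.subn(repl, text)
    if count != 1:
        raise ValueError(f"expected exactly one {field} line, found {count}")
    return new_text


dump = (
    "kfracdc.ckpt.seq: 40 ; 0x018: 0x00000028\n"
    "kfracdc.ckpt.blk: 8319 ; 0x01c: 0x0000207f\n"
)
print(set_kfed_field(dump, "kfracdc.ckpt.seq", 42))
```

In practice the text would be read from and written back to aun4_blkn0.txt, with a copy of the untouched dump kept aside.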
Merge it into /dev/asm-diskc and /dev/asm-diskb again:
brucesong:/home/grid$kfed merge /dev/asm-diskc aun=4 blkn=0 text=aun4_blkn0.txt
brucesong:/home/grid$kfed merge /dev/asm-diskb aun=5 blkn=0 text=aun4_blkn0.txt
brucesong:/home/grid$
MOUNT DISKGROUP
SQL> alter diskgroup all mount;
alter diskgroup all mount
*
ERROR at line 1:
ORA-00600: internal error code, arguments: [kfrValAcd30], [DATA], [1], [42],
[8319], [43], [8323], [], [], [], [], []
ASM alert.log
NOTE: starting recovery of thread=1 ckpt=42.8319 group=1 (DATA)
Errors in file /u01/app/grid/diag/asm/+asm/+ASM/trace/+ASM_ora_17253.trc (incident=25738):
ORA-00600: internal error code, arguments: [kfrValAcd30], [DATA], [1], [42], [8319], [43], [8323], [], [], [], [], []
Incident details in: /u01/app/grid/diag/asm/+asm/+ASM/incident/incdir_25738/+ASM_ora_17253_i25738.trc
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
Errors in file /u01/app/grid/diag/asm/+asm/+ASM/trace/+ASM_ora_17253.trc:
ORA-00600: internal error code, arguments: [kfrValAcd30], [DATA], [1], [42], [8319], [43], [8323], [], [], [], [], []
NOTE: crash recovery signalled OER-600
ERROR: ORA-600 signalled during mount of diskgroup DATA
The mount still fails. This time the error asks for [43], [8323].
Following the hint, set kfracdc.ckpt.seq in aun4_blkn0.txt to the required value, 43:
kfracdc.ckpt.seq: 43 ; 0x018: 0x0000002b
Merge it into /dev/asm-diskc and /dev/asm-diskb once more, then mount.
brucesong:/home/grid$kfed merge /dev/asm-diskc aun=4 blkn=0 text=aun4_blkn0.txt
brucesong:/home/grid$kfed merge /dev/asm-diskb aun=5 blkn=0 text=aun4_blkn0.txt
brucesong:/home/grid$
SQL> alter diskgroup all mount;
Diskgroup altered.
SQL>
The disk group now mounts successfully.
ASM alert.log
Mon Mar 30 16:06:06 2020
NOTE: LGWR attempting to mount thread 1 for diskgroup 1 (DATA)
Process LGWR (pid 17239) is running at high priority QoS for Exadata I/O
NOTE: LGWR found thread 1 closed at ABA 43.8318
NOTE: LGWR mounted thread 1 for diskgroup 1 (DATA)
NOTE: LGWR opening thread 1 at fcn 0.66973 ABA 44.8319
NOTE: cache mounting group 1/0xFCE041E7 (DATA) succeeded
NOTE: cache ending mount (success) of group DATA number=1 incarn=0xfce041e7
Mon Mar 30 16:06:06 2020
NOTE: Instance updated compatible.asm to 11.2.0.0.0 for grp 1
SUCCESS: diskgroup DATA was mounted
SUCCESS: alter diskgroup all mount
Mon Mar 30 16:06:06 2020
NOTE: diskgroup resource ora.DATA.dg is online
Mon Mar 30 16:07:02 2020
Starting background process ASMB
Mon Mar 30 16:07:02 2020
ASMB started with pid=21, OS id=17332
Mon Mar 30 16:07:02 2020
NOTE: client +ASM:+ASM registered, osid 17334, mbr 0x0
Mon Mar 30 16:07:03 2020
NOTE: client orcl:orcl registered, osid 17377, mbr 0x1
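The simulated fault is now repaired. Real production failures are likely to be more complicated than this, though, and may not be recoverable with such a simple procedure.
The ACD plays the same role for ASM that redo plays for an Oracle database. Because it is redo, ASM needs it for instance recovery after a power loss, so a real failure would involve actual recovery content. My simulated case had no such content to recover; I will look into that when I run into it.
Incidentally, AMDU did not detect this metadata corruption at all: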
brucesong:/home/grid/amdu_2020_03_30_16_51_07$cat report.txt
-*-amdu-*-
******************************* AMDU Settings ********************************
ORACLE_HOME = /u01/11.2.0/grid
System name: Linux
Node name: brucesong
Release: 2.6.32-642.el6.x86_64
Version: #1 SMP Wed Apr 13 00:51:26 EDT 2016
Machine: x86_64
amdu run: 30-MAR-20 16:51:07
Endianess: 1
:::
************************** SCANNING DISKGROUP DATA ***************************
Creation Time: 2019/10/30 04:59:48.287000
Disks Discovered: 3
Redundancy: 2
AU Size: 1048576 bytes
Metadata Block Size: 4096 bytes
Physical Sector Size: 512 bytes
Metadata Stride: 113792 AU
Duplicate Disk Numbers: 0
---------------------------- SCANNING DISK N0002 -----------------------------
Disk N0002: '/dev/asm-diskc'
Allocated AU's: 2048
Free AU's: 0
AU's read for dump: 0
Block images saved: 0
Map lines written: 0
Heartbeats seen: 0
Corrupt metadata blocks: 0
Corrupt AT blocks: 0
---------------------------- SCANNING DISK N0003 -----------------------------
Disk N0003: '/dev/asm-diskd'
Allocated AU's: 1480
Free AU's: 568
AU's read for dump: 0
Block images saved: 0
Map lines written: 0
Heartbeats seen: 0
Corrupt metadata blocks: 0
Corrupt AT blocks: 0
---------------------------- SCANNING DISK N0001 -----------------------------
Disk N0001: '/dev/asm-diskb'
Allocated AU's: 2048
Free AU's: 0
AU's read for dump: 0
Block images saved: 0
Map lines written: 0
Heartbeats seen: 0
Corrupt metadata blocks: 0
Corrupt AT blocks: 0
------------------------- SUMMARY FOR DISKGROUP DATA -------------------------
Allocated AU's: 5576
Free AU's: 568
AU's read for dump: 0
Block images saved: 0
Map lines written: 0
Heartbeats seen: 0
Corrupt metadata blocks: 0
Corrupt AT blocks: 0
:::
******************************* END OF REPORT ********************************
brucesong:/home/grid/amdu_2020_03_30_16_51_07$
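Checking a long report.txt by eye is error-prone, so here is a minimal Python sketch that collects the "Corrupt metadata blocks" and "Corrupt AT blocks" counters for each disk section and for the disk group summary. The section and counter patterns are taken from the report shown above. The function names are hypothetical, and the sample is a shortened excerpt of that report.

```python
import re

SECTION = re.compile(r"-+ (SCANNING DISK \S+|SUMMARY FOR DISKGROUP \S+) -+")
CORRUPT = re.compile(r"\s*Corrupt (metadata|AT) blocks:\s*(\d+)")


def corrupt_counts(report: str) -> dict:
    """Return {section: {'metadata': n, 'AT': n}} for each disk and the summary."""
    counts = {}
    section = None
    for line in report.splitlines():
        header = SECTION.search(line)
        if header:
            section = header.group(1)
            continue
        hit = CORRUPT.match(line)
        if hit and section is not None:
            counts.setdefault(section, {})[hit.group(1)] = int(hit.group(2))
    return counts


def sections_with_corruption(counts: dict) -> list:
    """Names of the sections where any corruption counter is non-zero."""
    return [name for name, c in counts.items() if any(c.values())]


sample = """\
---------------------------- SCANNING DISK N0002 -----------------------------
Disk N0002: '/dev/asm-diskc'
Corrupt metadata blocks: 0
Corrupt AT blocks: 0
------------------------- SUMMARY FOR DISKGROUP DATA -------------------------
Corrupt metadata blocks: 0
Corrupt AT blocks: 0
"""
counts = corrupt_counts(sample)
print(sections_with_corruption(counts))
```

For the report above the list is empty, which matches the observation that AMDU reported no corrupt blocks.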
The End..