A Case of a Database Failing to Open After Storage Cache Loss

Original link: http://www.killdb.com/2017/09/27/%e5%ad%98%e5%82%a8cache-%e4%b8%a2%e5%a4%b1%e5%af%bc%e8%87%b4%e6%95%b0%e6%8d%ae%e5%ba%93%e6%97%a0%e6%b3%95open%e7%9a%84%e6%a1%88%e4%be%8b/

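Recently, one of a customer's core databases could not start after the storage cache was cleared because of a storage problem. First, let's look at the errors the database reported at startup: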

Thu Sep 21 19:35:55 2017

WARNING: Write Failed. group:1 disk:3 AU:53436 offset:95744 size:1024

Errors in file /u01/app/oracle/diag/rdbms/ods/xxx2/trace/xxx2_lgwr_14636.trc:

ORA-15080: synchronous I/O operation to a disk failed

WARNING: failed to write mirror side 1 of virtual extent 43 logical extent 0 of file 265 in group 1 on disk 3 allocation unit 53436

Errors in file /u01/app/oracle/diag/rdbms/xxx/xxx2/trace/xxx2_lgwr_14636.trc:

ORA-00345: redo log write error block 88251 count 2

ORA-00312: online log 3 thread 2: '+DATA/xxx/onlinelog/group_3.265.816035881'

ORA-15081: failed to submit an I/O operation to a disk

Errors in file /u01/app/oracle/diag/rdbms/xxx/xxx2/trace/xxx2_lgwr_14636.trc:

ORA-00340: IO error processing online log 3 of thread 2

ORA-00345: redo log write error block 88251 count 2

ORA-00312: online log 3 thread 2: '+DATA/xxx/onlinelog/group_3.265.816035881'

ORA-15081: failed to submit an I/O operation to a disk

LGWR (ospid: 14636): terminating the instance due to error 340

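The errors themselves are not complicated: Oracle could no longer write to the redo logfile. The database runs in NOARCHIVELOG mode and has only logical backups, so even a successful recovery might lose data. When I first attempted the recovery, I was surprised to find that the database could not even be mounted: the instance was terminated during the mount, with no obvious error reported. A failure during mount means the controlfile is undoubtedly damaged. Since there was no controlfile backup, I recreated the control file manually with the following script: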

CREATE CONTROLFILE REUSE DATABASE "XXX" RESETLOGS NOARCHIVELOG

MAXLOGFILES 16

MAXLOGMEMBERS 3

MAXDATAFILES 100

MAXINSTANCES 8

MAXLOGHISTORY 584

LOGFILE

GROUP 1 '+data/ods/ONLINELOG/group_1.257.816033845' SIZE 500M BLOCKSIZE 512,

GROUP 2 '+data/xxx/ONLINELOG/group_2.258.816033845' SIZE 500M BLOCKSIZE 512,

GROUP 3 '+data/xxx/ONLINELOG/group_3.265.816035881' SIZE 500M BLOCKSIZE 512,

GROUP 4 '+data/xxx/ONLINELOG/group_4.266.816035883' SIZE 500M BLOCKSIZE 512,

GROUP 5 '+data/xxx/ONLINELOG/group_5.275.816036347' SIZE 500M BLOCKSIZE 512,

GROUP 6 '+data/xxx/ONLINELOG/group_6.277.816036359' SIZE 500M BLOCKSIZE 512

DATAFILE

'+DATA/xxx/datafile/system.259.816033847',

'+DATA/xxx/datafile/sysaux.260.816033849',

'+DATA/xxx/datafile/undotbs1.261.816033851',

'+DATA/xxx/datafile/undotbs2.263.816033859',

'+DATA/xxx/datafile/users.264.816033859',

'+DATA/xxx/datafile/tbs_tbdata.278.816036381',

'+DATA/xxx/datafile/tbs_omdata.283.816036779',

'+DATA/xxx/datafile/tbs_cmdata.284.816036813',

'+DATA/xxx/datafile/tbs_dmdata.285.816036857',

'+DATA/xxx/datafile/tbs_dbetl.286.816036893',

'+DATA/xxx/datafile/tbs_schedule.287.816036909',

'+DATA/xxx/datafile/tbs_meast.288.816036915',

'+DATA/xxx/datafile/tbs_m1104.289.816036939',

'+DATA/xxx/datafile/tbs_mpisa.293.842192725',

'+DATA/xxx/datafile/tbs_mpfsc',

'+DATA/xxx/datafile/tbs_msafe',

'+DATA/xxx/datafile/tbs_mecsp',

'+DATA/xxx/datafile/tbs_mpbss',

'+DATA/xxx/datafile/tbs_mpbfc',

'+DATA/xxx/datafile/idx_cmdata'

CHARACTER SET ZHS16GBK;

The control file was recreated successfully. In fact, I had first tried creating it with NORESETLOGS, but that failed with:

ORA-00600: internal error code, arguments: [2762], [4294967295], [1024000], [+DATA/xxx/onlinelog/group_3.265.816035881], [], [], [], [], [], [], [], []

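Clearly the redo logfile is damaged, so the control file could only be created with RESETLOGS. After creating it, I ran recover database using backup controlfile until cancel, then forced the database open with hidden parameters, only to hit the following errors: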

Fri Sep 22 13:00:10 2017

SMON: enabling cache recovery

Instance recovery: looking for dead threads

Instance recovery: lock domain invalid but no dead threads

Errors in file /u01/app/oracle/diag/rdbms/xxx/xxx1/trace/xxx1_ora_1593.trc (incident=120288):

ORA-00600: internal error code, arguments: [2662], [3], [3158008565], [3], [3159337219], [12582976], [], [], [], [], [], []

Incident details in: /u01/app/oracle/diag/rdbms/xxx/xxx1/incident/incdir_120288/xxx1_ora_1593_i120288.trc

Use ADRCI or Support Workbench to package the incident.

See Note 411.1 at My Oracle Support for error and packaging details.

Errors in file /u01/app/oracle/diag/rdbms/xxx/xxx1/trace/xxx1_ora_1593.trc:

ORA-00600: internal error code, arguments: [2662], [3], [3158008565], [3], [3159337219], [12582976], [], [], [], [], [], []

Errors in file /u01/app/oracle/diag/rdbms/xxx/xxx1/trace/xxx1_ora_1593.trc:

ORA-00600: internal error code, arguments: [2662], [3], [3158008565], [3], [3159337219], [12582976], [], [], [], [], [], []

Error 600 happened during db open, shutting down database

USER (ospid: 1593): terminating the instance due to error 600

Instance terminated by USER, pid = 1593

ORA-1092 signalled during: alter database open resetlogs...

This is a classic error caused by an SCN inconsistency. The database version is 11.2.0.3.0 with no PSU installed, so the SCN can still be advanced directly.
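To make the SCN problem concrete, the arguments of the ORA-00600 [2662] error above can be decoded with a little arithmetic. The sketch below assumes the commonly documented reading of this error (current SCN wrap and base, then the dependent SCN wrap and base found in a block, then that block's RDBA) and the smallfile RDBA layout of a 10-bit relative file number followed by a 22-bit block number. The variable names are illustrative only.

```shell
#!/bin/sh
# Values copied from the ORA-00600 [2662] line in the alert log above.
# Assumed argument meaning: [a] current SCN wrap, [b] current SCN base,
# [c] dependent SCN wrap, [d] dependent SCN base, [e] RDBA of the block.
cur_wrap=3; cur_base=3158008565
dep_wrap=3; dep_base=3159337219
rdba=12582976

# Full SCN = wrap * 2^32 + base
cur_scn=$(( cur_wrap * 4294967296 + cur_base ))
dep_scn=$(( dep_wrap * 4294967296 + dep_base ))

# How far the block's SCN is ahead of the database's current SCN
gap=$(( dep_scn - cur_scn ))

# RDBA (smallfile tablespace): upper 10 bits = relative file#, lower 22 bits = block#
file_no=$(( rdba >> 22 ))
block_no=$(( rdba & 4194303 ))

echo "current SCN   : $cur_scn"
echo "dependent SCN : $dep_scn"
echo "SCN gap       : $gap"
echo "block         : file $file_no, block $block_no"
```

Under those assumptions, the block at file 3, block 64 carries an SCN roughly 1.3 million ahead of the database's current SCN, which is why the SCN has to be pushed forward before the database can open.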

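I advanced the database SCN directly with the 10015 event. In addition, because the database had gone down abnormally, undo could not be recovered normally either, so I set the undo_management parameter to manual at the same time as setting the 10015 event: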

alter session set events '10015 trace name adjust_scn level 2';

The database opened successfully. Immediately afterwards I rebuilt the undo tablespaces and the temp files, as follows:

create undo tablespace undo1 datafile '+data' size 2048m;

create undo tablespace undo2 datafile '+data' size 2048m;

drop tablespace undotbs1 including contents and datafiles;

drop tablespace undotbs2 including contents and datafiles;

alter tablespace temp add tempfile '+data/rep/tempfile/TEMP.276.816036349' reuse;

alter tablespace temp add tempfile '+data/ods/tempfile/temp1607202' reuse;

alter tablespace temp add tempfile '+data/ods/tempfile/temp1607203' reuse;

After restarting the database again, the alert log still showed some errors, which was unavoidable. They are shown below:

ORA-00604: error occurred at recursive SQL level 1

ORA-08102: index key not found, obj# 290, file 1, block 1029 (2)

ORA-12012: error on auto execute of job 4001

ORA-08102: index key not found, obj# 290, file 1, block 1029 (2)

Errors in file /u01/app/oracle/diag/rdbms/xxx/xxx1/trace/xxx1_j003_8160.trc:

ORA-00604: error occurred at recursive SQL level 1

ORA-08102: index key not found, obj# 290, file 1, block 1029 (2)

ORA-12012: error on auto execute of job 4002

ORA-08102: index key not found, obj# 290, file 1, block 1029 (2)

Errors in file /u01/app/oracle/diag/rdbms/xxx/xxx1/trace/xxx1_ora_8043.trc:

Fri Sep 22 13:30:40 2017

Errors in file /u01/app/oracle/diag/rdbms/xxx/xxx1/trace/xxx1_ora_8043.trc:

Fri Sep 22 13:30:42 2017

Dumping diagnostic data in directory=[cdmp_20170922133042], requested by (instance=1, osid=8043), summary=[abnormal process termination].

Fri Sep 22 13:31:59 2017

Starting background process SMCO

Fri Sep 22 13:31:59 2017

SMCO started with pid=35, OS id=9375

Fri Sep 22 13:37:54 2017

Errors in file /u01/app/oracle/diag/rdbms/xxx/xxx1/trace/xxx1_m000_10623.trc (incident=144379):

ORA-00600: internal error code, arguments: [kdsgrp1], [], [], [], [], [], [], [], [], [], [], []

Incident details in: /u01/app/oracle/diag/rdbms/xxx/xxx1/incident/incdir_144379/xxx1_m000_10623_i144379.trc

Use ADRCI or Support Workbench to package the incident.

See Note 411.1 at My Oracle Support for error and packaging details.

Fri Sep 22 13:37:55 2017

Dumping diagnostic data in directory=[cdmp_20170922133755], requested by (instance=1, osid=10623 (M000)), summary=[incident=144379].

Errors in file /u01/app/oracle/diag/rdbms/xxx/xxx1/trace/xxx1_m000_10623.trc (incident=144380):

ORA-00600: internal error code, arguments: [kewrose_1], [600], [ORA-00600: internal error code, arguments: [kdsgrp1], [], [], [], [], [], [], [], [], [], [], []

], [], [], [], [], [], [], [], [], []

Incident details in: /u01/app/oracle/diag/rdbms/xxx/xxx1/incident/incdir_144380/xxx1_m000_10623_i144380.trc

Use ADRCI or Support Workbench to package the incident.

See Note 411.1 at My Oracle Support for error and packaging details.

Errors in file /u01/app/oracle/diag/rdbms/xxx/xxx1/trace/xxx1_m000_10623.trc:

ORA-00600: internal error code, arguments: [kewrose_1], [600], [ORA-00600: internal error code, arguments: [kdsgrp1], [], [], [], [], [], [], [], [], [], [], []

], [], [], [], [], [], [], [], [], []

Dumping diagnostic data in directory=[cdmp_20170922133757], requested by (instance=1, osid=10623 (M000)), summary=[incident=144380].

Fri Sep 22 13:37:58 2017

Sweep [inc][144380]: completed

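In fact, during the recovery I had already fixed obj# 290 by hand, but further checks showed that obj$, col_usage$, and i_obj4 all had problems, and there were quite a few inconsistent records: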

select /*+ index(t i_obj4) */ DATAOBJ#,type#,owner# from obj$ t

minus

select /*+ full(t1) */ DATAOBJ#,type#,owner# from obj$ t1;

DATAOBJ# TYPE# OWNER#

---------- ---------- ----------

1451154 2 90

1589557 1 92

1589558 2 92

1589573 2 100

1589574 2 100

1589575 2 100

1589576 2 100

1589577 2 100

1589578 2 100

1589579 2 100

1589580 2 100

1589581 2 100

1589582 2 100

1589583 2 100

1589584 2 100

1589585 2 100

1589586 2 100

1589587 2 100

1589588 2 100

1589589 2 100

1589590 2 100

1589591 2 100

1589592 2 100

1589593 2 100

1589594 2 100

1589595 2 100

1589596 2 100

1589597 2 100

1589598 2 100

1589599 2 100

1589600 2 100

1589601 2 100

1589602 2 100

1589603 2 100

1589604 2 100

1589607 2 100

1589612 0 0

select /*+ full(t1) */ DATAOBJ#,type#,owner# from obj$ t1

minus

select /*+ index(t i_obj4) */ DATAOBJ#,type#,owner# from obj$ t;

DATAOBJ# TYPE# OWNER#

---------- ---------- ----------

1587659 2 100

1587660 2 100

1587661 2 100

1587662 2 100

1587663 2 100

1587664 2 100

1587665 2 100

1587666 2 100

1587667 2 100

1587668 2 100

1587669 2 100

1587670 2 100

1587671 2 100

1587672 2 100

1587673 2 100

1587674 2 100

1587675 2 100

1587676 2 100

1587677 2 100

1587678 2 100

1587679 2 100

1587680 2 100

1587681 2 100

1587682 2 100

1587683 2 100

1587684 2 100

1587685 2 100

1587687 2 100

1587688 2 100

1587689 2 100

1587690 2 100

1587691 2 100

1587692 2 100

1587695 1 92

1587696 2 92

1587716 2 90

1587717 2 90

1587718 2 90

1587719 2 90

1587720 2 90

1587721 2 90

1587722 2 90

1587723 2 90

1587724 2 90

1587725 2 90

1587726 2 90

1587727 2 90

1587728 2 90

1587729 2 90

1587730 2 90

1587732 2 90

1587733 2 90

1589527 0 0

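At first I tried repairing two blocks with bbed, but the ORA-08102 error still could not be resolved. The SQL comparison above then revealed just how many records were inconsistent; fixing them one by one would have been far too laborious.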
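The two MINUS queries above are a set difference taken in both directions: rows reachable through the i_obj4 index but not through a full scan, and the reverse. As a rough illustration of the same idea outside the database, the sketch below does the comparison with comm on two small sorted lists. The file names and the shared row '20 1 0' are made up for the example; the other rows are taken from the output above.

```shell
#!/bin/sh
# Use one collation for both sort and comm so the ordering agrees.
LC_ALL=C
export LC_ALL

workdir=$(mktemp -d)

# Hypothetical samples: rows as seen via the index and via a full table scan.
printf '%s\n' '1451154 2 90' '1589557 1 92' '20 1 0' | sort > "$workdir/via_index"
printf '%s\n' '1587659 2 100' '20 1 0' | sort > "$workdir/via_full"

# Rows only the index returns (index MINUS full scan)
index_only=$(comm -23 "$workdir/via_index" "$workdir/via_full")

# Rows only the full scan returns (full scan MINUS index)
table_only=$(comm -13 "$workdir/via_index" "$workdir/via_full")

rm -rf "$workdir"

echo "index only:"; echo "$index_only"
echo "table only:"; echo "$table_only"
```

A consistent table and index would leave both results empty. Here both directions return rows, which is the same symptom the queries against obj$ showed.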

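I had originally considered rebuilding obj$, i_obj4, and col_usage$ to solve this, but I was concerned about the considerable risk, so I recommended rebuilding the database instead. Because obj$ was damaged, expdp failed with errors and no DDL could be executed, so the best option was to rebuild using exp, with the export split across several scripts. The whole recovery plus rebuild took close to 20 hours (for a database of about 2 TB).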

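Because I/O in the customer's storage environment was poor, the rebuild was complicated and time-consuming. We joked that had the database been running on our Zdata environment, the rebuild could have finished within 2 hours, and this kind of failure would not have occurred in the first place, because Zdata writes I/O straight to disk or to PCIe devices, so there is no risk of data loss.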

A few final notes:

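1) Because many transactions could not be recovered normally, SMON kept raising errors as it retried transaction recovery. After a certain number of errors it crashes the instance, which in turn disrupts the rebuild. Setting the _smon_internal_errlimit parameter avoids this problem.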

2) To speed up exp and imp, we used a named pipe. The script is as follows:

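# Create the named pipe (the "p" type argument is required)
mknod /backup/omdata10 p

# Start exp in the background so that it writes into the pipe while imp reads from it
exp \'/ as sysdba\' parfile=omdata.par file=/backup/omdata10 rows=y indexes=n compress=n direct=y recordlength=65536 buffer=52488000 feedback=100000 volsize=0 log=omdata_other.log &

# imp reads from the same pipe
imp \'/ as sysdba\' file=/backup/omdata10 fromuser=omdata touser=omdata buffer=52488000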
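The reason the pipe helps is that imp starts consuming rows while exp is still producing them, and no intermediate dump file is written to disk. A minimal stand-in sketch of that pattern, where seq plays the role of exp and wc -l plays the role of imp (both substitutions, and the temporary path, are assumptions for illustration only):

```shell
#!/bin/sh
# Stand-in demo of the writer/reader pattern over a named pipe.
workdir=$(mktemp -d)
pipe="$workdir/demo.pipe"

# mkfifo has the same effect as: mknod "$pipe" p
mkfifo "$pipe"

# Writer in the background: opening the pipe blocks until a reader attaches.
( seq 1 100000 > "$pipe" ) &
writer_pid=$!

# Reader consumes lines as they are produced; nothing is staged on disk.
rows=$(wc -l < "$pipe" | tr -d ' ')

# Make sure the writer has finished, then clean up.
wait "$writer_pid"
rm -rf "$workdir"

echo "rows transferred: $rows"
```

If the writer were started in the foreground instead, the script would hang at that line, because a process opening a FIFO for writing waits for a reader that never gets a chance to start.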
