Disk group offline and multipath analysis

Well, another database went down.

Check the cluster resources:

[grid@jcsjsjk02 ~]$ crsctl stat res -t
--------------------------------------------------------------------------------
NAME           TARGET  STATE        SERVER                   STATE_DETAILS       
--------------------------------------------------------------------------------
Local Resources
--------------------------------------------------------------------------------
ora.ARCHDG.dg
               ONLINE  ONLINE       lzl01                                    
               OFFLINE OFFLINE      lzl02       
                             
ora.DATA2.dg
               ONLINE  ONLINE       lzl01                                    
               ONLINE  ONLINE       lzl02                                    
ora.DATADG.dg
               ONLINE  ONLINE       lzl01                                    
               ONLINE  ONLINE       lzl02                                    
ora.LISTENER.lsnr
               ONLINE  ONLINE       lzl01                                    
               ONLINE  ONLINE       lzl02                                    
ora.OCRDG.dg
               ONLINE  ONLINE       lzl01                                    
               ONLINE  ONLINE       lzl02                                    
ora.asm
               ONLINE  ONLINE       lzl01                Started             
               ONLINE  ONLINE       lzl02                Started             
ora.gsd
               OFFLINE OFFLINE      lzl01                                    
               OFFLINE OFFLINE      lzl02                                    
ora.net1.network
               ONLINE  ONLINE       lzl01                                    
               ONLINE  ONLINE       lzl02                                    
ora.ons
               ONLINE  ONLINE       lzl01                                    
               ONLINE  ONLINE       lzl02                                    
ora.registry.acfs
               ONLINE  ONLINE       lzl01                                    
               ONLINE  ONLINE       lzl02                                    
--------------------------------------------------------------------------------
Cluster Resources
--------------------------------------------------------------------------------
ora.LISTENER_SCAN1.lsnr
      1        ONLINE  ONLINE       lzl01                                    
ora.cvu
      1        ONLINE  ONLINE       lzl01                                    
ora.epmjc.db
      1        OFFLINE OFFLINE                               Corrupted Controlfi 
                                                             le                  
      2        OFFLINE OFFLINE                               Corrupted Controlfi 
                                                             le     
             
ora.lzl01.vip
      1        ONLINE  ONLINE       lzl01                                    
ora.lzl02.vip
      1        ONLINE  ONLINE       lzl02                                    
ora.oc4j
      1        ONLINE  ONLINE       lzl01                                    
ora.scan1.vip
      1        ONLINE  ONLINE       lzl01 
      
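The cluster resources show problems with both the database and a disk group: ora.ARCHDG.dg is OFFLINE on lzl02, and both instances of ora.epmjc.db are OFFLINE with the detail "Corrupted Controlfile".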

Check the output of multipath -ll:
 | `- 7:0:2:5   sdg   8:96     active ready running
|-+- policy='round-robin 0' prio=1 status=enabled
| `- 8:0:2:5   sdpg  130:352  active ready running
|-+- policy='round-robin 0' prio=1 status=enabled
| `- 0:0:2:5   sdaay 133:608  active ready running
`-+- policy='round-robin 0' prio=1 status=enabled
  `- 1:0:2:5   sdahy 128:960  active ready running
mpathf (36005076305ffd7840000000000002107) dm-9 IBM,2107900
size=700G features='1 queue_if_no_path' hwhandler='0' wp=rw
`-+- policy='round-robin 0' prio=1 status=active
  |- 7:0:3:8   sdp   8:240    active ready running
  |- 7:0:4:8   sdcz  70:112   active ready running
  |- 8:0:0:8   sdip  135:144  active ready running
  |- 8:0:1:8   sdlz  69:272   active ready running
  |- 0:0:1:8   sdsb  134:496  active ready running
  |- 0:0:0:8   sdxr  128:528  active ready running
  |- 1:0:0:8   sdabh 133:752  active ready running
  `- 1:0:1:8   sdaer 67:880   active ready running
mpathda (36005076305ffd7840000000000002054) dm-30 IBM,2107900
size=700G features='1 queue_if_no_path' hwhandler='0' wp=rw
`-+- policy='round-robin 0' prio=1 status=active
  |- 7:0:3:17  sdy   65:128   active ready running
  |- 7:0:4:17  sddi  71:0     active ready running
  |- 8:0:0:17  sdiy  8:288    active ready running
  `- 8:0:1:17  sdmi  69:416   active ready running
mpathfk (3600b342506bf3ebdb842d43fcd0000d9) dm-134 MacroSAN,LU
size=1.1T features='0' hwhandler='0' wp=rw
|-+- policy='round-robin 0' prio=1 status=active
| `- 7:0:5:25  sdha  133:0    active ready running
The multipath layout looks wrong: some LUNs have 8 paths (mpathf) while others have only 4 (mpathda).
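Spotting uneven path counts by eye is tedious with more than a hundred LUNs. The following is a minimal sketch, assuming the `multipath -ll` output has been saved as text in the same format as the excerpt above; the function names `count_paths` and `group_by_path_count` are hypothetical helpers, not part of any multipath tooling.

```python
import re

# Header line of a multipath map, e.g.
# "mpathf (36005076305ffd7840000000000002107) dm-9 IBM,2107900"
MAP_RE = re.compile(r"^(\S+) \(([0-9a-f]+)\) dm-\d+")
# Path line, e.g. "  |- 7:0:3:8   sdp   8:240    active ready running"
PATH_RE = re.compile(r"\d+:\d+:\d+:\d+\s+(sd[a-z]+)\s+\d+:\d+\s")


def count_paths(multipath_ll_output):
    """Return {map_name: number_of_paths} parsed from `multipath -ll` text."""
    counts = {}
    current = None
    for line in multipath_ll_output.splitlines():
        header = MAP_RE.match(line)
        if header:
            current = header.group(1)
            counts[current] = 0
        elif current is not None and PATH_RE.search(line):
            # Path lines seen before the first map header are ignored.
            counts[current] += 1
    return counts


def group_by_path_count(counts):
    """Return {path_count: [map names]} so odd ones out are easy to see."""
    groups = {}
    for name, n in counts.items():
        groups.setdefault(n, []).append(name)
    return groups
```

Feeding it the saved output from each node and printing `group_by_path_count(...)` gives one line per distinct path count, which makes maps with fewer paths than their neighbours stand out.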

 

Check the disk group status:
      SQL> select GROUP_NUMBER,DISK_NUMBER,MOUNT_STATUS,HEADER_STATUS,PATH from v$asm_disk order by 1,2;
select GROUP_NUMBER,DISK_NUMBER,MOUNT_STATUS,HEADER_STATUS,PATH from v$asm_disk order by 1,2
                                                                     *
ERROR at line 1:
ORA-00600: internal error code, arguments: [kfdskResolveDisk1], [], [], [], [],
[], [], [], [], [], [], []

Querying v$asm_disk fails outright with ORA-00600 [kfdskResolveDisk1].

 

Check the system messages log:

ACT EXTSCYD starting.
Jul  8 15:53:05 lzl02 Oracle GoldenGate Capture for Oracle[185015]: 2019-07-08 15:53:05  ERROR   OGG-00664  Oracle GoldenGate Capture for Oracle, extscya.prm:  OCI Error beginning session (status = 1034-ORA-01034: ORACLE not available#012ORA-27101: shared memory realm does not exist#012Linux-x86_64 Error: 2: No such file or directory).

Check the database alert log:
Mon Jul 08 15:42:41 2019
Errors in file /oracle/app/oracle/diag/rdbms/lzldb/lzldb2/trace/lzldb2_ora_97643.trc:
ORA-00317: file type 47077 in header is not log file
ORA-00334: archived log: '+DATADG/lzldb/onlinelog/group_3_1'
Errors in file /oracle/app/oracle/diag/rdbms/lzldb/lzldb2/trace/lzldb2_ora_28387.trc  (incident=797121):
ORA-00600: internal error code, arguments: [2618], [8], [], [], [], [], [], [], [], [], [], []
Incident details in: /oracle/app/oracle/diag/rdbms/lzldb/lzldb2/incident/incdir_797121/lzldb2_ora_28387_i797121.trc
Errors in file /oracle/app/oracle/diag/rdbms/lzldb/lzldb2/trace/lzldb2_ora_97643.trc  (incident=796969):
ORA-00600: internal error code, arguments: [2618], [8], [], [], [], [], [], [], [], [], [], []
Mon Jul 08 15:42:41 2019
Hex dump of (file 32, block 3750338) in trace file /oracle/app/oracle/diag/rdbms/lzldb/lzldb2/trace/lzldb2_ora_112234.trc
Incident details in: /oracle/app/oracle/diag/rdbms/lzldb/lzldb2/incident/incdir_796969/lzldb2_ora_97643_i796969.trc
Corrupt block relative dba: 0x083939c2 (file 32, block 3750338)
Bad header found during buffer read
Data in bad block:
 type: 6 format: 2 rdba: 0x05682d42
 last change scn: 0x0cf2.e932bbbc seq: 0x1 flg: 0x06
 spare1: 0x0 spare2: 0x0 spare3: 0x0
 consistency value in tail: 0xbbbc0601
 check value in block header: 0x1691
 computed block checksum: 0x0
Reading datafile '+DATADG/lzldb/datafile/cachedata03.dbf' for corruption at rdba: 0x083939c2 (file 32, block 3750338)
Reread (file 32, block 3750338) found same corrupt data (no logical check)
Errors in file /oracle/app/oracle/diag/rdbms/lzldb/lzldb2/trace/lzldb2_ora_112234.trc  (incident=798393):
ORA-01578: ORACLE data block corrupted (file # 32, block # 3750338)
ORA-01110: data file 32: '+DATADG/lzldb/datafile/cachedata03.dbf'
Incident details in: /oracle/app/oracle/diag/rdbms/lzldb/lzldb2/incident/incdir_798393/lzldb2_ora_112234_i798393.trc
Mon Jul 08 15:42:41 2019
Hex dump of (file 1, block 118639) in trace file /oracle/app/oracle/diag/rdbms/lzldb/lzldb2/trace/lzldb2_smon_80667.trc
Corrupt block relative dba: 0x0041cf6f (file 1, block 118639)
Bad header found during buffer read
Data in bad block:
 type: 6 format: 2 rdba: 0x006e106f
 last change scn: 0x0e6a.1dbc963d seq: 0x1 flg: 0x04
 spare1: 0x0 spare2: 0x0 spare3: 0x0
 consistency value in tail: 0x963d0601
 check value in block header: 0x19d0
 computed block checksum: 0x0
Reading datafile '+DATADG/lzldb/datafile/system.259.864741115' for corruption at rdba: 0x0041cf6f (file 1, block 118639)
Reread (file 1, block 118639) found same corrupt data (no logical check)
Hex dump of (file 0, block 1) in trace file /oracle/app/oracle/diag/rdbms/lzldb/lzldb2/trace/lzldb2_smon_80667.trc
Corrupt block relative dba: 0x00000001 (file 0, block 1)
Bad header found during control file header read
Data in bad block:
 type: 1 format: 2 rdba: 0x0014085f
 last change scn: 0x809c.001d382a seq: 0xfb flg: 0xd4
 spare1: 0x0 spare2: 0x0 spare3: 0x60
 consistency value in tail: 0x001234c1
 check value in block header: 0x2
 computed block checksum: 0x0
Mon Jul 08 15:42:42 2019
Incomplete read from log member '+DATADG/lzldb/onlinelog/group_3_1'. Trying next member.
Mon Jul 08 15:42:42 2019
Incomplete read from log member '+DATADG/lzldb/onlinelog/group_3_1'. Trying next member.
Mon Jul 08 15:42:42 2019
Incomplete read from log member '+DATADG/lzldb/onlinelog/group_3_1'. Trying next member.

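The alert log is full of disk and I/O errors: corrupt data blocks, a bad control file header, and incomplete reads from an online log member.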
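To get a quick overview of which errors dominate a long alert log, the ORA codes can be tallied. This is a minimal sketch, assuming the alert log text has been read into a string; `tally_ora_codes` is a hypothetical helper name.

```python
import re
from collections import Counter

# Oracle error codes look like "ORA-" followed by five digits.
ORA_RE = re.compile(r"\bORA-(\d{5})\b")


def tally_ora_codes(alert_log_text):
    """Count how often each ORA-nnnnn code appears in the given log text."""
    return Counter("ORA-" + code for code in ORA_RE.findall(alert_log_text))
```

Calling `tally_ora_codes(text).most_common()` lists the codes from most to least frequent, which helps separate the primary symptom from follow-on errors.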

ASM alert log:
Mon Jul 08 15:44:33 2019
NOTE: ASM client lzldb2:lzldb disconnected unexpectedly.
NOTE: check client alert log.
NOTE: Trace records dumped in trace file /oracle/app/grid/diag/asm/+asm/+ASM2/trace/+ASM2_ora_80677.trc
Mon Jul 08 15:45:34 2019
WARNING: cache read  a corrupt block: group=1(ARCHDG) fn=1 blk=7 disk=0 (ASMDISK1) incarn=3916141130 au=2 blk=7 count=1
Errors in file /oracle/app/grid/diag/asm/+asm/+ASM2/trace/+ASM2_pz99_149004.trc:
ORA-15196: invalid ASM block header [kfc.c:26368] [endian_kfbh] [1] [7] [0 != 1]
NOTE: a corrupted block from group ARCHDG was dumped to /oracle/app/grid/diag/asm/+asm/+ASM2/trace/+ASM2_pz99_149004.trc
WARNING: cache read (retry) a corrupt block: group=1(ARCHDG) fn=1 blk=7 disk=0 (ASMDISK1) incarn=3916141130 au=2 blk=7 count=1
Errors in file /oracle/app/grid/diag/asm/+asm/+ASM2/trace/+ASM2_pz99_149004.trc:
ORA-15196: invalid ASM block header [kfc.c:26368] [endian_kfbh] [1] [7] [0 != 1]
ORA-15196: invalid ASM block header [kfc.c:26368] [endian_kfbh] [1] [7] [0 != 1]
ERROR: cache failed to read group=1(ARCHDG) fn=1 blk=7 from disk(s): 0(ASMDISK1)
ORA-15196: invalid ASM block header [kfc.c:26368] [endian_kfbh] [1] [7] [0 != 1]
ORA-15196: invalid ASM block header [kfc.c:26368] [endian_kfbh] [1] [7] [0 != 1]
NOTE: cache initiating offline of disk 0 group ARCHDG
NOTE: process _pz99_+asm2 (149004) initiating offline of disk 0.3916141130 (ASMDISK1) with mask 0x7e in group 1
NOTE: initiating PST update: grp = 1, dsk = 0/0xe96b924a, mask = 0x6a, op = clear
Mon Jul 08 15:45:34 2019
GMON updating disk modes for group 1 at 13 for pid 61, osid 149004
ERROR: Disk 0 cannot be offlined, since diskgroup has external redundancy.
ERROR: too many offline disks in PST (grp 1)
Mon Jul 08 15:45:34 2019
NOTE: cache dismounting (not clean) group 1/0xE52B6129 (ARCHDG) 
NOTE: messaging CKPT to quiesce pins Unix process pid: 174610, image: oracle@lzl02 (B000)
Mon Jul 08 15:45:34 2019
NOTE: halting all I/Os to diskgroup 1 (ARCHDG)
Mon Jul 08 15:45:34 2019
NOTE: LGWR doing non-clean dismount of group 1 (ARCHDG)
NOTE: LGWR sync ABA=21221.7871 last written ABA 21221.7871
WARNING: Offline for disk ASMDISK1 in mode 0x7f failed.
Errors in file /oracle/app/grid/diag/asm/+asm/+ASM2/trace/+ASM2_pz99_149004.trc  (incident=2215850):
ORA-15335: ASM metadata corruption detected in disk group 'ARCHDG'
ORA-15130: diskgroup "ARCHDG" is being dismounted
ORA-15066: offlining disk "ASMDISK1" in group "ARCHDG" may result in a data loss
ORA-15196: invalid ASM block header [kfc.c:26368] [endian_kfbh] [1] [7] [0 != 1]
ORA-15196: invalid ASM block header [kfc.c:26368] [endian_kfbh] [1] [7] [0 != 1]
Incident details in: /oracle/app/grid/diag/asm/+asm/+ASM2/incident/incdir_2215850/+ASM2_pz99_149004_i2215850.trc
Mon Jul 08 15:45:34 2019
kjbdomdet send to inst 1
detach from dom 1, sending detach message to inst 1
Mon Jul 08 15:45:34 2019
List of instances:
 1 2
Dirty detach reconfiguration started (new ddet inc 1, cluster inc 20)
 Global Resource Directory partially frozen for dirty detach
* dirty detach - domain 1 invalid = TRUE 
 3514 GCS resources traversed, 0 cancelled
Dirty Detach Reconfiguration complete
Mon Jul 08 15:45:35 2019
WARNING: dirty detached from domain 1
NOTE: cache dismounted group 1/0xE52B6129 (ARCHDG) 
SQL> alter diskgroup ARCHDG dismount force /* ASM SERVER:3844825385 */ 
Mon Jul 08 15:45:35 2019
Dumping diagnostic data in directory=[cdmp_20190708154535], requested by (instance=2, osid=149004 (PZ99)), summary=[incident=2215850].
Mon Jul 08 15:45:35 2019
Sweep [inc][2215850]: completed
System State dumped to trace file /oracle/app/grid/diag/asm/+asm/+ASM2/incident/incdir_2215850/+ASM2_pz99_149004_i2215850.trc
Mon Jul 08 15:45:35 2019
NOTE: cache deleting context for group ARCHDG 1/0xe52b6129
Errors in file /oracle/app/grid/diag/asm/+asm/+ASM2/trace/+ASM2_pz99_149004.trc  (incident=2215851):
ORA-15335: ASM metadata corruption detected in disk group 'ARCHDG'
ORA-15130: diskgroup "ARCHDG" is being dismounted
ORA-15066: offlining disk "ASMDISK1" in group "ARCHDG" may result in a data loss
ORA-15196: invalid ASM block header [kfc.c:26368] [endian_kfbh] [1] [7] [0 != 1]
ORA-15196: invalid ASM block header [kfc.c:26368] [endian_kfbh] [1] [7] [0 != 1]
Incident details in: /oracle/app/grid/diag/asm/+asm/+ASM2/incident/incdir_2215851/+ASM2_pz99_149004_i2215851.trc
GMON dismounting group 1 at 14 for pid 27, osid 174610
NOTE: Disk ASMDISK1 in mode 0x7f marked for de-assignment
NOTE: Disk ASMDISK2 in mode 0x7f marked for de-assignment
NOTE: Disk ASMDISK18 in mode 0x7f marked for de-assignment
NOTE: Disk ASMDISK19 in mode 0x7f marked for de-assignment
NOTE: Disk ASMDISK51 in mode 0x7f marked for de-assignment
NOTE: Disk ASMDISK50 in mode 0x7f marked for de-assignment
NOTE: Disk ASMDISK100 in mode 0x7f marked for de-assignment
NOTE: Disk ASMDISK101 in mode 0x7f marked for de-assignment
NOTE: Disk ASMDISK102 in mode 0x7f marked for de-assignment
NOTE: Disk ASMMACDISK26 in mode 0x7f marked for de-assignment
NOTE: Disk ASMMACDISK27 in mode 0x7f marked for de-assignment
SUCCESS: diskgroup ARCHDG was dismounted
SUCCESS: alter diskgroup ARCHDG dismount force /* ASM SERVER:3844825385 */
Mon Jul 08 15:45:36 2019
NOTE: diskgroup resource ora.ARCHDG.dg is offline
SUCCESS: ASM-initiated MANDATORY DISMOUNT of group ARCHDG
Mon Jul 08 15:45:36 2019
Sweep [inc][2215851]: completed
Mon Jul 08 15:45:36 2019
Sweep [inc2][2215850]: completed
Errors in file /oracle/app/grid/diag/asm/+asm/+ASM2/trace/+ASM2_pz99_149004.trc  (incident=2215852):
ORA-15335: ASM metadata corruption detected in disk group ''
Incident details in: /oracle/app/grid/diag/asm/+asm/+ASM2/incident/incdir_2215852/+ASM2_pz99_149004_i2215852.trc
Dumping diagnostic data in directory=[cdmp_20190708154536], requested by (instance=2, osid=149004 (PZ99)), summary=[incident=2215851].
Mon Jul 08 15:45:37 2019
Sweep [inc][2215852]: completed
Dumping diagnostic data in directory=[cdmp_20190708154537], requested by (instance=2, osid=149004 (PZ99)), summary=[incident=2215852].
Dumping diagnostic data in directory=[cdmp_20190708154538], requested by (instance=1, osid=66351), summary=[incident=2215833].
Mon Jul 08 15:46:36 2019
Sweep [inc2][2215852]: completed
Sweep [inc2][2215851]: completed
Mon Jul 08 15:48:52 2019
NOTE: AMDU dump of disk group ARCHDG created at /oracle/app/grid/diag/asm/+asm/+ASM2/incident/incdir_2215850
Mon Jul 08 16:14:23 2019
Errors in file /oracle/app/grid/diag/asm/+asm/+ASM2/trace/+ASM2_ora_211166.trc  (incident=2215578):
ORA-00600: internal error code, arguments: [kfdskResolveDisk1], [], [], [], [], [], [], [], [], [], [], []
Incident details in: /oracle/app/grid/diag/asm/+asm/+ASM2/incident/incdir_2215578/+ASM2_ora_211166_i2215578.trc
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
Mon Jul 08 16:14:24 2019
Dumping diagnostic data in directory=[cdmp_20190708161424], requested by (instance=2, osid=211166), summary=[incident=2215578].
Mon Jul 08 16:14:25 2019
Sweep [inc][2215578]: completed
Sweep [inc2][2215578]: completed

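ASM reports metadata corruption in ARCHDG (ORA-15196, ORA-15335). Because the disk group uses external redundancy, disk 0 cannot be offlined on its own, so ASM force-dismounts the whole ARCHDG disk group.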

We suspected a problem with the storage paths, so the host engineer helped repair them.

Then check the number of multipath devices on both nodes:

[root@lzl01 fc_host]# multipath -ll|grep mpath|wc -l
129

[root@lzl02 fc_host]# multipath -ll|grep mpath|wc -l
129

The device counts match, but starting the database still fails and the disk group on node 2 is still offline.


Comparing the multipath bindings on the two nodes shows that a few device names do not map to the same WWID:

Node 1:
mpathaw (36005076305ffd7840000000000002030) IBM,2107900
mpathb (36005076305ffd7840000000000002004) IBM,2107900

Node 2:

mpathaw (36005076305ffd7840000000000002030) IBM,2107900
mpathb (36005076305ffd7840000000000002006) IBM,2107900

mpathb maps to a different WWID on each node, yet the entries in /etc/multipath.conf are identical on both, which is odd.
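Matching device counts (129 on each node) do not prove that each name points at the same LUN. The following is a minimal sketch for comparing name-to-WWID bindings between nodes, assuming each node's `multipath -ll` header lines are available as text in the format shown above; `parse_bindings` and `diff_bindings` are hypothetical helper names.

```python
import re

# Matches lines such as
# "mpathb (36005076305ffd7840000000000002004) IBM,2107900"
BINDING_RE = re.compile(r"^(mpath\S+)\s+\(([0-9a-f]+)\)")


def parse_bindings(text):
    """Return {map_name: wwid} for every map header line in the text."""
    bindings = {}
    for line in text.splitlines():
        match = BINDING_RE.match(line.strip())
        if match:
            bindings[match.group(1)] = match.group(2)
    return bindings


def diff_bindings(node1, node2):
    """Return {map_name: (wwid_on_node1, wwid_on_node2)} where they differ.

    A name missing on one node is reported with None on that side.
    """
    diffs = {}
    for name in sorted(set(node1) | set(node2)):
        wwid1, wwid2 = node1.get(name), node2.get(name)
        if wwid1 != wwid2:
            diffs[name] = (wwid1, wwid2)
    return diffs
```

For the two listings above, `diff_bindings` reports only mpathb, with the WWID ending in 2004 on node 1 and 2006 on node 2.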

Why did the multipath configuration not take effect on node 2?

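So we deleted the multipath configuration, reset the multipath software, and restarted multipath. After that the bindings finally matched.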

Rescan the disks with ASMLib:

[root@lzl02 fc_host]# oracleasm scandisks

Remount the disk group on node 2:

[grid@lzl02 fc_host]# srvctl start diskgroup -g archdg

Start the database:

[oracle@lzl02 fc_host]# srvctl start database -d lzl

ok!
