Repairing Corrupt Blocks in Oracle

 
Database block corruption can be classified by the object the corrupt block belongs to: user data, data dictionary, undo, control file, redo, LOB, index, and so on. It can also be classified by cause, as physical corruption or logical corruption.

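Physical corruption
Common physical block corruptions include a mismatch between the block header and the block tail (fractured/incomplete blocks), an invalid checksum, and blocks filled entirely with zeros. They are often accompanied by errors ORA-1578 and ORA-1110.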

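To detect physical corruption promptly and pinpoint its cause, Oracle recommends setting the initialization parameter DB_BLOCK_CHECKSUM=TYPICAL (the default). In most cases, physical corruption results from an error or failure in the underlying OS or disk system that alters the data block, after which the block is flagged as corrupt.
An invalid checksum is a common form of physical corruption. With DB_BLOCK_CHECKSUM=TYPICAL (the default), the DBWR process computes a checksum when it writes a data block to disk and records it at offsets 16 and 17 of the block. When the block is read back from disk, Oracle recomputes the checksum and XORs it with the value recorded in the block; a non-zero result means the block has been modified and is corrupt.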
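The write-then-verify cycle described above can be illustrated with a small Python sketch. It assumes a simplified checksum (the XOR of all 16-bit little-endian words in the block, computed with the checksum field zeroed); Oracle's real algorithm is internal and may differ, and the function names are hypothetical.

```python
import struct

CHECKSUM_OFFSET = 16  # per the text, the checksum lives at offsets 16 and 17


def xor16(block: bytes) -> int:
    """XOR-fold a block into 16 bits (simplified stand-in for the real checksum)."""
    result = 0
    for (word,) in struct.iter_unpack("<H", block):
        result ^= word
    return result


def store_checksum(block: bytearray) -> None:
    """Writer side: compute the checksum and record it inside the block."""
    block[CHECKSUM_OFFSET:CHECKSUM_OFFSET + 2] = b"\x00\x00"
    struct.pack_into("<H", block, CHECKSUM_OFFSET, xor16(bytes(block)))


def verify(block: bytes) -> int:
    """Reader side: recompute, then XOR with the recorded value.

    Returns 0 for a clean block and a non-zero value for a modified one.
    """
    (recorded,) = struct.unpack_from("<H", block, CHECKSUM_OFFSET)
    zeroed = bytearray(block)
    zeroed[CHECKSUM_OFFSET:CHECKSUM_OFFSET + 2] = b"\x00\x00"
    return xor16(bytes(zeroed)) ^ recorded


block = bytearray(range(256)) * 32   # 8192 bytes of made-up block content
store_checksum(block)
print(hex(verify(bytes(block))))     # 0x0  (clean block)

block[5000] ^= 0x45                  # simulate a stray write on disk
print(hex(verify(bytes(block))))     # 0x45 (non-zero, so the block is corrupt)
```

The point of the sketch is only the comparison: any change to the block after the checksum was stored makes the XOR of the recorded and recomputed values non-zero.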

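I. Causes of corrupt blocks:

1. Hardware problems

When an Oracle process works on a data block, it first reads the block into physical memory; once processing is complete, a dedicated process writes it back to disk. A memory fault or a CPU miscalculation during this sequence scrambles the block contents in memory, and the damage then shows up in the block written back to disk. Likewise, a fault in the storage subsystem produces corrupt blocks.

2. Operating system bugs

Oracle processes read and write data blocks through operating system kernel calls (system calls). If the operating system's handling of those calls is faulty, Oracle processes can end up writing invalid content.

3. Operating system I/O errors or buffering problems

4. Memory or paging problems

5. Oracle software bugs

Specific versions of the Oracle software may contain bugs that corrupt block contents.

6. Non-Oracle processes disturbing Oracle shared memory

While a data block is held in the host's physical memory, a non-Oracle process that disturbs the shared memory region used by Oracle can scramble the block contents that are eventually written back to disk.

7. Abnormal shutdown, power loss, or service termination

An abnormal shutdown, power loss, or service termination kills processes mid-operation, which breaks the integrity of data blocks and produces corrupt blocks.
Note: this is also why a sudden power loss can leave the database unable to start.
As the list above shows, corrupt blocks have complex causes. Finding the exact cause takes considerable analysis and troubleshooting, and may even require reproducing the problem several times. On a production system, however, emergency workarounds are applied as quickly as possible to reduce downtime and keep the system available; this destroys the evidence at the scene and makes root-cause analysis even harder.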

II. Preventing (detecting) corrupt blocks


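1. For physical corruption caused by Oracle bugs, Oracle flags these bugs as notable issues (marked with * or +) and provides the corresponding patches.
2. Check with RMAN. The following command checks whether a datafile contains corrupt blocks without producing any actual backup output:
RMAN> BACKUP CHECK LOGICAL VALIDATE DATAFILE n;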
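3. Check with the dbv (DBVERIFY) tool:
dbv file=d:\oracle\oradata\mydb\RONLY.DBF blocksize=8192
A table can also be checked from SQL:
ANALYZE TABLE tablename VALIDATE STRUCTURE CASCADE;
ANALYZE checks for corrupt blocks but does not mark them as corrupt; the results are written to a user trace file in the USER_DUMP_DEST directory.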
4. Check with the DBMS_REPAIR package
① Use the file_id and block_id reported in the alert log to find the affected object:
SELECT tablespace_name, segment_type, owner, segment_name
FROM dba_extents
WHERE file_id = &fileid
and &blockid between block_id AND block_id + blocks - 1;

If V$DATABASE_BLOCK_CORRUPTION contains rows, run the following query to find the objects that contain the corrupt blocks:
SELECT e.owner,
e.segment_type,
e.segment_name,
e.partition_name,
c.file#,
greatest(e.block_id, c.block#) corr_start_block#,
least(e.block_id + e.blocks - 1, c.block# + c.blocks - 1) corr_end_block#,
least(e.block_id + e.blocks - 1, c.block# + c.blocks - 1) -
greatest(e.block_id, c.block#) + 1 blocks_corrupted,
null description
FROM dba_extents e, v$database_block_corruption c
WHERE e.file_id = c.file#
AND e.block_id <= c.block# + c.blocks - 1
AND e.block_id + e.blocks - 1 >= c.block#
UNION
SELECT s.owner,
s.segment_type,
s.segment_name,
s.partition_name,
c.file#,
header_block corr_start_block#,
header_block corr_end_block#,
1 blocks_corrupted,
'Segment Header' description
FROM dba_segments s, v$database_block_corruption c
WHERE s.header_file = c.file#
AND s.header_block between c.block# and c.block# + c.blocks - 1
UNION
SELECT null owner,
null segment_type,
null segment_name,
null partition_name,
c.file#,
greatest(f.block_id, c.block#) corr_start_block#,
least(f.block_id + f.blocks - 1, c.block# + c.blocks - 1) corr_end_block#,
least(f.block_id + f.blocks - 1, c.block# + c.blocks - 1) -
greatest(f.block_id, c.block#) + 1 blocks_corrupted,
'Free Block' description
FROM dba_free_space f, v$database_block_corruption c
WHERE f.file_id = c.file#
AND f.block_id <= c.block# + c.blocks - 1
AND f.block_id + f.blocks - 1 >= c.block#
order by file#, corr_start_block#;
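The greatest/least expressions in the query compute the intersection of two block ranges: the blocks of an extent (or free-space chunk) and the blocks of a corruption entry. A minimal Python sketch of the same arithmetic follows; the function name and the sample block numbers are made up for illustration.

```python
from typing import Optional, Tuple


def corrupt_overlap(seg_start: int, seg_blocks: int,
                    corr_start: int, corr_blocks: int) -> Optional[Tuple[int, int, int]]:
    """Return (first_block, last_block, count) of the overlap, or None."""
    seg_end = seg_start + seg_blocks - 1
    corr_end = corr_start + corr_blocks - 1
    # Same test as the WHERE clause: the two ranges must intersect.
    if seg_start > corr_end or seg_end < corr_start:
        return None
    first = max(seg_start, corr_start)   # greatest(e.block_id, c.block#)
    last = min(seg_end, corr_end)        # least(e.block_id + e.blocks - 1, ...)
    return first, last, last - first + 1


# Made-up extent (blocks 128..135) against three made-up corrupt ranges.
print(corrupt_overlap(128, 8, 133, 1))   # (133, 133, 1)
print(corrupt_overlap(128, 8, 134, 4))   # (134, 135, 2)
print(corrupt_overlap(128, 8, 200, 1))   # None
```

The second call shows why the query clips both ends: a corruption entry can run past the end of an extent, and only the part inside the extent belongs to that segment.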

② Choose a tablespace and create the repair table in it:
BEGIN
 DBMS_REPAIR.ADMIN_TABLES (
 TABLE_NAME => 'REPAIR_TABLE',
 TABLE_TYPE => dbms_repair.repair_table,
 ACTION => dbms_repair.create_action,
 TABLESPACE => '&tablespace_name');
END;
/
③ Check the specified <schema>.<object> and record its corrupt blocks (specify PARTITION_NAME as well to check at the partition level):
set serveroutput on
DECLARE num_corrupt INT;
BEGIN
 num_corrupt := 0;
 DBMS_REPAIR.CHECK_OBJECT (
 SCHEMA_NAME => '&schema_name',
 OBJECT_NAME => '&object_name',
 REPAIR_TABLE_NAME => 'REPAIR_TABLE',
 corrupt_count => num_corrupt);
 DBMS_OUTPUT.PUT_LINE('number corrupt: ' || TO_CHAR (num_corrupt));
END;
/

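5. Exporting the whole database with the exp tool can also detect corrupt blocks.
However, exp does not detect the following:
① corrupt blocks above the HWM (high water mark)
② corrupt blocks in indexes
③ corrupt blocks in the data dictionary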


III. Repair methods


1. The database is currently configured with the initialization parameter DB_BLOCK_CHECKSUM=TYPICAL, so checksums are verified when blocks are read from disk:

SQL> show parameter DB_BLOCK_CHECKSUM

NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
db_block_checksum                    string      TYPICAL

 

2. Querying the dept table hits a corrupt block and fails with ORA-1578 and ORA-1110; the corrupt block is file # 4, block # 133

SQL> select * from dept;
 select * from dept
*
ERROR at line 1:
ORA-01578: ORACLE data block corrupted (file # 4, block # 133)
ORA-01110: data file 4: '/u01/app/oracle/oradata/orcl/users01.dbf'

 

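3. Along with the errors above, the alert log contains detailed information showing that the block (file # 4, block # 133) is corrupt because of an invalid checksum. The checksum recorded in the block is 0x8167 (computed the last time DBWR wrote the block to disk), the checksum recomputed when the block was read is 0x8122, and the XOR of the two is 0x45 (computed block checksum). Because the two checksums differ (the XOR result is non-zero), the block has been modified and is corrupt.

Alert log error messages: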

Hex dump of (file 4, block 133) in trace file /u01/app/oracle/diag/rdbms/orcl/orcl/trace/orcl_ora_20892.trc
Corrupt block relative dba: 0x01000085 (file 4, block 133)
Bad check value found during multiblock buffer read  <<<<<<<<<<<<<< the block is corrupt because of an invalid checksum
Data in bad block:
 type: 6 format: 2 rdba: 0x01000085
 last change scn: 0x0000.0023d69a seq: 0x5 flg: 0x06
 spare1: 0x0 spare2: 0x0 spare3: 0x0
 consistency value in tail: 0xd69a0605
 check value in block header: 0x8167   <<<<<<<<<<<<<< checksum recorded in the block is 0x8167
 computed block checksum: 0x45         <<<<<<<<<<<<<< XOR of 0x8167 and 0x8122 is 0x45
Reading datafile '/u01/app/oracle/oradata/orcl/users01.dbf' for corruption at rdba: 0x01000085 (file 4, block 133)
Reread (file 4, block 133) found same corrupt data (no logical check)
Sun Mar 23 22:53:40 2014
Corrupt Block Found
         TSN = 4, TSNAME = USERS
         RFN = 4, BLK = 133, RDBA = 16777349
         OBJN = 14343, OBJD = 14343, OBJECT = DEPT, SUBOBJECT = 
         SEGMENT OWNER = JAMES, SEGMENT TYPE = Table Segment         <<<<<<<<<<<<<< object that contains the corrupt block (see OBJN and OBJECT on the line above)
Errors in file /u01/app/oracle/diag/rdbms/orcl/orcl/trace/orcl_ora_20892.trc  (incident=182595):
ORA-01578: ORACLE data block corrupted (file # 4, block # 133)
ORA-01110: data file 4: '/u01/app/oracle/oradata/orcl/users01.dbf'
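The alert log names the same block three ways: relative dba 0x01000085, RDBA = 16777349, and (file 4, block 133). A short Python sketch of the conversion follows, assuming the usual smallfile-tablespace encoding in which the upper 10 bits of the 32-bit RDBA hold the relative file number and the lower 22 bits hold the block number; the function names are hypothetical.

```python
BLOCK_BITS = 22                      # assumed: low 22 bits hold the block number
BLOCK_MASK = (1 << BLOCK_BITS) - 1


def decode_rdba(rdba: int) -> tuple:
    """Split an RDBA into (relative file number, block number)."""
    return rdba >> BLOCK_BITS, rdba & BLOCK_MASK


def encode_rdba(file_no: int, block_no: int) -> int:
    """Build an RDBA from a relative file number and a block number."""
    return (file_no << BLOCK_BITS) | block_no


print(decode_rdba(0x01000085))       # (4, 133)
print(encode_rdba(4, 133))           # 16777349
```

Both values agree with the RFN, BLK, and RDBA fields printed in the "Corrupt Block Found" entry.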

4.1 The corresponding trace file orcl_ora_20892.trc also contains the block information; the checksum recorded in the block is 0x8167 (chkval)

Block dump from disk:
buffer tsn: 4 rdba: 0x01000085 (4/133)
scn: 0x0000.0023d69a seq: 0x05 flg: 0x06 tail: 0xd69a0605
frmt: 0x02 chkval: 0x8167 type: 0x06=trans data
Hex dump of block: st=0, typ_found=1


4.2 The checksum recorded in the block can also be inspected with dd; offsets 16 and 17 hold the checksum value 0x8167

$ dd if=/u01/app/oracle/oradata/orcl/users01.dbf bs=8192 count=1 skip=133 of=/tmp/dd133.out

$ od -x /tmp/dd133.out
0000000 a206 0000 0085 0100 d69a 0023 0000 0605
0000020 8167 0000 0001 0000 3807 0000 2fef 000c
        ^^^^
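The od output can be cross-checked against the trace file by unpacking the first bytes of the block. The Python sketch below rebuilds the raw bytes from the two od lines and reads them with an assumed 20-byte header layout (type, format, two spare bytes, rdba, SCN base, SCN wrap, seq, flg, chkval, spare3); that layout is inferred from the field order in the trace output and is not documented by Oracle.

```python
import struct

# The two od -x lines shown above. od prints 16-bit words in host byte order
# (little-endian on x86), so each word is packed back with "<H".
OD_WORDS = ("a206 0000 0085 0100 d69a 0023 0000 0605 "
            "8167 0000 0001 0000 3807 0000 2fef 000c").split()
raw = b"".join(struct.pack("<H", int(word, 16)) for word in OD_WORDS)

# Assumed layout of the first 20 bytes of the block header.
NAMES = ("type", "format", "spare1", "spare2", "rdba",
         "scn_base", "scn_wrap", "seq", "flg", "chkval", "spare3")
header = dict(zip(NAMES, struct.unpack_from("<BBBBIIHBBHH", raw, 0)))

print(hex(header["rdba"]))                    # 0x1000085
print(hex(header["scn_base"]), header["seq"], header["flg"])  # 0x23d69a 5 6
print(hex(header["chkval"]))                  # 0x8167

# The checksum field read directly at offset 16, as the text describes.
print(hex(struct.unpack_from("<H", raw, 16)[0]))  # 0x8167
```

The rdba, SCN, seq, flg, and chkval values match the "Block dump from disk" section of the trace file, which supports reading offsets 16 and 17 as the stored checksum.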

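Solution:

5. A corrupt data block can be repaired by restoring from backup, or skipped with DBMS_REPAIR.SKIP_CORRUPT_BLOCKS.
 Method 1: RMAN block recovery
A recent RMAN backup set must exist. Then run the following commands (BLOCKRECOVER is the older syntax; in 11g and later the equivalent is RECOVER DATAFILE 4 BLOCK 133):

RMAN> backup validate datafile 4;

RMAN> run {blockrecover datafile 4 block 133;}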

SQL> select * from dept;

    DEPTNO DNAME          LOC
---------- -------------- -------------
        10 ACCOUNTING     DALIAN
        20 RESEARCH       DALLAS
        30 SALES          CHICAGO
        40 OPERATIONS     BOSTON

 

Method 2: modify the block header with bbed


Logical corruption:


Errors in the alert log:

Reading datafile '/oradata/datafiles/oadb/oa01.dbf' for corruption at rdba: 0x016d4dd5 (file 5, block 2969045)
Reread (file 5, block 2969045) found same corrupt data (no logical check)
Tue Aug 18 10:53:51 2015
Corrupt Block Found
        TSN = 6, TSNAME = OA
        RFN = 5, BLK = 2969045, RDBA = 23940565
        OBJN = 95690, OBJD = 95690, OBJECT = EDOC_BASE_WORKFLOW, SUBOBJECT = 
        SEGMENT OWNER = INSPUROA, SEGMENT TYPE = Table Segment
Tue Aug 18 10:55:03 2015
Hex dump of (file 5, block 2969045) in trace file /u01/app/oracle/diag/rdbms/oadb/oadb/trace/oadb_ora_4565.trc
Corrupt block relative dba: 0x016d4dd5 (file 5, block 2969045)
Bad header found during buffer read
Data in bad block:
 type: 117 format: 0 rdba: 0x20206b73
 last change scn: 0x2020.20202020 seq: 0x20 flg: 0x20
 spare1: 0x64 spare2: 0x69 spare3: 0x0
 consistency value in tail: 0x4d240601
 check value in block header: 0x5f49
 block checksum disabled
Reading datafile '/oradata/datafiles/oadb/oa01.dbf' for corruption at rdba: 0x016d4dd5 (file 5, block 2969045)
Reread (file 5, block 2969045) found same corrupt data (no logical check)
Tue Aug 18 10:55:03 2015
Corrupt Block Found
        TSN = 6, TSNAME = OA
        RFN = 5, BLK = 2969045, RDBA = 23940565
        OBJN = 95690, OBJD = 95690, OBJECT = EDOC_BASE_WORKFLOW, SUBOBJECT = 
        SEGMENT OWNER = INSPUROA, SEGMENT TYPE = Table Segment
Tue Aug 18 10:57:29 2015
Hex dump of (file 5, block 2969045) in trace file /u01/app/oracle/diag/rdbms/oadb/oadb/trace/oadb_ora_21708.trc
Corrupt block relative dba: 0x016d4dd5 (file 5, block 2969045)
Bad header found during buffer read
Data in bad block:
 type: 117 format: 0 rdba: 0x20206b73
 last change scn: 0x2020.20202020 seq: 0x20 flg: 0x20
 spare1: 0x64 spare2: 0x69 spare3: 0x0
 consistency value in tail: 0x4d240601
 check value in block header: 0x5f49
 block checksum disabled
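One way to read the "Bad header" dump above: the rdba stored in the block (0x20206b73) is not the address Oracle was reading (0x016d4dd5), and the logged header fields all happen to be printable ASCII. The Python sketch below packs the logged values back into little-endian bytes, assuming spare1 and spare2 precede the rdba on disk, and prints them as text. Treat the result as a hint that the block may have been overwritten with non-Oracle data; the log itself does not say so.

```python
import struct

EXPECTED_RDBA = 0x016D4DD5   # the block Oracle asked for (file 5, block 2969045)
FOUND_RDBA = 0x20206B73      # the rdba actually stored in the block header

print(FOUND_RDBA == EXPECTED_RDBA)        # False


def as_text(value: int, fmt: str) -> str:
    """Pack a logged field value into bytes and decode it as ASCII."""
    return struct.pack(fmt, value).decode("ascii")


header_text = (as_text(0x64, "<B")        # spare1
               + as_text(0x69, "<B")      # spare2
               + as_text(FOUND_RDBA, "<I"))
print(repr(header_text))                  # 'disk  '
print(repr(as_text(117, "<B")))           # 'u'  (the logged block type)
print(repr(as_text(0x20202020, "<I")))    # '    '  (SCN base is all spaces)
```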

Performing the repair, based on the error messages

Method 1

Reading datafile '/oradata/datafiles/oadb/oa01.dbf' for corruption at rdba: 0x016d4dd5 (file 5, block 2969045)
Reread (file 5, block 2969045) found same corrupt data (no logical check)
Corrupt Block Found
        TSN = 6, TSNAME = OA
        RFN = 5, BLK = 2969045, RDBA = 23940565
        OBJN = 95690, OBJD = 95690, OBJECT = EDOC_BASE_WORKFLOW, SUBOBJECT = 
        SEGMENT OWNER = INSPUROA, SEGMENT TYPE = Table Segment

This confirms that datafile 5 (oa01.dbf) contains a corrupt block.
1. View the corrupt block information:

SQL> select * from v$database_block_corruption;

    FILE#    BLOCK#    BLOCKS CORRUPTION_CHANGE# CORRUPTIO
---------- ---------- ---------- ------------------ ---------
        5    2969045          1                  0 CORRUPT

The corrupt block is block 2969045. Check the backup logs (incremental and full) to confirm that the backups are complete.
2. Validate datafile 5:

RMAN> backup validate datafile 5;

Starting backup at 18-AUG-15
using target database control file instead of recovery catalog
allocated channel: ORA_DISK_1
channel ORA_DISK_1: SID=982 device type=DISK
channel ORA_DISK_1: starting full datafile backup set
channel ORA_DISK_1: specifying datafile(s) in backup set
input datafile file number=00005 name=/oradata/datafiles/oadb/oa01.dbf
channel ORA_DISK_1: backup set complete, elapsed time: 00:05:35
List of Datafiles
=================
File Status Marked Corrupt Empty Blocks Blocks Examined High SCN
---- ------ -------------- ------------ --------------- ----------
5    FAILED 0              1840        4190720        9484751217293
  File Name: /oradata/datafiles/oadb/oa01.dbf
  Block Type Blocks Failing Blocks Processed
  ---------- -------------- ----------------
  Data      0              2842014        
  Index      0              182983          
  Other      1              1163883       

validate found one or more corrupt blocks
See trace file /u01/app/oracle/diag/rdbms/oadb/oadb/trace/oadb_ora_13513.trc for details
Finished backup at 18-AUG-15
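The validate summary can be sanity-checked with simple arithmetic. In this output, the "Blocks Processed" counts plus "Empty Blocks" add up to "Blocks Examined", and the single failing block (type Other) lines up with the one row in v$database_block_corruption. A small Python sketch using only the numbers printed above:

```python
# Figures copied from the RMAN validate output for datafile 5.
processed = {"Data": 2842014, "Index": 182983, "Other": 1163883}
failing = {"Data": 0, "Index": 0, "Other": 1}
empty_blocks = 1840
blocks_examined = 4190720

print(sum(processed.values()) + empty_blocks == blocks_examined)  # True
print(sum(failing.values()))                                      # 1
```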


3. Repair with RMAN:
RMAN> blockrecover datafile 5 block 2969045;
4. Query the corrupt block information again:
SQL> select * from v$database_block_corruption;

no rows selected

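Method 2: skip the corrupt block with DBMS_REPAIR.SKIP_CORRUPT_BLOCKS, then export the remaining data in the dept table and rebuild the table
Set the multiblock read count to 1:
alter session set db_file_multiblock_read_count=1;

Optionally mark the corrupt blocks found by the check: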
select BLOCK_ID, CORRUPT_TYPE, CORRUPT_DESCRIPTION
from REPAIR_TABLE;
 
REM Mark the identified blocks as corrupted ( Soft Corrupt - reference Note 1496934.1 )
DECLARE num_fix INT;
BEGIN
 num_fix := 0;
 DBMS_REPAIR.FIX_CORRUPT_BLOCKS (
 SCHEMA_NAME => '&schema_name',
 OBJECT_NAME=> '&object_name',
 OBJECT_TYPE => dbms_repair.table_object,
 REPAIR_TABLE_NAME => 'REPAIR_TABLE',
 FIX_COUNT=> num_fix);
 DBMS_OUTPUT.PUT_LINE('num fix: ' || to_char(num_fix));
END;
/
Skip the corrupt blocks during future DML operations:
BEGIN
 DBMS_REPAIR.SKIP_CORRUPT_BLOCKS (
 SCHEMA_NAME => '&schema_name',
 OBJECT_NAME => '&object_name',
 OBJECT_TYPE => dbms_repair.table_object,
 FLAGS => dbms_repair.SKIP_FLAG);
END;
/
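Notes:
After DBMS_REPAIR has been used on corrupt blocks, index scans may raise errors; if so, rebuild the affected indexes. With a unique index, re-inserting the same data may raise ORA-1.
If dbms_repair.SKIP_FLAG is already enabled and you want to clear the skip setting so that the corrupt blocks are accessed again, call DBMS_REPAIR.SKIP_CORRUPT_BLOCKS with dbms_repair.NOSKIP_FLAG.
Skipping blocks with DBMS_REPAIR.SKIP_CORRUPT_BLOCKS alone only works for corrupt blocks that raise ORA-1578. For other types of corruption, you must also run ADMIN_TABLES, CHECK_OBJECT, and FIX_CORRUPT_BLOCKS to mark the blocks as corrupt.
After running SKIP_CORRUPT_BLOCKS, you can clean the corrupt blocks out of the table with "alter table <name> MOVE" instead of rebuilding or truncating it, and then remove the skip setting with dbms_repair.NOSKIP_FLAG. Note that the data in the corrupt blocks is lost.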
 
SQL> alter session set db_file_multiblock_read_count=1;
 
SQL> create table dept_new as select * from dept;
