Master Note: Troubleshooting Database Transaction Recovery [ID 1494886.1]

In this Document

  Purpose
  Details
  Concepts
  Symptoms
  Modes of Transaction Recovery
  Parallel Transaction Recovery
  Serial Transaction Recovery
  Detection of Transaction Recovery
  Identify Dead Transactions and their Sizes
  Identify undo segments containing dead transactions
  Identify the Time for transaction recovery to complete
  Query to identify the number of parallel Recovery Slaves
  Identify Objects involved in Transaction Recovery
  Data Dictionary Tables
 
Dump the UNDO block
  Disable Transaction Recovery
  Suggestions
  Information to collect for Oracle Support
  Additional Resources
  References

Applies to:

Oracle Server - Enterprise Edition - Version 9.2.0.8 and later
Information  in this document applies to any platform.

Purpose

To discuss the concepts, modes and detection of Database Transaction recovery.

Details

Concepts

Transaction recovery involves rolling back all uncommitted transactions of a failed instance. These are "in-progress" transactions that did not commit and that Oracle needs to undo. It is possible for uncommitted transactions to get saved to disk. In this case, Oracle uses undo data to reverse the effects of any changes that were written to the datafiles but not yet committed.

Oracle applies undo blocks to roll back uncommitted changes in data blocks that were either written before the crash or introduced by redo application during cache recovery. This normally happens during the Roll backward phase when the DB is restarted. Transaction Recovery can be performed by either the Server process which initiated the Transaction or the SMON process (in case the Server process is dead).

SMON process takes over the recovery when

  • Server process is dead / crashed.
  • Instance itself is crashed

Undo blocks (whether in rollback segments or automatic undo tablespaces) record database actions that may need to be undone during certain database operations. In database recovery, the undo blocks roll back the effects of uncommitted transactions previously applied by the rolling forward phase.

After the roll forward, any changes that were not committed must be undone. Oracle applies undo blocks to roll back uncommitted changes in data blocks that were either written before the failure or introduced by redo application during cache recovery. This process is called rolling back or transaction recovery.

              Description of Figure 15-3 follows

                        Fig 1: Basic Recovery Steps: Rolling Forward and Rolling Back

Symptoms


  • High CPU Utilization by SMON process
  • Database may hang during large transaction recovery.
  • If Database is shutdown abort, then the database may hang during consequent startup.
  • Database repeatedly crashes while open.
  • Database internal errors which could result in transaction recovery failure.

Modes of Transaction Recovery


Parallel Transaction Recovery

Recovery occurs in parallel mode. Several parallel slave processes will be spawned and will be involved in recovery. This is also termed as Fast Start Parallel Rollback. The background process SMON acts as a coordinator and rolls back a set of transactions in parallel using multiple server processes. To enable Parallel recovery mode, set the parameter FAST_START_PARALLEL_ROLLBACK to LOW / HIGH.

Values:

  • FALSE - Parallel rollback is disabled
  • LOW - Limits the maximum degree of parallelism to 2 * CPU_COUNT
  • HIGH -Limits the maximum degree of parallelism to 4 * CPU_COUNT

If you change the value of this parameter, then transaction recovery will be stopped and restarted with the new implied degree of parallelism. Fast-start parallel rollback is mainly useful when a system has transactions that run a long time before committing, especially parallel INSERT, UPDATE, and DELETE operations. When SMON discovers that the amount of recovery work is above a certain threshold, it automatically begins parallel rollback by dispersing the work among several parallel processes: process 1 rolls back one transaction, process 2 rolls back a second transaction, and so on. The threshold is the point at which parallel recovery becomes cost-effective, in other words, when parallel recovery takes less time than serial recovery.

Oracle® Database Reference 11g Release 2 (11.2)
Part Number E25513-02

http://docs.oracle.com/cd/E11882_01/server.112/e25513/initparams089.htm#REFRN10059

Serial Transaction Recovery

This mode recovers the transaction sequentially. Many of the times, serial transaction recovery will be faster. Setting the FAST_START_PARALLEL_ROLLBACK parameter to false will enable the serial transaction recovery.

Detection of Transaction Recovery


Identify Dead Transactions and their Sizes

select ktuxeusn  USN, ktuxeslt Slot, ktuxesqn  Seq, ktuxesta State, ktuxesiz Undo
from x$ktuxe 
where ktuxesta <> 'INACTIVE' 
and ktuxecfl like '%DEAD%'
order by ktuxesiz  asc;

===========

ktuxeusn –  Undo Segment Number

ktuxeslt   –  Slot number

ktuxesqn –  Sequence

ktuxesta  –  State

ktuxesiz  –  Undo Blocks Remaining

ktuxecfl   –  Flag

===========

  USN       SLOT        SEQ         STATE          UNDO

-------    ----------    -------     ----------    ----------

     9            9         335         ACTIVE         10337

Identify undo segments containing dead transactions

select  useg.segment_name, useg.segment_id, useg.tablespace_name, useg.status
from dba_rollback_segs useg
where useg.segment_id  in (select unique ktuxeusn
                                        from x$ktuxe
                                        where ktuxesta <> 'INACTIVE'
                                        and ktuxecfl like '%DEAD%');

SEGMENT_NAME     SEGMENT_ID          TABLESPACE_NAME                STATUS

--------------------   -----------------     ---------------------------          ------------

_SYSSMU9$                     9                      UNDOTBS1                         ONLINE


select  usn,name
from v$rollname
where usn in (select unique ktuxeusn
                    from x$ktuxe
                    where ktuxesta <> 'INACTIVE'
                    and ktuxecfl like '%DEAD%');

Identify the Time for transaction recovery to complete

select usn, state, undoblockstotal "Total", undoblocksdone "Done", undoblockstotal-undoblocksdone "ToDo",
         decode(cputime,0,'unknown',sysdate+(((undoblockstotal-undoblocksdone) / (undoblocksdone / cputime)) / 86400))  "Estimated time to complete" 
from v$fast_start_transactions;

  USN      STATE               Total       Done      ToDo      Estimated time to Complete

-------- ----------------     --------   --------   --------    ---------------------------------

    5    RECOVERING        16207     14693      1514     11-may-2012 08:05:40

    4    RECOVERED          7453       7453         0        11-may-2012 08:05:16

    9    RECOVERED         10337      10337       0         11-may-2012 08:05:16

Query to identify the number of parallel Recovery Slaves

select * from v$fast_start_servers;

STATE                UNDOBLOCKSDONE        PID         XID

-----------          --------------------------    ---------- ----------------

RECOVERING              54                        18       001F005C00001BD6

RECOVERING               0                         20      001F005C00001BD6

RECOVERING               0                         21      001F005C00001BD6

RECOVERING               0                         22      001F005C00001BD6

Column STATE shows the state of the server being IDLE or RECOVERING. If only 1 process is in state RECOVERING while the other processes are in state IDLE, then you should disable Parallel Transaction Recovery. Normally these queries will show the progress of the transaction recovery.

If there is no progress in the number of blocks, then we can consider disabling the parallel recovery and let the recovery continue in serial transaction.

Identify Objects involved in Transaction Recovery


Data Dictionary Tables

select decode(px.qcinst_id,NULL,username,' - '||lower(substr(s.program,length(s.program)-4,4) ) ) "Username",
         decode(px.qcinst_id,NULL, 'QC', '(Slave)') "QC/Slave" ,
         to_char( px.server_set) "Slave Set",
         to_char(s.sid) "SID",
         decode(px.qcinst_id, NULL ,to_char(s.sid) ,px.qcsid) "QC SID",
         px.req_degree "Requested DOP",
         px.degree "Actual DOP"
from gv$px_sessionpx, gv$session s
where px.sid=s.sid (+)
and px.serial#=s.serial#
order by 5 , 1 desc ;

ORA User       QC/Slav     Slave Set   SID   QC SID     Requested DOP    Actual DOP

----------      -----------     ----------  ------   --------   -------------------  -----------------

                       QC                          159     159   

-            p007   (Slave)        1         137     159                 8                        8

-            p006   (Slave)        1         138     159                 8                        8

-            p005   (Slave)        1         135     159                 8                        8

-            p004   (Slave)        1         132     159                 8                        8

-            p003   (Slave)        1         136     159                 8                        8

-            p002   (Slave)        1         130     159                 8                        8

-            p001   (Slave)        1         133     159                 8                        8

-            p000   (Slave)        1         145     159                 8                        8

select distinct current_obj#
from v$active_session_history
where SESSION_ID=145
and SESSION_SERIAL#=15;   --  (This Query to be run multiple times and you should be able to see the Current Object id consistently)

CURRENT_OBJ#

-------------------

       77211

select Object_name, object_type
from dba_objects
where object_id=77211;

OBJECT_NAME  OBJECT_TYPE

---------------  ----------------

EMPLOYEES          TABLE

Note: This method helps when the recovery is Parallel Transaction Recovery.


Dump the UNDO block

1.Identify the UNDO name using the undo segment name and transaction details.

select sysdate, b.nameuseg, a.ktuxeusnxid_usn, a.ktuxesltxid_slot, a.ktuxesqnxid_seq
from x$ktuxe a, undo$ b
where a.ktuxesta = 'ACTIVE'
and a.ktuxecfl like '%DEAD%'
and a.ktuxeusn = b.us#;


2 Dump the Undo Block

Alter system dump undo block ‘<undo seg name>’ xid xidusn xidslot xidsqn;

Eg:

sqlplus / as sysdba

oradebug setmypid

Alter system dump undo block ‘_SYSSMU16$’ xid 16 21 1408186;

oradebug tracefile_name

3 Upload the resultant Trace file to Oracle Support.
   This will help Oracle Support to review the UNDO dump and identify the Object ID.

Disable Transaction Recovery


  • Disabling Transaction recovery could temporarily open up the database.
  • SMON will stop performing all the transaction recovery.
  • There is no way to stop only one Transaction recovery / few transaction recoveries.
  • When the transaction recovery is disabled, any data modified by this dead transaction but required by other transactions will be recovered on "as needed" basis.
  • We always suggest to disable the transaction recovery only with the help of Oracle Support.

NOTE: Oracle does not recommend to Disable the Transaction recovery and open the Database for Business Use.

Suggestions


  • Check the long running Transactions before shutting down the Database
  • DBA and Application team to be aware of any long running Jobs.
  • Plan for the downtime and inform application team accordingly so that they do not schedule business critical / long running jobs.
  • Do not Panic. Transaction recovery is part of Database normal Operations

Information to collect for Oracle Support


1) Complete Database Alert log (at least logs from last 2 startups)

2) SMON Trace file (if available)

3) Undo Block dump Trace file (if taken).

4) Output of following Queries (3 outputs in intervals of at least 5 min)

=============

spool /tmp/transaction_recovery.html
set markup html on
set time on

alter session set nls_date_format='dd-mon-yyyy hh24:mi:ss';

select sysdate from dual;

show parameter fast_start_parallel_rollback

select ktuxeusn USN, ktuxeslt Slot, ktuxesqn Seq, ktuxesta State,ktuxesiz Undo from x$ktuxe where ktuxesta <> 'INACTIVE' and ktuxecfl like '%DEAD%' order by ktuxesiz asc;

select usn, state, undoblockstotal "Total", undoblocksdone "Done", undoblockstotal-undoblocksdone "ToDo", decode(cputime,0,'unknown',sysdate+(((undoblockstotal-undoblocksdone) / (undoblocksdone / cputime)) / 86400))
"Estimated time to complete" from v$fast_start_transactions;

select * from v$fast_start_servers;

select ktuxeusn, to_char(sysdate,'DD-MON-YYYY HH24:MI:SS') "Time", ktuxesiz, ktuxesta from x$ktuxe where ktuxecfl = 'DEAD';

set markup html off
spool off

=============

Additional Resources


Note 238507.1 How to Disable Parallel Transaction Recovery When Parallel Txn Recovery is Active
Note 144332.1 Parallel Rollback may hang database, Parallel query servers get 100% cpu
Note 414242.1 Database Hangs Because SMON Is Taking 100% CPU Doing Transaction Recovery
Note 416823.1 Diagnoing BUG 5016142 enqueue ENQ - SS contention
Note 461480.1 FAQ – Automatic Undo Management (AUM) / System Managed Undo (SMU)
Note 266159.1 SMON: ABOUT TO RECOVER UNDO SEGMENT %s messages in alert log
Note 1060831.6 Rollback Is Taking A Long Time After Session Killed
Note 11790175.8 Bug 11790175 - SMON spins in rollback of LOB due to double allocated block like bug 11814891
Note 8240762.8 Bug 8240762 - Undo corruptions with ORA-600 [4193]/ORA-600 [4194] or ORA-600 [4137] / SMON may spin to recover transaction
Note 11902008.8 Bug 11902008 - SMON may crash with ORA-00600 [kcbgcur_3] or ORA-600 [kcbnew_3] during Transaction recovery

References

NOTE:1506115.1 - Master Note: Oracle Transaction Management (Local) Overview
發表評論
所有評論
還沒有人評論,想成為第一個評論的人麼? 請在上方評論欄輸入並且點擊發布.
相關文章