IPS Blocked an Archive Log for Unknown Reasons, Leaving the Standby DB Out of Sync (ORA-03135, ORA-16055)


Environment: HP-UX 11.31, Oracle 10.2.0.4; primary DB: 3-node RAC; standby DB: 2-node RAC

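Symptom: The standby DB could not stay in sync. More and more archive logs were left unapplied, and there was a gap: the standby DB was waiting for log 113171, but that log never arrived.

The alert log on node 2 of the primary DB showed the following errors: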

ORA-03135: connection lost contact
Tue Jul 24 12:18:06 2012
FAL[server, ARC5]: FAL archival, error 3135 closing archivelog file 'sfc12stb'
FAL[server, ARC5]: FAL archive failed, see trace file.
Tue Jul 24 12:18:06 2012
Errors in file /apps/oracle/admin/sfc12db/bdump/sfc12db2_arc5_16226.trc:
ORA-16055: FAL request rejected
ARCH: FAL archive failed. Archiver continuing
Tue Jul 24 12:18:06 2012
ORACLE Instance sfc12db2 - Archival Error. Archiver continuing.
Tue Jul 24 12:19:59 2012
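When the alert log is long, a short script can pull out the lines that matter. The sketch below is only an illustration and was not part of the original troubleshooting: it assumes the 10g alert log layout shown above (a timestamp line followed by message lines) and reports each ORA-03135 or ORA-16055 line together with the most recent timestamp before it. The function name `scan_alert_log` is hypothetical.

```python
import re
from typing import List, Sequence, Tuple

# A 10g alert log timestamp line, e.g. "Tue Jul 24 12:18:06 2012"
TIMESTAMP_RE = re.compile(r"^[A-Z][a-z]{2} [A-Z][a-z]{2} +\d{1,2} \d{2}:\d{2}:\d{2} \d{4}$")
ORA_RE = re.compile(r"ORA-(\d{5})")


def scan_alert_log(lines: Sequence[str],
                   codes: Tuple[str, ...] = ("03135", "16055")) -> List[Tuple[str, str]]:
    """Return (most recent timestamp, line) for each line mentioning one of `codes`.

    Lines that appear before the first timestamp get an empty timestamp."""
    hits: List[Tuple[str, str]] = []
    last_ts = ""
    for raw in lines:
        line = raw.strip()
        if TIMESTAMP_RE.match(line):
            last_ts = line  # remember the timestamp for the following message lines
            continue
        if any(code in codes for code in ORA_RE.findall(line)):
            hits.append((last_ts, line))
    return hits
```

Typical use would be `scan_alert_log(open(path).read().splitlines())`, where `path` points at the instance's alert log in bdump.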

Solution

I searched MOS (My Oracle Support) for the errors in the alert log and found this note:

ORA-03135: connection lost contact while shipping from Primary Server to Standby server [ID 739522.1] (see Appendix 1)

Its Changes section reads: "There may be a firewall rule changed or the firewall is newly installed."

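Based on this, I contacted my colleagues in the network department. They seriously misled me: they said the link between my primary and standby databases was a purely routed network with no firewall, so I went on trying one workaround after another.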

1. Restarted node 2 of the primary.

On the standby DB:

alter database recover managed standby database cancel;

alter database recover managed standby database disconnect;

Still no effect. Whenever the standby tried to fetch gap log 113171, node 2 rejected the request with ORA-03135: connection lost contact.
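The standby was stuck because one sequence in thread 2 never arrived, and managed recovery cannot apply the logs that come after it. A minimal sketch of how such a gap is found, assuming you already have the set of sequence numbers received per thread: any hole between two received sequences is reported as a (thread, low, high) range, mirroring the THREAD#, LOW_SEQUENCE# and HIGH_SEQUENCE# columns of V$ARCHIVE_GAP. `find_gaps` is a hypothetical helper, not an Oracle API.

```python
from typing import Dict, Iterable, List, Tuple


def find_gaps(received: Dict[int, Iterable[int]]) -> List[Tuple[int, int, int]]:
    """Return (thread, low_sequence, high_sequence) for every hole in the
    sequences received per thread. Only holes below the highest received
    sequence are reported; logs not yet generated are not gaps."""
    gaps: List[Tuple[int, int, int]] = []
    for thread in sorted(received):
        seqs = sorted(set(received[thread]))
        # Compare each sequence with its successor; a jump > 1 is a gap.
        for prev, cur in zip(seqs, seqs[1:]):
            if cur - prev > 1:
                gaps.append((thread, prev + 1, cur - 1))
    return gaps
```

For example, `find_gaps({2: [113169, 113170, 113172]})` returns `[(2, 113171, 113171)]`, the single missing log from this incident.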

2. Restarted all nodes in turn and also rebooted the standby DB's operating system. No effect: gap log 113171 still could not be shipped.

3. Tried to transfer log 113171 manually.

A. Copy the log with sequence 113171 out of ASM to a file system, as follows:

1. Log on as SYS to the target database that is local to the ASM instance.

2. Create the source directory object in the target database:

SQL>create or replace directory SOURCE_DIR as '+DGARCH/SFC12DB/ARCHIVELOG/2012_07_24';

Directory created.

(In this example, +DGARCH/SFC12DB/ARCHIVELOG/2012_07_24 is the source directory where the archive log is located and from which you wish to copy it.)

3. Create the destination directory object in the database:

SQL>create or replace directory ORACLE_DEST as '/tmp';

Directory created.

(In this example, /tmp is the destination directory to which the archive log is copied.)

4. Execute the DBMS_FILE_TRANSFER package:

SQL> BEGIN
  -- copy the archive log from the ASM directory object to the /tmp directory object
  dbms_file_transfer.copy_file(
    source_directory_object      => 'SOURCE_DIR',
    source_file_name             => 'thread_2_seq_113171.540.789475773',
    destination_directory_object => 'ORACLE_DEST',
    destination_file_name        => 'thread_2_seq_113171.540.789475773');
END;
/
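If a directory object name is passed with stray blanks, for example ' SOURCE_DIR ', the call looks for a directory object by that padded name, which does not exist. A small sketch that builds the anonymous block with trimmed, upper-cased and validated directory names (unquoted names created with CREATE DIRECTORY are stored in upper case). `build_copy_file_block` is a hypothetical helper, and the name check is a simplified assumption about valid unquoted identifiers (letter first, then letters, digits, `_`, `$`, `#`, at most 30 characters).

```python
import re

# Simplified rule for an unquoted Oracle identifier (30-character limit as in 10g).
_NAME_RE = re.compile(r"^[A-Z][A-Z0-9_$#]{0,29}$")


def build_copy_file_block(src_dir: str, src_file: str, dst_dir: str, dst_file: str) -> str:
    """Build an anonymous PL/SQL block that calls DBMS_FILE_TRANSFER.COPY_FILE."""
    src_dir, dst_dir = src_dir.strip().upper(), dst_dir.strip().upper()
    for name in (src_dir, dst_dir):
        if not _NAME_RE.match(name):
            raise ValueError(f"invalid directory object name: {name!r}")

    def q(s: str) -> str:
        # SQL string literal with embedded single quotes doubled
        return "'" + s.replace("'", "''") + "'"

    return (
        "BEGIN\n"
        "  DBMS_FILE_TRANSFER.COPY_FILE(\n"
        f"    source_directory_object      => {q(src_dir)},\n"
        f"    source_file_name             => {q(src_file)},\n"
        f"    destination_directory_object => {q(dst_dir)},\n"
        f"    destination_file_name        => {q(dst_file)});\n"
        "END;\n/"
    )
```

The generated text can be pasted into SQL*Plus as the SYS user described in step 1.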

P.S. In Oracle 11g, files in ASM can be copied directly (for example with the asmcmd cp command), so this workaround is not needed.

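B. Transfer the log to the standby DB via FTP.

The problem: this archive log is 17 MB.

The FTP transfer stalled at 13.9 MB and never completed. I tried transferring files larger than 17 MB, and they went through; smaller files went through too. Was something wrong with this log? Then I tried sending the log within the same subnet, and surprisingly it worked (primary DB: 172.16.50.31/32/33; standby DB: 172.16.51.151/152). Baffling. By then it was past 1 a.m., 13 hours after the problem started. After more than ten hours of stress and hard thinking I was exhausted, so I went to bed and planned to have the network team investigate the next day.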
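Comparing the original and the transferred copy byte by byte shows whether a transfer stopped short (as at 13.9 MB here) or whether content was altered in transit. This sketch assumes both copies fit in memory, which is reasonable for a 17 MB log. It reports both sizes, the first differing offset, and whether the SHA-256 digests match. `compare_copies` is a hypothetical name.

```python
import hashlib
from typing import Dict, Optional


def compare_copies(source: bytes, received: bytes) -> Dict[str, object]:
    """Compare a source file with its transferred copy.

    first_diff_offset is None when the received bytes are an unaltered
    prefix of the source (i.e. the transfer was only truncated)."""
    first_diff: Optional[int] = None
    for i, (a, b) in enumerate(zip(source, received)):
        if a != b:
            first_diff = i
            break
    return {
        "source_bytes": len(source),
        "received_bytes": len(received),
        "complete": len(received) == len(source) and first_diff is None,
        "first_diff_offset": first_diff,
        "sha256_match": hashlib.sha256(source).digest() == hashlib.sha256(received).digest(),
    }
```

For example, `compare_copies(open(a, "rb").read(), open(b, "rb").read())` on the two copies of the archive log: a truncated copy shows `complete` as False with `first_diff_offset` None, while altered content shows the offset of the first changed byte.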


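4. After a sleepless night, I went to the network team the next day. Why could log 113171 not be transferred completely across subnets? Together with two network engineers I tried sending the file between different subnets, and it still could not be transferred completely. Then one of them remembered that our company also runs an IPS (intrusion prevention system). Checking its logs, we finally found the root cause: for unknown reasons, the IPS treated file 113171 as a DoS attack and blocked the transfer between the primary and the standby. (I felt like strangling them. The day before I had asked them many times whether there was a firewall between these two subnets, and they kept insisting there was none. Fine, spare me the explanation that an IPS is not the same thing as a firewall...)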
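Since other files of similar size crossed the same path, the block seems to depend on this file's content rather than its size. One way to narrow down which part of a file triggers an IPS (not something done in this incident) is to split it into pieces and send them one at a time; a piece that stalls points at a suspicious byte range. A minimal sketch, with `split_file` as a hypothetical helper and the chunk size left to the caller:

```python
import os
from typing import List, Tuple


def split_file(path: str, out_dir: str, chunk_size: int) -> List[Tuple[str, int, int]]:
    """Split `path` into pieces of at most `chunk_size` bytes.

    Returns (piece_path, start_offset, length) for each piece, so a piece that
    fails to cross the network can be mapped back to a byte range of the original."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    pieces: List[Tuple[str, int, int]] = []
    base = os.path.basename(path)
    with open(path, "rb") as src:
        index = 0
        while True:
            offset = src.tell()  # byte offset of this piece in the original file
            data = src.read(chunk_size)
            if not data:
                break
            piece = os.path.join(out_dir, f"{base}.part{index:04d}")
            with open(piece, "wb") as dst:
                dst.write(data)
            pieces.append((piece, offset, len(data)))
            index += 1
    return pieces
```

Halving the chunk size around a failing piece and repeating gives a simple bisection toward the bytes the IPS reacts to.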


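5. With the cause identified, the network team modified the IPS policy, after which the log could be transferred completely.

Manually register the log on the standby DB:

alter database register logfile '/apps/oracle/1_117472_679867280.dbf';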
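Before registering a restored file, it is worth checking that its thread and sequence are the ones the standby is actually missing. A sketch, assuming archive file names follow LOG_ARCHIVE_FORMAT '%t_%s_%r.dbf' (thread, sequence, resetlogs ID), which matches the name used above; `parse_archived_log_name` and `register_statement` are hypothetical helpers.

```python
import re
from typing import NamedTuple


class ArchivedLogName(NamedTuple):
    thread: int
    sequence: int
    resetlogs_id: int


# Assumes LOG_ARCHIVE_FORMAT = '%t_%s_%r.dbf'
_ARCH_RE = re.compile(r"^(\d+)_(\d+)_(\d+)\.dbf$")


def parse_archived_log_name(path: str) -> ArchivedLogName:
    """Extract thread, sequence and resetlogs ID from the file name part of `path`."""
    m = _ARCH_RE.match(path.rsplit("/", 1)[-1])
    if not m:
        raise ValueError(f"not a thread_sequence_resetlogs.dbf name: {path!r}")
    return ArchivedLogName(*(int(g) for g in m.groups()))


def register_statement(path: str) -> str:
    """Build the REGISTER LOGFILE statement, failing early on unexpected names."""
    parse_archived_log_name(path)
    return "ALTER DATABASE REGISTER LOGFILE '" + path.replace("'", "''") + "';"
```

Compare the parsed thread and sequence with the gap the standby reports before running the statement.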

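Once this gap was resolved, the standby DB reported two more gaps, on the other two threads.

But those two logs were already gone from disk (our backup policy deletes archive logs after backing them up and keeps only the most recent six hours), so they had to be restored from tape.

The script: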

RUN
{
  allocate channel dev0 type 'sbt_tape'
    parms 'SBT_LIBRARY=/opt/omni/lib/libob2oracle8_64bit.so,ENV=(OB2BARTYPE=Oracle8,OB2APPNAME=sfc12db,OB2BARLIST=sfc12rc3_fullDB)';
  # restore into /tmp instead of the default archive destination
  set archivelog destination to '/tmp';
  # restore only the log that is needed
  restore archivelog sequence 113854 thread 3;
  release channel dev0;
}
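When several logs have to come back from tape, the RUN block can be generated rather than edited by hand. A sketch that reproduces the structure of the script above for any list of (thread, sequence) pairs; the channel name dev0 and the tape PARMS come from that script, and `build_rman_restore` is a hypothetical helper. The PARMS value and destination are inserted verbatim, so single quotes in them are rejected.

```python
from typing import Iterable, Tuple


def build_rman_restore(channel_parms: str, dest: str,
                       logs: Iterable[Tuple[int, int]]) -> str:
    """Build an RMAN RUN block that restores individual archived logs from tape.

    `logs` holds (thread, sequence) pairs."""
    if "'" in channel_parms or "'" in dest:
        raise ValueError("single quotes are not allowed in PARMS or the destination")
    lines = [
        "RUN",
        "{",
        "  ALLOCATE CHANNEL dev0 TYPE 'SBT_TAPE'",
        f"    PARMS '{channel_parms}';",
        f"  SET ARCHIVELOG DESTINATION TO '{dest}';",
    ]
    for thread, seq in logs:
        lines.append(f"  RESTORE ARCHIVELOG SEQUENCE {seq} THREAD {thread};")
    lines += ["  RELEASE CHANNEL dev0;", "}"]
    return "\n".join(lines)
```

Writing the result to a file and running it with `rman target / cmdfile=...` keeps the restore repeatable.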

A range of sequences can also be restored in one command:

1. Using FROM ... UNTIL:

restore archivelog from sequence 69346 until sequence 69397 thread 1;
restore archivelog from sequence 75193 until sequence 75263 thread 2;

2. Using BETWEEN:

restore archivelog sequence between 134 and 136 thread 1;
restore archivelog sequence between 56 and 58 thread 2;
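Missing sequences usually come in runs, so collapsing them into contiguous ranges keeps the restore list short. A sketch that turns an unordered list of missing sequence numbers for one thread into BETWEEN commands in the form shown above; `between_commands` is a hypothetical helper.

```python
from typing import Iterable, List, Optional


def between_commands(thread: int, sequences: Iterable[int]) -> List[str]:
    """Collapse sequence numbers into contiguous ranges, one RESTORE command per range."""
    def cmd(lo: int, hi: int) -> str:
        return f"restore archivelog sequence between {lo} and {hi} thread {thread};"

    commands: List[str] = []
    start: Optional[int] = None
    prev: Optional[int] = None
    for s in sorted(set(sequences)):
        if start is None:
            start = prev = s
        elif s == prev + 1:
            prev = s  # still inside the current run
        else:
            commands.append(cmd(start, prev))  # close the run, start a new one
            start = prev = s
    if start is not None:
        commands.append(cmd(start, prev))
    return commands
```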


Then, on the standby database:

alter database recover managed standby database cancel;

alter database recover managed standby database disconnect;

After all logs have been applied:

alter database recover managed standby database cancel;

alter database recover managed standby database using current logfile disconnect;

This enables Real-Time Apply, and the problem was resolved.


Open question: all the earlier archive logs had been shipped normally without being blocked. What made the IPS treat this particular log as a DoS attack?


Appendix 1

ORA - 03135 : connection lost contact while shipping from Primary Server to Standby server [ID 739522.1]



Modified 19-AUG-2011, Type PROBLEM, Status MODERATED



This document is being delivered to you via Oracle Support's Rapid Visibility (RaV) process and therefore has not been subject to an independent technical review.

Applies to:

Oracle Net Services - Version: 10.1.0.2.0 to 11.2.0.2 - Release: 10.1 to 11.2
Information in this document applies to any platform.

Symptoms

The customer reported that the archiver process intermittently hits an error, and the following is seen in the primary node's alert log:

Errors in file /opt/oracle/admin/ORCL/bdump/orcl1_arc4_28059.trc:
ORA-03135: connection lost contact
Mon Sep 22 22:24:51 2008
FAL[server, ARC4]: FAL archive failed, see trace file.
Mon Sep 22 22:24:51 2008
Errors in file /opt/oracle/admin/ORCL/bdump/orcl1_arc4_28059.trc:
ORA-16055: FAL request rejected


Changes

There may be a firewall rule changed or the firewall is newly installed.

Cause

Mainly ORA-3135 occurs when the connection is broken because of underlying network issues.

In this case, an intermediate firewall between the primary and secondary servers is altering the data inside the sqlnet packet.

The trace clearly shows that the primary sends data containing

[22-SEP-2008 22:24:51:414] nspsend: 53 53 3D 28 50 52 4F 54 |SS=(PROT|
[22-SEP-2008 22:24:51:414] nspsend: 4F 43 4F 4C 3D 54 43 50 |OCOL=TCP|
[22-SEP-2008 22:24:51:414] nspsend: 29 28 48 4F 53 54 3D 70 |)(HOST=h|
[22-SEP-2008 22:24:51:414] nspsend: 6E 79 6D 65 72 63 75 72 |ostname1|
[22-SEP-2008 22:24:51:414] nspsend: 79 64 62 30 39 2D 76 69 |_db011-v|
[22-SEP-2008 22:24:51:414] nspsend: 72 74 29 28 50 4F 52 54 |ip)(PORT|
[22-SEP-2008 22:24:51:414] nspsend: 3D 31 35 32 31 29 29 61 |=1521))a|
[22-SEP-2008 22:24:51:414] nspsend: 20 4C 0B 05 01 00 0D 00 |.L......|


and the firewall / intermediate network device is changing the hostname to an IP address (possibly due to a rule set in the firewall):

[22-SEP-2008 22:24:51:586] nsprecv: 53 53 3D 28 50 52 4F 54 |SS=(PROT|
[22-SEP-2008 22:24:51:586] nsprecv: 4F 43 4F 4C 3D 54 43 50 |OCOL=TCP|
[22-SEP-2008 22:24:51:586] nsprecv: 29 28 48 4F 53 54 3D 31 |)(HOST=1|
[22-SEP-2008 22:24:51:586] nsprecv: 36 31 2E 32 32 31 2E 32 |11.111.1|
[22-SEP-2008 22:24:51:586] nsprecv: 31 37 2E 35 35 29 28 50 |23.45)(P|
[22-SEP-2008 22:24:51:586] nsprecv: 4F 52 54 3D 37 31 36 39 |ORT=7171|
[22-SEP-2008 22:24:51:586] nsprecv: 29 29 61 20 4C 0B 05 01 |))a.L...|
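The note's evidence is a byte-level comparison of what nspsend wrote on one side and what nsprecv read on the other. A sketch that extracts the hex payload from trace lines of the shape shown above and finds the first byte that differs; the line format is assumed from this excerpt, only the hex columns are decoded (the ASCII column on the right is ignored), and `decode_trace` and `first_difference` are hypothetical helpers.

```python
import re
from typing import List

# Matches e.g. "[22-SEP-2008 22:24:51:414] nspsend: 53 53 3D 28 50 52 4F 54 |SS=(PROT|"
_TRACE_RE = re.compile(r"\]\s+(nspsend|nsprecv):\s+((?:[0-9A-Fa-f]{2} ?)+)\|")


def decode_trace(lines: List[str], direction: str) -> bytes:
    """Concatenate the hex payload of all lines for `direction` ('nspsend' or 'nsprecv')."""
    out = bytearray()
    for line in lines:
        m = _TRACE_RE.search(line)
        if m and m.group(1) == direction:
            out += bytes.fromhex(m.group(2))  # fromhex skips the separating spaces
    return bytes(out)


def first_difference(sent: bytes, received: bytes) -> int:
    """Offset of the first differing byte, or -1 if one payload is a prefix of the other."""
    for i, (a, b) in enumerate(zip(sent, received)):
        if a != b:
            return i
    return -1
```

Feeding it the primary-side and receiver-side trace lines shows where the rewritten HOST value begins.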

Solution

Involve the firewall administration team to check whether the firewall / intermediate network device can be stopped from altering the sqlnet packet.

If that is not possible, matching sqlnet server traces need to be taken at support level to narrow down the issue.

Enabling the server trace on the primary and restarting the archiver process will generate sqlnet trace for the archiver process.

To trace only the archiver process, without generating tons of trace for other background processes and connections:

1. Enable server tracing on the primary.
2. Kill the archiver process that is archiving log files to the remote destination (don't kill the wrong archiver).
3. Disable tracing.

Do these three actions as quickly as possible. Tie the server trace to the new archiver process via its PID; the Oracle Net server trace has the PID in its file name.
The archiver process will be traced until it is stopped or killed, so you may wish to kill the process again once the error has been reproduced.

WARNING: there is a small chance that the process won't re-spawn.



Then enable sqlnet tracing on the standby server to capture the corresponding process that takes the requests from this archiver process.

Once the error is hit, match the archiver's sqlnet trace file against the standby trace to see the network issue, if any.

Note: you can easily cross-check the sqlnet server trace and the archiver process by the process ID, which is appended to the trace file name in bdump.
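The note ties the Oracle Net server trace to the archiver through the PID in the file names. A sketch that groups trace file names by that trailing PID, assuming names such as svr_28059.trc for Oracle Net server traces and orcl1_arc4_28059.trc for the archiver trace in bdump; `group_traces_by_pid` is hypothetical, and other naming schemes would need a different pattern.

```python
import re
from typing import Dict, Iterable, List

# Assumed naming: the PID is the last "_<digits>" before ".trc".
_PID_RE = re.compile(r"_(\d+)\.trc$")


def group_traces_by_pid(filenames: Iterable[str]) -> Dict[int, List[str]]:
    """Group trace file names by the trailing process ID; other files are skipped."""
    groups: Dict[int, List[str]] = {}
    for name in filenames:
        m = _PID_RE.search(name)
        if m:
            groups.setdefault(int(m.group(1)), []).append(name)
    return groups
```

For example, running it over `os.listdir()` of the bdump and Net trace directories puts the archiver trace and its matching server trace under the same PID key.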


For enabling tracing, refer to Note 395525.1: How to Enable Oracle SQLNet Client, Server, Listener, Kerberos and External procedure Tracing from Net Manager
