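Oracle RAC: one node reboots frequently

Symptoms:

This issue dates from 2011. The environment was Oracle 11gR2 RAC on Red Hat Linux, and one node of a 2-node RAC cluster rebooted frequently.

Root cause analysis:

Host logs

The VIP failed over to the other node and moved back after the reboot.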

node1

Nov 23 18:22:27 dtydb2 avahi-daemon[13096]: Withdrawing address record for 10.4.124.242 on bond2.

Nov 23 18:22:31 dtydb2 avahi-daemon[13096]: Withdrawing address record for 169.254.188.250 on bond1.

Nov 23 18:23:10 dtydb2 avahi-daemon[13096]: Registering new address record for 169.254.188.250 on bond1.

Nov 23 18:23:35 dtydb2 avahi-daemon[13096]: Registering new address record for 10.4.124.242 on bond2.

Nov 23 18:23:35 dtydb2 avahi-daemon[13096]: Withdrawing address record for 10.4.124.242 on bond2.

Nov 23 18:23:35 dtydb2 avahi-daemon[13096]: Registering new address record for 10.4.124.242 on bond2.

Nov 23 18:23:35 dtydb2 avahi-daemon[13096]: Withdrawing address record for 10.4.124.242 on bond2.

Nov 23 18:23:35 dtydb2 avahi-daemon[13096]: Registering new address record for 10.4.124.242 on bond2.

node2

Nov 23 18:22:31 dtydb1 avahi-daemon[13132]: Registering new address record for 10.4.124.242 on bond2.

Nov 23 18:22:31 dtydb1 avahi-daemon[13132]: Withdrawing address record for 10.4.124.242 on bond2.

Nov 23 18:22:31 dtydb1 avahi-daemon[13132]: Registering new address record for 10.4.124.242 on bond2.

Nov 23 18:22:31 dtydb1 avahi-daemon[13132]: Withdrawing address record for 10.4.124.242 on bond2.

Nov 23 18:22:31 dtydb1 avahi-daemon[13132]: Registering new address record for 10.4.124.242 on bond2.

Nov 23 18:23:34 dtydb1 avahi-daemon[13132]: Withdrawing address record for 10.4.124.242 on bond2.
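The VIP movement in the avahi-daemon lines above is easier to follow as a single timeline. The following is a minimal sketch, assuming the syslog lines have exactly the format shown in this post; the function name `vip_events` is hypothetical and not part of any Oracle or avahi tooling.

```python
import re

# Matches syslog lines such as:
# Nov 23 18:22:27 dtydb2 avahi-daemon[13096]: Withdrawing address record for 10.4.124.242 on bond2.
AVAHI_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s[\d:]{8})\s(?P<host>\S+)\savahi-daemon\[\d+\]:\s"
    r"(?P<action>Withdrawing|Registering new) address record for "
    r"(?P<ip>[\d.]+) on (?P<iface>\S+?)\.$"
)

def vip_events(lines, vip):
    """Return (timestamp, host, action, interface) tuples for one address."""
    events = []
    for line in lines:
        m = AVAHI_RE.match(line.strip())
        if m and m.group("ip") == vip:
            action = "withdraw" if m.group("action") == "Withdrawing" else "register"
            events.append((m.group("ts"), m.group("host"), action, m.group("iface")))
    return events
```

Feeding it the combined lines from both hosts with `vip="10.4.124.242"` lists, in file order, which host dropped the address and which host picked it up.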

Database logs

The database instance could not connect to ASM, so it restarted.

ORA-15064: communication failure with ASM instance

ORA-03113: end-of-file on communication channel

ASM logs

The ASM instance alert log shows the following:

Wed Nov 23 18:22:29 2011

NOTE: client exited [13858]

Wed Nov 23 18:22:29 2011

NOTE: ASMB process exiting, either shutdown is in progress

NOTE: or foreground connected to ASMB was killed.

Wed Nov 23 18:22:29 2011

PMON (ospid: 13797): terminating the instance due to error 481

Wed Nov 23 18:22:29 2011

ORA-1092 : opitsk aborting process

Wed Nov 23 18:22:30 2011

ORA-1092 : opitsk aborting process

Wed Nov 23 18:22:30 2011

ORA-1092 : opitsk aborting process

Wed Nov 23 18:22:30 2011

ORA-1092 : opitsk aborting process

Wed Nov 23 18:22:30 2011

License high water mark = 16

Instance terminated by PMON, pid = 13797

USER (ospid: 9488): terminating the instance

Instance terminated by USER, pid = 948

ocssd.log: the remote node still has a disk heartbeat (disk HB) but no network heartbeat (network HB).

2011-11-23 18:22:20.512: [ CSSD][1111939392]clssnmPollingThread: node dtydb1 (1) is impending reconfig, flag 394254, misstime 15910

2011-11-23 18:22:20.512: [ CSSD][1111939392]clssnmPollingThread: local diskTimeout set to 27000 ms, remote disk timeout set to 27000, impending reconfig status(1)

2011-11-23 18:22:20.512: [ CSSD][1106696512]clssnmvDHBValidateNCopy: node 1, dtydb1, has a disk HB, but no network HB, DHB has rcfg 216519746, wrtcnt, 1004978, LATS 1030715744, lastSeqNo 946497, uniqueness 1321449141, timestamp 1322043740/933687024

2011-11-23 18:22:21.515: [ CSSD][1106696512]clssnmvDHBValidateNCopy: node 1, dtydb1, has a disk HB, but no network HB, DHB has rcfg 216519746, wrtcnt, 1004980, LATS 1030716744, lastSeqNo 1004978, uniqueness 1321449141, timestamp 1322043741/933688024

2011-11-23 18:22:22.518: [ CSSD][1106696512]clssnmvDHBValidateNCopy: node 1, dtydb1, has a disk HB, but no network HB, DHB has rcfg 216519746, wrtcnt, 1004982, LATS 1030717754, lastSeqNo 1004980, uniqueness 1321449141, timestamp 1322043742/933689044

2011-11-23 18:22:23.520: [ CSSD][1106696512]clssnmvDHBValidateNCopy: node 1, dtydb1, has a disk HB, but no network HB, DHB has rcfg 216519746, wrtcnt, 1004984, LATS 1030718754, lastSeqNo 1004982, uniqueness 1321449141, timestamp 1322043743/933690044

2011-11-23 18:22:24.140: [ CSSD][1113516352]clssnmSendingThread: sending status msg to all nodes

2011-11-23 18:22:24.141: [ CSSD][1113516352]clssnmSendingThread: sent 4 status msgs to all nodes

2011-11-23 18:22:24.523: [ CSSD][1106696512]clssnmvDHBValidateNCopy: node 1, dtydb1, has a disk HB, but no network HB, DHB has rcfg 216519746, wrtcnt, 1004986, LATS 1030719754, lastSeqNo 1004984, uniqueness 1321449141, timestamp 1322043744/933691044

2011-11-23 18:22:25.525: [ CSSD][1106696512]clssnmvDHBValidateNCopy: node 1, dtydb1, has a disk HB, but no network HB, DHB has rcfg 216519746, wrtcnt, 1004988, LATS 1030720754, lastSeqNo 1004986, uniqueness 1321449141, timestamp 1322043745/933692044

2011-11-23 18:22:26.527: [ CSSD][1106696512]clssnmvDHBValidateNCopy: node 1, dtydb1, has a disk HB, but no network HB, DHB has rcfg 216519746, wrtcnt, 1004990, LATS 1030721764, lastSeqNo 1004988, uniqueness 1321449141, timestamp 1322043746/933693044
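To see how long the network heartbeat was missing, the repeated "has a disk HB, but no network HB" lines can be summarized per node. This is a rough sketch that assumes the ocssd.log line layout shown above; `network_hb_gaps` is a hypothetical helper name.

```python
import re

# Matches ocssd.log lines such as:
# 2011-11-23 18:22:20.512: [ CSSD][1106696512]clssnmvDHBValidateNCopy: node 1, dtydb1, has a disk HB, but no network HB, ...
NO_NET_HB_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}): \[\s*CSSD\]\[\d+\]"
    r"clssnmvDHBValidateNCopy: node (?P<num>\d+), (?P<name>\S+), "
    r"has a disk HB, but no network HB"
)

def network_hb_gaps(lines):
    """Map node name -> (first timestamp, last timestamp, number of warnings)."""
    gaps = {}
    for line in lines:
        m = NO_NET_HB_RE.match(line.strip())
        if not m:
            continue
        name, ts = m.group("name"), m.group("ts")
        first, _, count = gaps.get(name, (ts, ts, 0))
        gaps[name] = (first, ts, count + 1)
    return gaps
```

The first and last timestamps bracket the window in which only the disk heartbeat was seen, which can then be compared with the ping log from the private network.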

Ping log collected after deploying a monitoring script

Packet loss starts at 18:21:56 (packets 117 through 150 are lost):

64 bytes from 192.168.100.1: icmp_seq=114 ttl=64 time=0.342 ms

64 bytes from 192.168.100.1: icmp_seq=115 ttl=64 time=0.444 ms

64 bytes from 192.168.100.1: icmp_seq=116 ttl=64 time=0.153 ms

--- 192.168.100.1 ping statistics ---

150 packets transmitted, 116 received, 22% packet loss, time 149054ms

rtt min/avg/max/mdev = 0.084/0.246/0.485/0.099 ms

Wed Nov 23 18:22:31 CST 2011
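Finding the lost sequence numbers by eye gets tedious in a long ping log. The sketch below assumes Linux `ping` output like the lines above and reports the missing `icmp_seq` ranges; `missing_ranges` and its `last_sent` parameter are hypothetical names chosen for illustration.

```python
import re

SEQ_RE = re.compile(r"icmp_seq=(\d+)")

def missing_ranges(lines, last_sent=None):
    """Return (first, last) ranges of icmp_seq values that never got a reply.

    last_sent is the highest sequence number transmitted, taken from the
    "packets transmitted" summary; without it, loss after the last reply
    cannot be detected.
    """
    seen = set()
    for line in lines:
        m = SEQ_RE.search(line)
        if m:
            seen.add(int(m.group(1)))
    if not seen:
        return []
    upper = max(seen) if last_sent is None else last_sent
    gaps, start = [], None
    for seq in range(min(seen), upper + 1):
        if seq not in seen:
            if start is None:
                start = seq          # a gap opens here
        elif start is not None:
            gaps.append((start, seq - 1))  # the gap closed on the previous seq
            start = None
    if start is not None:
        gaps.append((start, upper))
    return gaps
```

For the excerpt above, passing the three reply lines with `last_sent=150` yields the single range 117 to 150, which matches the 34 lost packets in the summary line.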

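Further analysis

Based on the analysis above, the cause was essentially confirmed as packet loss on the RAC private network, which led to one node's host rebooting. But why were packets being lost? With no problem found in the host network configuration, we had to ask the network engineers for help.

The network specialists captured packets on the network and found the following.

Several observations, quoted from their reply email:

1. At 4:02:09, 192.168.100.1 sent ping requests from the e4cc NIC, but the replies from 192.168.100.2 never reached e4cc;

2. The ping requests sent by 192.168.100.2 did not arrive at the e4cc NIC of 192.168.100.1, yet host 192.168.100.1 must have received them, because the replies from 192.168.100.1 to 192.168.100.2 were seen on the e4cc NIC;

3. At 4:02:41, the e474 NIC of 192.168.100.2 answered 192.168.100.1 with Destination unreachable (Port unreachable). At that point 192.168.100.2 could send packets back normally, and after a period of adjustment the network recovered from 4:02:53 onward.

In detail, this can be understood as follows:

1. Taking the packet loss seen on host 2 as an example, seq 9 through seq 41 were lost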

64 bytes from 192.168.100.1: icmp_seq=7 ttl=64 time=0.170 ms

64 bytes from 192.168.100.1: icmp_seq=8 ttl=64 time=0.376 ms

64 bytes from 192.168.100.1: icmp_seq=42 ttl=64 time=0.151 ms

64 bytes from 192.168.100.1: icmp_seq=43 ttl=64 time=0.340 ms

2. Host 2 did send the seq 9 request

04:02:09.284929 00:1b:21:c1:e4:74 > 00:1b:21:c1:e4:cc, ethertype IPv4 (0x0800), length 98: (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto: ICMP(1), length: 84) 192.168.100.2 > 192.168.100.1: ICMP echo request, id 59655, seq 9, length 64

04:02:10.284885 00:1b:21:c1:e4:74 > 00:1b:21:c1:e4:cc, ethertype IPv4 (0x0800), length 98: (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto: ICMP(1), length: 84) 192.168.100.2 > 192.168.100.1: ICMP echo request, id 59655, seq 10, length 64

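3. NE401 captured host 1's reply to seq 9, but not the request itself (was it forwarded through the other NE40??)

4. The seq 9 reply packet either never reached host 2, or it reached host 2 and host 2 failed to receive it properly (no capture of reply packets was deployed on the host 2 side, so this point cannot be confirmed)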
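The request/reply matching described in the four points above can also be done mechanically on a capture. The following sketch assumes `tcpdump` text output in the format shown in this post and pairs each ICMP echo request with its reply by address pair, id, and seq; `unanswered_requests` is a hypothetical helper name.

```python
import re

# Matches the ICMP part of a tcpdump line, for example:
# 192.168.100.2 > 192.168.100.1: ICMP echo request, id 59655, seq 9, length 64
ICMP_RE = re.compile(
    r"(?P<src>\d+\.\d+\.\d+\.\d+) > (?P<dst>\d+\.\d+\.\d+\.\d+): "
    r"ICMP echo (?P<kind>request|reply), id (?P<id>\d+),\s*seq (?P<seq>\d+)"
)

def unanswered_requests(lines):
    """Return sorted (src, dst, id, seq) tuples for requests with no reply in the capture."""
    requests, replies = set(), set()
    for line in lines:
        m = ICMP_RE.search(line)
        if not m:
            continue
        ident, seq = int(m.group("id")), int(m.group("seq"))
        if m.group("kind") == "request":
            requests.add((m.group("src"), m.group("dst"), ident, seq))
        else:
            # A reply travels in the opposite direction, so swap src and dst
            # to get the key of the request it answers.
            replies.add((m.group("dst"), m.group("src"), ident, seq))
    return sorted(requests - replies)
```

Running it separately on the captures from each observation point shows at which hop the request or the reply disappears, since a packet that is present in one capture and absent in the next was lost in between.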

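Further packet captures showed that the hosts' standby NIC was continuously sending ARP update requests. These packets interfere with MAC address learning on the layer 2 network and make the learned address switch back and forth frequently, which in extreme cases causes packet loss. The recommendation was to confirm what this traffic is for and to turn it off if that can be done without affecting the business.

Solution:

We shut down one of the switch ports connected to the host, so that the host, the switch, and the firewall were connected in a square ("口"-shaped) topology. With that change the ARP requests were no longer sent, and the node reboots never recurred. Afterwards we carefully checked the NIC bonding settings. The RAC nodes of another system have the same configuration and use the same switches, yet show no similar problem. So although the problem was solved, the underlying root cause remains unknown.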
