HP-UX: Routine Hardware Fault Handling on HP Midrange Servers, Case 2

Host model: ia64 HP Superdome server SD32A

Storage model: XP24000

Software versions: HP-UX 11.31 + Oracle 11g

LED indication: Attention LED lit red

Symptom: the Cell 2 indicator LEDs on the panel are off, and Cell 2 has failed.

Logs:

Log Entry 39897: 06/02/2013 08:35:49
Alert level 5: Critical
Keyword: PD_ERROR_REACHABLE_SET
The cell is not able to reach all requested cells through the fabric.
Reporting Entity: System Firmware located in cabinet 0, slot 1, cpu 0
Actual Data: 0x0000000000000053
0xa380020310e01800 0x0000000000000053
0xab00020310e01801 0x0100000051ab03e5

Log Entry 39896: 06/02/2013 08:35:49
Alert level 3: Warning
Keyword: ERM
The Error Response Mode has been determined
Reporting Entity: System Firmware located in cabinet 0, slot 0, cpu 0
Text Message: "CONTINUE"
0x698001ee00e017fe 0x45554e49544e4f43
0x6b0001ee00e017ff 0x0100000051ab03e5
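The MP event log entries above follow a regular layout: an entry header, an alert level, a keyword, an optional text message and raw data words. The following is a minimal Python sketch, assuming the log has been copied out of the MP into a plain-text string. It pulls out each entry's alert level and keyword so that Critical entries stand out. `LogEntry`, `parse_entries` and `decode_text_word` are hypothetical helpers, not part of any HP tool. The decoding step rests on a single observation from entry 39896: the data word `0x45554e49544e4f43`, read as eight ASCII bytes with the least significant byte first, spells `CONTINUE`, which matches that entry's Text Message. Whether other entries use the same encoding is an assumption.

```python
import re
from dataclasses import dataclass

# Header line of each entry, e.g. "Log Entry 39897: 06/02/2013 08:35:49"
ENTRY_RE = re.compile(r"^Log Entry (\d+): (\S+ \S+)$")
# e.g. "Alert level 5: Critical"
ALERT_RE = re.compile(r"^Alert level (\d+): (\w+)$")


@dataclass
class LogEntry:  # hypothetical record type, not an HP data structure
    number: int
    timestamp: str
    level: int = 0
    severity: str = ""
    keyword: str = ""
    text: str = ""


def parse_entries(log_text: str) -> list[LogEntry]:
    """Split copied MP event-log text into LogEntry records."""
    entries: list[LogEntry] = []
    for raw in log_text.splitlines():
        line = raw.strip()
        if m := ENTRY_RE.match(line):
            entries.append(LogEntry(int(m.group(1)), m.group(2)))
        elif not entries:
            continue  # ignore anything before the first entry header
        elif m := ALERT_RE.match(line):
            entries[-1].level, entries[-1].severity = int(m.group(1)), m.group(2)
        elif line.startswith("Keyword: "):
            entries[-1].keyword = line[len("Keyword: "):]
        elif line.startswith("Text Message: "):
            entries[-1].text = line[len("Text Message: "):].strip('"')
    return entries


def decode_text_word(word: str) -> str:
    """Read a 64-bit data word as 8 ASCII bytes, least significant byte first."""
    return int(word, 16).to_bytes(8, "little").decode("ascii")


# Abbreviated copy of the two entries above (most raw data lines omitted).
SAMPLE = """\
Log Entry 39897: 06/02/2013 08:35:49
Alert level 5: Critical
Keyword: PD_ERROR_REACHABLE_SET
Log Entry 39896: 06/02/2013 08:35:49
Alert level 3: Warning
Keyword: ERM
Text Message: "CONTINUE"
0x698001ee00e017fe 0x45554e49544e4f43
"""

if __name__ == "__main__":
    for entry in parse_entries(SAMPLE):
        if entry.level >= 5:  # level 5 is labelled Critical in this log
            print(entry.number, entry.keyword)
    print(decode_text_word("0x45554e49544e4f43"))
```

When run as a script, it prints `39897 PD_ERROR_REACHABLE_SET` and then `CONTINUE`.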

 

HW status for Cell 2 in cabinet 0: FAILURE DETECTED
Cell power Status: enabled, OFF, CRITICAL FAULT, NVRAM battery good
Boot is blocked; PDH shared memory is not-initialized
Cell Attention LED is off, PDH status LEDs: **__
Cell enabled by PDHC
Core cell is cabinet 0, cell 0
RIO cable status: unavailable
RIO cable connection physical location: cannot be determined

| * - CPU Modules   |                        |Cell Power Board |
| t - Terminators   |       Cell Board       |Converter Faults |
|Populated| Faulted | Power Converter Faults | MEM | JAB | IB  |
| 0 1 2 3 | 0 1 2 3 |CLK|L2C|LNK|CORE|FSB|48V|0 1 2| 0 1 |0 1 2|
+---------+---------+---+---+---+----+---+---+-----+-----+-----+
| * * * * |         |   |   |   |    |   |   |     |     |* *  |

| | Side: | A | B |
| DIMM Presence | Echlon: |0123456789ABCDEF|0123456789ABCDEF|
| | |******__________|******__________|

Cell Compatibility: Complex - B, Partition - C; CPU Compatibility: B
IPF System firmware rev 8.22
PDH controller firmware rev 15.16, time stamp: WED OCT 11 17:10:55 2006

 

Handling:

Based on the log (the portions marked in red), the CPB (cell power board) was judged to be faulty.

1. 2013-06-02, around 15:00: the replacement CPB arrived.

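2. 2013-06-02, around 20:30: the host was shut down and powered off, and the CPB was replaced. After the replacement the host was powered back on, but the Cell 2 LEDs were still off.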

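3. Checking the error information in the MP showed a fault on the Cell 2 board in the FSB column of the Power Converter Faults group. On that basis the FSB was judged to be at fault: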
| * - CPU Modules   |                        |Cell Power Board |
| t - Terminators   |       Cell Board       |Converter Faults |
|Populated| Faulted | Power Converter Faults | MEM | JAB | IB  |
| 0 1 2 3 | 0 1 2 3 |CLK|L2C|LNK|CORE|FSB|48V|0 1 2| 0 1 |0 1 2|
+---------+---------+---+---+---+----+---+---+-----+-----+-----+
| * * * * |         |   |   |   |    | * |   |     |     |     |
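The diagnosis in step 3 comes down to reading which column of the cell fault table contains a `*`. The following is a small Python sketch of that lookup. It assumes each data row still contains all eleven `|`-separated cells. The column names are copied from the table header, and `flagged_columns` is a hypothetical helper. It ignores the Populated group, where `*` only marks installed CPU modules. The length check rejects rows whose separators were lost in copying, because a missing `|` would shift every later column. Applied to the row from the first PS output (as captured, and assuming none of its separators were lost), it flags IB, which belongs to the Cell Power Board Converter Faults group. Applied to the row from step 3, it flags FSB.

```python
# Column groups of the cell fault table, left to right, from its header rows.
COLUMNS = ["Populated", "Faulted",
           "CLK", "L2C", "LNK", "CORE", "FSB", "48V",  # Power Converter Faults
           "MEM", "JAB", "IB"]                         # Cell Power Board Converter Faults

# Data row from the first PS output (before the CPB swap), as captured.
ROW_BEFORE = "| * * * * | | | | | | | | | |* * |"
# Data row from the PS output in step 3 (after the CPB swap).
ROW_AFTER = "| * * * * |         |   |   |   |    | * |   |     |     |     |"


def flagged_columns(row: str) -> list[str]:
    """Return the fault-column groups that contain a '*' in one table row."""
    cells = row.strip().strip("|").split("|")
    if len(cells) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} cells, got {len(cells)}")
    # A '*' under Populated only marks an installed CPU module, not a fault.
    return [name for name, cell in zip(COLUMNS, cells)
            if name != "Populated" and "*" in cell]


if __name__ == "__main__":
    print(flagged_columns(ROW_BEFORE))
    print(flagged_columns(ROW_AFTER))
```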


4. 2013-06-02, 21:30: the FSB component on the cell board was replaced, and the board was reinserted into Cell 2 of the host.

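5. After power-on, the Cell 2 indicator LEDs were normal. Once the operating system was up, the memory and CPUs on Cell 2 were recognized and Cell 2 reported normal status: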

[Cell]
                        CPU     Memory                                Use
                        OK/     (GB)                          Core    On
Hardware   Actual       Deconf/ OK/                           Cell    Next Par
Location   Usage        Max     Deconf    Connected To        Capable Boot Num
========== ============ ======= ========= =================== ======= ==== ===
cab0,cell0 Active Core  4/0/4   16.0/0.0  cab0,bay1,chassis3  yes     yes  0
cab0,cell1 Active Base  4/0/4   12.0/0.0  cab0,bay0,chassis1  no      yes  0
cab0,cell2 Active Base  4/0/4   12.0/0.0  -                   no      yes  0
cab0,cell3 Absent *     -       -         -                   -       -    -
cab0,cell4 Active Base  4/0/4   8.0/0.0   cab0,bay1,chassis1  yes     yes  0
cab0,cell5 Absent *     -       -         -                   -       -    -
cab0,cell6 Active Base  4/0/4   8.0/0.0   cab0,bay0,chassis3  no      yes  0
cab0,cell7 Absent *     -       -         -                   -       -    -
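Step 5 confirms the repair by checking that Cell 2 is Active with no deconfigured CPUs or memory. The cell table appears to be `parstatus -C` output from the nPartition tools. On that assumption, the following Python sketch parses the cell rows and lists any Active cell that reports deconfigured CPUs or memory. `CellStatus`, `parse_cells` and `needs_attention` are hypothetical names. The parser relies only on the whitespace-separated field order of the rows shown above, not on column alignment.

```python
from typing import NamedTuple


class CellStatus(NamedTuple):
    usage: str                # "Active", "Absent", ...
    cpu_ok: int = 0
    cpu_deconf: int = 0
    mem_ok_gb: float = 0.0
    mem_deconf_gb: float = 0.0


# Cell rows copied from the output above (alignment does not matter here).
SAMPLE = """\
cab0,cell0 Active Core 4/0/4 16.0/0.0 cab0,bay1,chassis3 yes yes 0
cab0,cell1 Active Base 4/0/4 12.0/0.0 cab0,bay0,chassis1 no yes 0
cab0,cell2 Active Base 4/0/4 12.0/0.0 - no yes 0
cab0,cell3 Absent * - - - - - -
cab0,cell4 Active Base 4/0/4 8.0/0.0 cab0,bay1,chassis1 yes yes 0
cab0,cell5 Absent * - - - - - -
cab0,cell6 Active Base 4/0/4 8.0/0.0 cab0,bay0,chassis3 no yes 0
cab0,cell7 Absent * - - - - - -
"""


def parse_cells(text: str) -> dict[str, CellStatus]:
    cells: dict[str, CellStatus] = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields or not fields[0].startswith("cab"):
            continue  # header lines, the ==== rule, blank lines
        location, usage = fields[0], fields[1]
        if usage != "Active":
            cells[location] = CellStatus(usage)
            continue
        # fields[2] is Core/Base, fields[3] is CPU OK/Deconf/Max,
        # fields[4] is memory OK/Deconf in GB
        cpu_ok, cpu_deconf, _cpu_max = (int(x) for x in fields[3].split("/"))
        mem_ok, mem_deconf = (float(x) for x in fields[4].split("/"))
        cells[location] = CellStatus(usage, cpu_ok, cpu_deconf, mem_ok, mem_deconf)
    return cells


def needs_attention(cells: dict[str, CellStatus]) -> list[str]:
    """Active cells that report any deconfigured CPU or memory."""
    return [loc for loc, c in cells.items()
            if c.usage == "Active" and (c.cpu_deconf > 0 or c.mem_deconf_gb > 0)]


if __name__ == "__main__":
    cells = parse_cells(SAMPLE)
    print(cells["cab0,cell2"])
    print(needs_attention(cells))
```

With the rows above, Cell 2 parses as Active with 4 working CPUs and 12.0 GB of memory, and `needs_attention` returns an empty list.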

 

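Summary: The first log alert in this case, which pointed to the CPB, was a false lead. The actual cause was a failed FSB component on the cell board. (FSB is listed under Power Converter Faults in the cell status output. It acts roughly like a resistor that steps the voltage down.)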
