Host model: ia64 HP Superdome server SD32A
Storage model: XP24000
Software versions: HP-UX 11.31 + Oracle 11g
LED display: attention LED is red
Fault symptom: the cell 2 panel indicator is not lit; cell 2 has failed
Log:
Log Entry 39897: 06/02/2013 08:35:49
Alert level 5: Critical
Keyword: PD_ERROR_REACHABLE_SET
The cell is not able to reach all requested cells through the fabric.
Reporting Entity: System Firmware located in cabinet 0, slot 1, cpu 0
Actual Data: 0x0000000000000053
0xa380020310e01800 0x0000000000000053
0xab00020310e01801 0x0100000051ab03e5
MP:VWR (,,+,-,?,F,L,J,D,K,R,T,A,C,U,^B) >
Log Entry 39896: 06/02/2013 08:35:49
Alert level 3: Warning
Keyword: ERM
The Error Response Mode has been determined
Reporting Entity: System Firmware located in cabinet 0, slot 0, cpu 0
Text Message: "CONTINUE"
0x698001ee00e017fe 0x45554e49544e4f43
0x6b0001ee00e017ff 0x0100000051ab03e5
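The raw data words in the entries above can be cross-checked against the readable fields. The sketch below is only an observation that fits this log, not a documented description of the firmware log format: it assumes the text word holds 8 ASCII bytes stored least-significant byte first, and that the low 32 bits of the second word are seconds since the Unix epoch, read as UTC. The function names are hypothetical.

```python
from datetime import datetime, timezone


def decode_text_word(word: int) -> str:
    """Read a 64-bit word as 8 ASCII bytes, least-significant byte first (assumption)."""
    return word.to_bytes(8, "little").decode("ascii")


def decode_timestamp_word(word: int) -> datetime:
    """Read the low 32 bits as Unix epoch seconds in UTC (assumption)."""
    return datetime.fromtimestamp(word & 0xFFFFFFFF, tz=timezone.utc)


# Data words copied from Log Entry 39896 above.
text = decode_text_word(0x45554E49544E4F43)
stamp = decode_timestamp_word(0x0100000051AB03E5)

print(text)                                  # CONTINUE
print(stamp.strftime("%m/%d/%Y %H:%M:%S"))   # 06/02/2013 08:35:49
```

Under these assumptions both values agree with the entry itself: the text message "CONTINUE" and the entry time 06/02/2013 08:35:49.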
HW status for Cell 2 in cabinet 0: FAILURE DETECTED
Cell power Status: enabled, OFF, CRITICAL FAULT, NVRAM battery good
Boot is blocked; PDH shared memory is not-initialized
Cell Attention LED is off, PDH status LEDs: **__
Cell enabled by PDHC
Core cell is cabinet 0, cell 0
RIO cable status: unavailable
RIO cable connection physical location: cannot be determined
| * - CPU Modules | |Cell Power Board |
| t - Terminators | Cell Board |Converter Faults |
|Populated| Faulted | Power Converter Faults | MEM | JAB | IB |
| 0 1 2 3 | 0 1 2 3 |CLK|L2C|LNK|CORE|FSB|48V|0 1 2| 0 1 |0 1 2|
+---------+---------+---+---+---+----+---+---+-----+-----+-----+
| * * * * | | | | | | | | | |* * |
| | Side: | A | B |
| DIMM Presence | Echlon: |0123456789ABCDEF|0123456789ABCDEF|
| | |******__________|******__________|
Cell Compatibility: Complex - B, Partition - C; CPU Compatibility: B
IPF System firmware rev 8.22
PDH controller firmware rev 15.16, time stamp: WED OCT 11 17:10:55 2006
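Actions taken:
Based on the log (the parts highlighted in red in the original), the CPB (cell power board) was judged to be faulty.
1. 2013-06-02, around 15:00: the CPB spare part arrived.
2. 2013-06-02, around 20:30: the host was shut down and powered off, and the CPB was replaced. After power was restored, the Cell 2 LED indicator was still not lit.
3. Logged in to the MP to check the error information and found a fault flagged under FSB for the Cell 2 board. Based on the status output below, the FSB was judged to be faulty: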
| * - CPU Modules | |Cell Power Board |
| t - Terminators | Cell Board |Converter Faults |
|Populated| Faulted | Power Converter Faults | MEM | JAB | IB |
| 0 1 2 3 | 0 1 2 3 |CLK|L2C|LNK|CORE|FSB|48V|0 1 2| 0 1 |0 1 2|
+---------+---------+---+---+---+----+---+---+-----+-----+-----+
| * * * * | | | | | | * | | | | |
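4. 2013-06-02, 21:30: replaced the FSB component on the cell board and reinstalled the board in host Cell 2.
5. Powered on. The host Cell 2 indicator was now normal. After booting into the operating system, the memory and CPUs on Cell 2 were recognized and Cell 2 was reported as normal, as shown below: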
[Cell]
CPU Memory Use
OK/ (GB) Core On
Hardware Actual Deconf/ OK/ Cell Next Par
Location Usage Max Deconf Connected To Capable Boot Num
========== ============ ======= ========= =================== ======= ==== ===
cab0,cell0 Active Core 4/0/4 16.0/0.0 cab0,bay1,chassis3 yes yes 0
cab0,cell1 Active Base 4/0/4 12.0/0.0 cab0,bay0,chassis1 no yes 0
cab0,cell2 Active Base 4/0/4 12.0/0.0 - no yes 0
cab0,cell3 Absent * - - - - - -
cab0,cell4 Active Base 4/0/4 8.0/0.0 cab0,bay1,chassis1 yes yes 0
cab0,cell5 Absent * - - - - - -
cab0,cell6 Active Base 4/0/4 8.0/0.0 cab0,bay0,chassis3 no yes 0
cab0,cell7 Absent * - - - - - -
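To confirm at a glance that every active cell came back with all CPUs and memory configured, the cell table above can be summarized with a short script. This is a minimal sketch that assumes the whitespace-separated column layout shown above (location, usage, CPU OK/Deconf/Max, memory OK/Deconf, and so on). The names `SAMPLE` and `parse_cell_line` are hypothetical, and the rows are copied from the listing above.

```python
SAMPLE = """\
cab0,cell0 Active Core 4/0/4 16.0/0.0 cab0,bay1,chassis3 yes yes 0
cab0,cell1 Active Base 4/0/4 12.0/0.0 cab0,bay0,chassis1 no yes 0
cab0,cell2 Active Base 4/0/4 12.0/0.0 - no yes 0
cab0,cell3 Absent * - - - - - -
cab0,cell4 Active Base 4/0/4 8.0/0.0 cab0,bay1,chassis1 yes yes 0
cab0,cell5 Absent * - - - - - -
cab0,cell6 Active Base 4/0/4 8.0/0.0 cab0,bay0,chassis3 no yes 0
cab0,cell7 Absent * - - - - - -
"""


def parse_cell_line(line: str) -> dict:
    """Split one cell row into fields; absent cells carry only location and state."""
    fields = line.split()
    cell = {"location": fields[0], "state": fields[1]}
    if cell["state"] == "Active":
        cpu_ok, cpu_deconf, cpu_max = (int(x) for x in fields[3].split("/"))
        mem_ok, mem_deconf = (float(x) for x in fields[4].split("/"))
        cell.update(cpu_ok=cpu_ok, cpu_deconf=cpu_deconf, cpu_max=cpu_max,
                    mem_ok_gb=mem_ok, mem_deconf_gb=mem_deconf)
    return cell


cells = [parse_cell_line(line) for line in SAMPLE.splitlines()]
active = [c for c in cells if c["state"] == "Active"]

# Flag any active cell with deconfigured CPUs or memory.
degraded = [c["location"] for c in active
            if c["cpu_deconf"] > 0 or c["mem_deconf_gb"] > 0]

print(len(active), sum(c["mem_ok_gb"] for c in active), degraded)
```

For the listing above, `degraded` is empty, which matches the observation that Cell 2 is fully recognized after the repair.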
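Summary: the first log alarm in this incident was misleading. The actual cause was a failed FSB component on the cell board (the FSB part behaves somewhat like a resistor and steps the voltage down).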