Memory ECC Errors

Purpose

ECC memory errors were found in dmesg.
Identify the physical location of the faulty DIMMs.

dmesg output

[    4.745351] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
[    4.745359] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
[    5.746989] EDAC MC0: 27609 CE memory read error on CPU_SrcID#0_Channel#1_DIMM#0 (channel:1 slot:0 page:0x105649c offset:0x6c0 grain:32 syndrome:0x0 -  OVERFLOW area:DRAM err_code:0001:0093 socket:0 channel_mask:2 rank:1)
[    5.747001] EDAC MC0: 23245 CE memory scrubbing error on CPU_SrcID#0_Channel#0_DIMM#0 (channel:0 slot:0 page:0x105649e offset:0x0 grain:32 syndrome:0x0 -  OVERFLOW area:DRAM err_code:0008:00c3 socket:0 channel_mask:1 rank:1)
[  300.644412] mce: [Hardware Error]: Machine check events logged
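The counted EDAC lines above carry both the corrected-error (CE) count and the DIMM label. A minimal sketch for pulling out just those two fields is shown here; the function name `edac_ce_summary` is hypothetical, and the pattern assumes the line layout seen in this log (other EDAC drivers may format their messages differently).

```shell
# Summarize EDAC corrected-error (CE) lines read from stdin.
# Prints "<count> <DIMM label>" for every "EDAC MCn: <count> CE ... on <label> (" line.
# Lines such as "EDAC sbridge MC0: HANDLING MCE MEMORY ERROR" are skipped.
edac_ce_summary() {
    sed -n -E 's/.*EDAC MC[0-9]+: ([0-9]+) CE .* on ([^ ]+) \(.*/\1 \2/p'
}
```

It could be used as `dmesg | edac_ce_summary`.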

Getting the memory error counts

grep "[0-9]" /sys/devices/system/edac/mc/mc*/csrow*/ch*_ce_count
/sys/devices/system/edac/mc/mc0/csrow0/ch0_ce_count:23245   <- ECC errors: DIMM 0, channel 0, branch 0
/sys/devices/system/edac/mc/mc0/csrow0/ch1_ce_count:27609   <- ECC errors: DIMM 0, channel 1, branch 0
/sys/devices/system/edac/mc/mc0/csrow0/ch2_ce_count:0
/sys/devices/system/edac/mc/mc0/csrow0/ch3_ce_count:0
/sys/devices/system/edac/mc/mc1/csrow0/ch0_ce_count:0
/sys/devices/system/edac/mc/mc1/csrow0/ch1_ce_count:0
/sys/devices/system/edac/mc/mc1/csrow0/ch2_ce_count:0
/sys/devices/system/edac/mc/mc1/csrow0/ch3_ce_count:0
/sys/devices/system/edac/mc/mc1/csrow1/ch0_ce_count:0
/sys/devices/system/edac/mc/mc1/csrow1/ch1_ce_count:0
/sys/devices/system/edac/mc/mc1/csrow1/ch2_ce_count:0
/sys/devices/system/edac/mc/mc1/csrow1/ch3_ce_count:0
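Since most counters are zero, it can help to print only the non-zero ones. The following is a sketch under the assumption that the host exposes the same csrow-style sysfs layout as above; the function name `nonzero_ce_counts` and its optional root argument are hypothetical, added so the path can be overridden.

```shell
# List only the EDAC counters with a non-zero corrected-error count.
# $1: EDAC sysfs root (defaults to the path used above).
nonzero_ce_counts() {
    local root="${1:-/sys/devices/system/edac/mc}"
    local f n
    for f in "$root"/mc*/csrow*/ch*_ce_count; do
        [ -r "$f" ] || continue          # skip unmatched globs and unreadable files
        n=$(cat "$f")
        # Print "<relative path>:<count>" only when the count is a positive integer.
        [ "$n" -gt 0 ] 2>/dev/null && printf '%s:%s\n' "${f#"$root"/}" "$n"
    done
    return 0
}
```

Calling `nonzero_ce_counts` with no argument reads the live counters.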

Reference documentation

# yum install -y kernel-doc
Loaded plugins: fastestmirror, langpacks
Loading mirror speeds from cached hostfile
 * updates: mirrors.sh.vclound.com
Resolving Dependencies
--> Running transaction check
---> Package kernel-doc.noarch 0:3.10.0-514.6.2.el7 will be installed
--> Finished Dependency Resolution

Dependencies Resolved

==============================================================================================================================
 Package                      Arch                     Version                                Repository                 Size
==============================================================================================================================
Installing:
 kernel-doc                   noarch                   3.10.0-514.6.2.el7                     updates                    15 M

Transaction Summary
==============================================================================================================================
Install  1 Package

Total download size: 15 M
Installed size: 48 M
Downloading packages:
kernel-doc-3.10.0-514.6.2.el7.noarch.rpm                                                               |  15 MB  00:00:00
Running transaction check
Running transaction test
Transaction test succeeded
Running transaction
  Installing : kernel-doc-3.10.0-514.6.2.el7.noarch                                                                        1/1
  Verifying  : kernel-doc-3.10.0-514.6.2.el7.noarch                                                                        1/1

Installed:
  kernel-doc.noarch 0:3.10.0-514.6.2.el7

Complete!

Refer to the following excerpt

 vim /usr/share/doc/kernel-doc-3.10.0/Documentation/edac.txt

                Channel 0       Channel 1
        ===================================
        csrow0  | DIMM_A0       | DIMM_B0 |
        csrow1  | DIMM_A0       | DIMM_B0 |
        ===================================

        ===================================
        csrow2  | DIMM_A1       | DIMM_B1 |
        csrow3  | DIMM_A1       | DIMM_B1 |
        ===================================

From the above, memory is split into two controller groups, mc0 and mc1.
In group mc0, the first and second DIMMs report ECC errors,
that is, DIMM 0 on channel 0 and DIMM 0 on channel 1.
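The reading above can be sketched as a small helper that turns a counter path into a bank-locator style string. The mapping used here (mcN to BRANCH N, csrowN to DIMM N, chN to CHANNEL N) is an assumption inferred from this host's output only and may not hold on other platforms or EDAC drivers; the function name `edac_to_bank_locator` is hypothetical.

```shell
# Translate an EDAC counter path into a "BRANCH x CHANNEL y DIMM z" string.
# ASSUMPTION (inferred from this host only): mcN -> BRANCH N,
# csrowN -> DIMM N, chN -> CHANNEL N. Verify against dmesg before relying on it.
edac_to_bank_locator() {
    printf '%s\n' "$1" |
        sed -n -E 's#.*mc([0-9]+)/csrow([0-9]+)/ch([0-9]+)_ce_count.*#BRANCH \1 CHANNEL \3 DIMM \2#p'
}
```

For example, `edac_to_bank_locator /sys/devices/system/edac/mc/mc0/csrow0/ch1_ce_count` would yield a string that can be compared with the Bank Locator field reported by dmidecode.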

Locating the DIMMs

 dmidecode -t memory |  grep -E 'Memory Device|Size:|Locator'
Memory Device
        Size: 16384 MB
        Locator: DIMM000
        Bank Locator: BRANCH 0 CHANNEL 0 DIMM 0         <- faulty
Memory Device
        Size: No Module Installed
        Locator: DIMM001
        Bank Locator: BRANCH 0 CHANNEL 0 DIMM 1       
Memory Device
        Size: 16384 MB
        Locator: DIMM010
        Bank Locator: BRANCH 0 CHANNEL 1 DIMM 0       <- faulty
Memory Device
        Size: No Module Installed
        Locator: DIMM011
        Bank Locator: BRANCH 0 CHANNEL 1 DIMM 1
Memory Device
        Size: 16384 MB
        Locator: DIMM020
        Bank Locator: BRANCH 0 CHANNEL 2 DIMM 0
Memory Device
        Size: No Module Installed
        Locator: DIMM021
        Bank Locator: BRANCH 0 CHANNEL 2 DIMM 1
Memory Device
        Size: 16384 MB
        Locator: DIMM030
        Bank Locator: BRANCH 0 CHANNEL 3 DIMM 0
Memory Device
        Size: No Module Installed
        Locator: DIMM031
        Bank Locator: BRANCH 0 CHANNEL 3 DIMM 1
Memory Device
        Size: 16384 MB
        Locator: DIMM100
        Bank Locator: BRANCH 1 CHANNEL 0 DIMM 0
Memory Device
        Size: 16384 MB
        Locator: DIMM101
        Bank Locator: BRANCH 1 CHANNEL 0 DIMM 1
Memory Device
        Size: 16384 MB
        Locator: DIMM110
        Bank Locator: BRANCH 1 CHANNEL 1 DIMM 0
Memory Device
        Size: 16384 MB
        Locator: DIMM111
        Bank Locator: BRANCH 1 CHANNEL 1 DIMM 1
Memory Device
        Size: 16384 MB
        Locator: DIMM120
        Bank Locator: BRANCH 1 CHANNEL 2 DIMM 0
Memory Device
        Size: 16384 MB
        Locator: DIMM121
        Bank Locator: BRANCH 1 CHANNEL 2 DIMM 1
Memory Device
        Size: 16384 MB
        Locator: DIMM130
        Bank Locator: BRANCH 1 CHANNEL 3 DIMM 0
Memory Device
        Size: 16384 MB
        Locator: DIMM131
        Bank Locator: BRANCH 1 CHANNEL 3 DIMM 1
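Rather than scanning the listing by eye, the slot name for a given bank locator can be looked up with awk. This is a sketch that assumes each Memory Device prints its Locator line before its Bank Locator line, as in the output above; the function name `locator_for_bank` is hypothetical.

```shell
# Print the slot "Locator" whose "Bank Locator" equals $1.
# Reads `dmidecode -t memory` output on stdin.
locator_for_bank() {
    awk -v want="$1" '
        # Remember the most recent "Locator:" value (not "Bank Locator:").
        /^[[:space:]]*Locator:/ {
            sub(/^[[:space:]]*Locator:[[:space:]]*/, "")
            sub(/[[:space:]]+$/, "")
            loc = $0
        }
        # When the bank locator matches, print the remembered slot name.
        /^[[:space:]]*Bank Locator:/ {
            sub(/^[[:space:]]*Bank Locator:[[:space:]]*/, "")
            sub(/[[:space:]]+$/, "")
            if ($0 == want) print loc
        }
    '
}
```

It could be used as `dmidecode -t memory | locator_for_bank "BRANCH 0 CHANNEL 1 DIMM 0"`.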