Purpose
ECC memory errors showed up in dmesg.
Identify which memory modules are faulty.
dmesg output
[ 4.745351] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
[ 4.745359] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
[ 5.746989] EDAC MC0: 27609 CE memory read error on CPU_SrcID#0_Channel#1_DIMM#0 (channel:1 slot:0 page:0x105649c offset:0x6c0 grain:32 syndrome:0x0 - OVERFLOW area:DRAM err_code:0001:0093 socket:0 channel_mask:2 rank:1)
[ 5.747001] EDAC MC0: 23245 CE memory scrubbing error on CPU_SrcID#0_Channel#0_DIMM#0 (channel:0 slot:0 page:0x105649e offset:0x0 grain:32 syndrome:0x0 - OVERFLOW area:DRAM err_code:0008:00c3 socket:0 channel_mask:1 rank:1)
[ 300.644412] mce: [Hardware Error]: Machine check events logged
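The fields that matter in the EDAC lines above are the error count and the channel/slot pair. A minimal sketch for pulling them out of dmesg, assuming the "EDAC MCn: <count> CE ... (channel:N slot:N ...)" layout shown here (the function name edac_ce_summary is made up for this example, and other EDAC drivers may format the line differently):

```shell
# Summarise EDAC corrected-error (CE) lines read from stdin.
# Assumes the sb_edac-style message layout seen in the dmesg output above.
# Prints one "mc=<n> count=<n> channel=<n> slot=<n>" line per CE message;
# lines that do not match (e.g. "EDAC sbridge MC0: HANDLING ...") are skipped.
edac_ce_summary() {
    sed -n 's/.*EDAC MC\([0-9]*\): \([0-9][0-9]*\) CE .*(channel:\([0-9]*\) slot:\([0-9]*\) .*/mc=\1 count=\2 channel=\3 slot=\4/p'
}
```

Typical use would be: dmesg | edac_ce_summary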
Getting the memory error counts
grep "[0-9]" /sys/devices/system/edac/mc/mc*/csrow*/ch*_ce_count
/sys/devices/system/edac/mc/mc0/csrow0/ch0_ce_count:23245 <- ECC errors: dimm 0, channel 0, branch 0
/sys/devices/system/edac/mc/mc0/csrow0/ch1_ce_count:27609 <- ECC errors: dimm 0, channel 1, branch 0
/sys/devices/system/edac/mc/mc0/csrow0/ch2_ce_count:0
/sys/devices/system/edac/mc/mc0/csrow0/ch3_ce_count:0
/sys/devices/system/edac/mc/mc1/csrow0/ch0_ce_count:0
/sys/devices/system/edac/mc/mc1/csrow0/ch1_ce_count:0
/sys/devices/system/edac/mc/mc1/csrow0/ch2_ce_count:0
/sys/devices/system/edac/mc/mc1/csrow0/ch3_ce_count:0
/sys/devices/system/edac/mc/mc1/csrow1/ch0_ce_count:0
/sys/devices/system/edac/mc/mc1/csrow1/ch1_ce_count:0
/sys/devices/system/edac/mc/mc1/csrow1/ch2_ce_count:0
/sys/devices/system/edac/mc/mc1/csrow1/ch3_ce_count:0
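Most counters in a listing like this are zero, so it can help to print only the ones that are not. A minimal sketch, assuming the same csrow-based sysfs layout as above (the function name edac_nonzero_ce and its optional root argument are made up for this example):

```shell
# Print "<path>:<count>" for every corrected-error counter that is non-zero.
# $1: EDAC sysfs root; defaults to the path used in the grep above.
edac_nonzero_ce() {
    root=${1:-/sys/devices/system/edac/mc}
    for f in "$root"/mc*/csrow*/ch*_ce_count; do
        [ -r "$f" ] || continue            # skip unmatched globs / unreadable files
        n=$(cat "$f")
        [ "$n" -gt 0 ] 2>/dev/null && printf '%s:%s\n' "$f" "$n"
    done
    return 0
}
```

Calling edac_nonzero_ce with no argument reads the live counters; passing a directory makes it easy to try against a copied tree.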
Reference documentation
# yum install -y kernel-doc
Loaded plugins: fastestmirror, langpacks
Loading mirror speeds from cached hostfile
* updates: mirrors.sh.vclound.com
Resolving Dependencies
--> Running transaction check
---> Package kernel-doc.noarch 0:3.10.0-514.6.2.el7 will be installed
--> Finished Dependency Resolution
Dependencies Resolved
==============================================================================================================================
Package Arch Version Repository Size
==============================================================================================================================
Installing:
kernel-doc noarch 3.10.0-514.6.2.el7 updates 15 M
Transaction Summary
==============================================================================================================================
Install 1 Package
Total download size: 15 M
Installed size: 48 M
Downloading packages:
kernel-doc-3.10.0-514.6.2.el7.noarch.rpm | 15 MB 00:00:00
Running transaction check
Running transaction test
Transaction test succeeded
Running transaction
Installing : kernel-doc-3.10.0-514.6.2.el7.noarch 1/1
Verifying : kernel-doc-3.10.0-514.6.2.el7.noarch 1/1
Installed:
kernel-doc.noarch 0:3.10.0-514.6.2.el7
Complete!
See the following excerpt from the EDAC documentation
vim /usr/share/doc/kernel-doc-3.10.0/Documentation/edac.txt
         Channel 0 Channel 1
===================================
csrow0 | DIMM_A0 | DIMM_B0 |
csrow1 | DIMM_A0 | DIMM_B0 |
===================================
===================================
csrow2 | DIMM_A1 | DIMM_B1 |
csrow3 | DIMM_A1 | DIMM_B1 |
===================================
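From the above, the memory is split into two groups, mc0 and mc1 (one per memory controller).
In group mc0, the first and second modules are reporting ECC errors,
that is, the DIMM 0 slots on channel 0 and channel 1.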
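The mapping used here can be written down as a small helper. This is only a sketch under an assumption taken from this machine: mcN corresponds to BRANCH N, csrowN to DIMM N, and chN to CHANNEL N, as in the annotations on the counter listing. Other boards and EDAC drivers may number things differently, and the function name edac_path_to_bank is made up for this example.

```shell
# Convert an EDAC counter path into a "BRANCH x CHANNEL y DIMM z" string.
# ASSUMPTION (specific to this host): mc -> BRANCH, ch -> CHANNEL, csrow -> DIMM.
# Prints nothing if the argument does not look like a ch*_ce_count path.
edac_path_to_bank() {
    printf '%s\n' "$1" |
        sed -n 's|.*/mc\([0-9]*\)/csrow\([0-9]*\)/ch\([0-9]*\)_ce_count.*|BRANCH \1 CHANNEL \3 DIMM \2|p'
}
```

For example, passing /sys/devices/system/edac/mc/mc0/csrow0/ch1_ce_count yields the string BRANCH 0 CHANNEL 1 DIMM 0.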
Finding the physical memory location
dmidecode -t memory | grep -E 'Memory Device|Size:|Locator'
Memory Device
Size: 16384 MB
Locator: DIMM000
Bank Locator: BRANCH 0 CHANNEL 0 DIMM 0 <- faulty
Memory Device
Size: No Module Installed
Locator: DIMM001
Bank Locator: BRANCH 0 CHANNEL 0 DIMM 1
Memory Device
Size: 16384 MB
Locator: DIMM010
Bank Locator: BRANCH 0 CHANNEL 1 DIMM 0 <- faulty
Memory Device
Size: No Module Installed
Locator: DIMM011
Bank Locator: BRANCH 0 CHANNEL 1 DIMM 1
Memory Device
Size: 16384 MB
Locator: DIMM020
Bank Locator: BRANCH 0 CHANNEL 2 DIMM 0
Memory Device
Size: No Module Installed
Locator: DIMM021
Bank Locator: BRANCH 0 CHANNEL 2 DIMM 1
Memory Device
Size: 16384 MB
Locator: DIMM030
Bank Locator: BRANCH 0 CHANNEL 3 DIMM 0
Memory Device
Size: No Module Installed
Locator: DIMM031
Bank Locator: BRANCH 0 CHANNEL 3 DIMM 1
Memory Device
Size: 16384 MB
Locator: DIMM100
Bank Locator: BRANCH 1 CHANNEL 0 DIMM 0
Memory Device
Size: 16384 MB
Locator: DIMM101
Bank Locator: BRANCH 1 CHANNEL 0 DIMM 1
Memory Device
Size: 16384 MB
Locator: DIMM110
Bank Locator: BRANCH 1 CHANNEL 1 DIMM 0
Memory Device
Size: 16384 MB
Locator: DIMM111
Bank Locator: BRANCH 1 CHANNEL 1 DIMM 1
Memory Device
Size: 16384 MB
Locator: DIMM120
Bank Locator: BRANCH 1 CHANNEL 2 DIMM 0
Memory Device
Size: 16384 MB
Locator: DIMM121
Bank Locator: BRANCH 1 CHANNEL 2 DIMM 1
Memory Device
Size: 16384 MB
Locator: DIMM130
Bank Locator: BRANCH 1 CHANNEL 3 DIMM 0
Memory Device
Size: 16384 MB
Locator: DIMM131
Bank Locator: BRANCH 1 CHANNEL 3 DIMM 1
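With this many slots it is easy to misread the listing, so the lookup from a Bank Locator string to the slot label can be scripted. A minimal sketch, assuming each Memory Device block prints its Locator line before its Bank Locator line as in the output above (the function name locator_for_bank is made up for this example):

```shell
# Print the Locator (slot label) whose Bank Locator equals $1.
# Reads `dmidecode -t memory` output on stdin.
locator_for_bank() {
    awk -v want="$1" '
        # Remember the most recent "Locator:" value (not "Bank Locator:").
        /^[ \t]*Locator:/ {
            sub(/^[ \t]*Locator:[ \t]*/, ""); sub(/[ \t]+$/, ""); loc = $0
        }
        # When the bank string matches, print the remembered slot label.
        /^[ \t]*Bank Locator:/ {
            sub(/^[ \t]*Bank Locator:[ \t]*/, ""); sub(/[ \t]+$/, "")
            if ($0 == want) print loc
        }
    '
}
```

Typical use would be: dmidecode -t memory | locator_for_bank 'BRANCH 0 CHANNEL 1 DIMM 0', which on the listing above would correspond to DIMM010.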