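1. Fault description
A user reported that a production system was unreachable. A colleague investigated immediately and confirmed that the system could not be accessed and that the server did not respond to ping.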
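2. Fault handling
Because clients could not ping the server, we had to check it in the machine room. The server hardware showed no alarms and the system had not rebooted. After logging in, ifconfig showed that the IP address was gone (eth0 did not exist). In the network configuration directory /etc/sysconfig/network-scripts, the interface file ifcfg-eth0 was missing; only an earlier backup, ifcfg-eth0.bak, and a file named ifcfg-peth0 were present. Following the principle of restoring service first and diagnosing afterwards, we copied the backup file back into place as ifcfg-eth0 and restarted the network service, which resolved the fault.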
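The recovery step above can be sketched as a small shell function. This is only an illustration under assumptions: the function name restore_ifcfg and its two arguments are hypothetical, and the original fix was done by hand with a plain copy.

```shell
# Restore a missing interface config from its .bak copy.
# restore_ifcfg and its arguments are hypothetical names for this sketch.
restore_ifcfg() {
    dir=$1      # e.g. /etc/sysconfig/network-scripts
    iface=$2    # e.g. eth0
    cfg="$dir/ifcfg-$iface"
    bak="$cfg.bak"

    # Nothing to do if the config file is still there.
    if [ -e "$cfg" ]; then
        echo "present: $cfg"
        return 0
    fi
    # Refuse to continue without a backup to copy from.
    if [ ! -f "$bak" ]; then
        echo "no backup: $bak" >&2
        return 1
    fi
    # -p keeps the mode and timestamps of the backup.
    cp -p "$bak" "$cfg" && echo "restored: $cfg"
}

# On the affected server (as root) this would be followed by a restart:
#   restore_ifcfg /etc/sysconfig/network-scripts eth0 && service network restart
```

Copying rather than renaming the backup keeps ifcfg-eth0.bak available in case the file disappears again during the investigation.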
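3. Fault analysis
3.1 We learned that when the fault occurred, a colleague was logged in to the system checking the security baseline configuration, but the colleague insisted that no rm or mv had been run against the interface file. The history command confirmed this: the only command run deliberately was chkconfig --list. However, the output that was meant to be copied had accidentally been pasted back into the shell, where each line was executed as a command. The history is as follows: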
883 chkconfig --list
884 NetworkManager 0:off 1:off 2:off 3:off 4:off 5:off 6:off
885 PowerIscsi 0:off 1:off 2:off 3:on 4:off 5:on 6:off
886 PowerMig 0:off 1:off 2:off 3:on 4:off 5:on 6:off
887 PowerMigRecoverAll 0:off 1:off 2:off 3:on 4:off 5:on 6:off
888 acpid 0:off 1:off 2:on 3:on 4:on 5:on 6:off
889 anacron 0:off 1:off 2:on 3:on 4:on 5:on 6:off
890 atd 0:off 1:off 2:off 3:on 4:on 5:on 6:off
891 auditd 0:off 1:off 2:on 3:on 4:on 5:on 6:off
892 autofs 0:off 1:off 2:off 3:on 4:on 5:on 6:off
893 avahi-daemon 0:off 1:off 2:off 3:on 4:on 5:on 6:off
894 avahi-dnsconfd 0:off 1:off 2:off 3:off 4:off 5:off 6:off
895 bluetooth 0:off 1:off 2:on 3:on 4:on 5:on 6:off
896 capi 0:off 1:off 2:off 3:off 4:off 5:off 6:off
897 conman 0:off 1:off 2:off 3:off 4:off 5:off 6:off
898 coremail 0:off 1:off 2:on 3:on 4:on 5:on 6:off
899 cpuspeed 0:off 1:on 2:on 3:on 4:on 5:on 6:off
900 crond 0:off 1:off 2:on 3:on 4:on 5:on 6:off
901 cups 0:off 1:off 2:on 3:on 4:on 5:on 6:off
902 dnsmasq 0:off 1:off 2:off 3:off 4:off 5:off 6:off
903 dund 0:off 1:off 2:off 3:off 4:off 5:off 6:off
904 ebtables 0:off 1:off 2:off 3:off 4:off 5:off 6:off
905 firstboot 0:off 1:off 2:off 3:on 4:off 5:on 6:off
906 gpm 0:off 1:off 2:on 3:on 4:on 5:on 6:off
907 haldaemon 0:off 1:off 2:off 3:on 4:on 5:on 6:off
908 hidd 0:off 1:off 2:on 3:on 4:on 5:on 6:off
909 hplip 0:off 1:off 2:on 3:on 4:on 5:on 6:off
910 httpd 0:off 1:off 2:off 3:off 4:off 5:off 6:off
911 ip6tables 0:off 1:off 2:on 3:on 4:on 5:on 6:off
912 ipmi 0:off 1:off 2:off 3:off 4:off 5:off 6:off
913 iptables 0:off 1:off 2:off 3:off 4:off 5:off 6:off
914 irda 0:off 1:off 2:off 3:off 4:off 5:off 6:off
915 irqbalance 0:off 1:off 2:on 3:on 4:on 5:on 6:off
916 iscsi 0:off 1:off 2:off 3:on 4:on 5:on 6:off
917 iscsid 0:off 1:off 2:off 3:on 4:on 5:on 6:off
918 isdn 0:off 1:off 2:on 3:on 4:on 5:on 6:off
919 kdump 0:off 1:off 2:off 3:off 4:off 5:off 6:off
920 kudzu 0:off 1:off 2:off 3:on 4:on 5:on 6:off
921 libvirt-guests 0:off 1:off 2:off 3:on 4:on 5:on 6:off
922 libvirtd 0:off 1:off 2:off 3:on 4:on 5:on 6:off
923 lvm2-monitor 0:off 1:on 2:on 3:on 4:on 5:on 6:off
924 mcstrans 0:off 1:off 2:on 3:on 4:on 5:on 6:off
925 mdmonitor 0:off 1:off 2:on 3:on 4:on 5:on 6:off
926 mdmpd 0:off 1:off 2:off 3:off 4:off 5:off 6:off
927 messagebus 0:off 1:off 2:off 3:on 4:on 5:on 6:off
928 microcode_ctl 0:off 1:off 2:on 3:on 4:on 5:on 6:off
929 multipathd 0:off 1:off 2:off 3:off 4:off 5:off 6:off
930 named 0:off 1:off 2:off 3:off 4:off 5:off 6:off
931 netbackup 0:off 1:off 2:on 3:on 4:off 5:on 6:off
932 netconsole 0:off 1:off 2:off 3:off 4:off 5:off 6:off
933 netfs 0:off 1:off 2:off 3:on 4:on 5:on 6:off
934 netplugd 0:off 1:off 2:off 3:off 4:off 5:off 6:off
935 network 0:off 1:off 2:on 3:on 4:on 5:on 6:off
936 nfs 0:off 1:off 2:off 3:off 4:off 5:off 6:off
937 nfslock 0:off 1:off 2:off 3:on 4:on 5:on 6:off
938 nscd 0:off 1:off 2:off 3:off 4:off 5:off 6:off
939 ntpd 0:off 1:off 2:off 3:off 4:off 5:off 6:off
940 pand 0:off 1:off 2:off 3:off 4:off 5:off 6:off
941 pcscd 0:off 1:off 2:on 3:on 4:on 5:on 6:off
942 portmap 0:off 1:off 2:off 3:on 4:on 5:on 6:off
943 psacct 0:off 1:off 2:off 3:off 4:off 5:off 6:off
944 rawdevices 0:off 1:off 2:off 3:on 4:on 5:on 6:off
945 rdisc 0:off 1:off 2:off 3:off 4:off 5:off 6:off
946 readahead_early 0:off 1:off 2:on 3:on 4:on 5:on 6:off
947 readahead_later 0:off 1:off 2:off 3:off 4:off 5:on 6:off
948 restorecond 0:off 1:off 2:on 3:on 4:on 5:on 6:off
949 rhnsd 0:off 1:off 2:off 3:on 4:on 5:on 6:off
950 rpcgssd 0:off 1:off 2:off 3:on 4:on 5:on 6:off
951 rpcidmapd 0:off 1:off 2:off 3:on 4:on 5:on 6:off
952 rpcsvcgssd 0:off 1:off 2:off 3:off 4:off 5:off 6:off
953 saslauthd 0:off 1:off 2:off 3:off 4:off 5:off 6:off
954 sendmail 0:off 1:off 2:off 3:off 4:off 5:off 6:off
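On the surface, nothing in this history looks abnormal.
3.2 The system log messages contained the keyword "removed ifcfg-eth0", and its timestamp matched the time of the colleague's accidental paste: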
Mar 21 09:46:50 localhost nm-system-settings: ifcfg-rh: removed /etc/sysconfig/network-scripts/ifcfg-eth0.
Mar 21 09:46:50 localhost nm-system-settings: ifcfg-rh: parsing /etc/sysconfig/network-scripts/ifcfg-peth0 ...
Mar 21 09:46:50 localhost nm-system-settings: ifcfg-rh: read connection 'System peth0'
Mar 21 09:46:50 localhost nm-system-settings: ifcfg-rh: updating /etc/sysconfig/network-scripts/ifcfg-peth0
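Since the colleague had not deleted the file by hand, why was there a log entry for removing the interface file? Had the server been compromised, or was there some other cause?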
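A search like the one in 3.2 could be scripted roughly as follows. This is a sketch only: the function name removed_ifcfg_events is hypothetical, the default path /var/log/messages is an assumption about where the messages log lives, and the pattern is based solely on the log lines quoted above.

```shell
# List "removed" events for interface config files with their timestamps.
# removed_ifcfg_events is a hypothetical helper name; the default log path
# is an assumption.
removed_ifcfg_events() {
    logfile=${1:-/var/log/messages}
    # A syslog timestamp ("Mar 21 09:46:50") is always 15 characters wide.
    # Print "<timestamp> <path>" for every ifcfg-rh "removed" line.
    sed -n 's/^\(.\{15\}\) .*ifcfg-rh: removed \(.*\)\.$/\1 \2/p' "$logfile"
}

# Usage on the server:
#   removed_ifcfg_events /var/log/messages
```

Reducing each hit to a timestamp and a path makes it easier to line the events up against the times recorded for other activity on the system.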
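3.3 The secure log and the last command showed no unusual login IPs, so we provisionally ruled out an intrusion and concentrated on finding which of the accidentally executed lines had caused the interface file to disappear.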
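One way to summarize login sources for a check like 3.3 is sketched below. It assumes the usual sshd message form "Accepted <method> for <user> from <ip> port <n>" and a log at /var/log/secure; the function name accepted_login_sources is hypothetical, and the original check was done by reading the log and the last output directly.

```shell
# Count successful sshd logins per source address.
# accepted_login_sources is a hypothetical helper name; the default log
# path is an assumption.
accepted_login_sources() {
    securelog=${1:-/var/log/secure}
    awk '/sshd\[[0-9]+\]: Accepted / {
        # The source address is the field right after the word "from".
        for (i = 1; i < NF; i++)
            if ($i == "from") { count[$(i + 1)]++; break }
    }
    END { for (ip in count) print count[ip], ip }' "$securelog" | sort -rn
}

# Usage on the server, alongside the login records from last:
#   accepted_login_sources /var/log/secure
#   last -i
```

Any address in the output that does not belong to a known administrator workstation would be the starting point for an intrusion investigation; here none was found.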
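3.4 We reviewed the history from 3.1 once more. It still showed nothing obviously wrong: every entry was simply a line of chkconfig --list output, which left us puzzled.
3.5 When troubleshooting, read the logs. The only option was to go through the messages log carefully for clues. In the log from 3.2, the line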
Mar 21 09:46:50 localhost nm-system-settings: ifcfg-rh: parsing /etc/sysconfig/network-scripts/ifcfg-peth0 ...
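stood out: the interface file "ifcfg-peth0" looked suspicious. That file should be related to Xen virtualization, yet this system was not using Xen virtualization.
3.6 Logging in confirmed that, although the system was not using virtualization, Xen had been installed when the system was first built, the kernel-xen kernel was loaded, and the xend service was running: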
1) [root@~]# uname -r
2.6.18-238.el5xen
2) # /etc/init.d/xend status
xend is running
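The kernel check in 3.6 could be wrapped in a small test, for example when screening several servers for the same condition. This is a sketch; is_xen_kernel is a hypothetical name, and it relies only on the observation above that the running kernel release string contains "xen".

```shell
# Return success if a kernel release string looks like a Xen kernel.
# is_xen_kernel is a hypothetical helper name for this sketch.
is_xen_kernel() {
    case "$1" in
        *xen*) return 0 ;;   # e.g. 2.6.18-238.el5xen
        *)     return 1 ;;
    esac
}

# Usage on the server:
#   is_xen_kernel "$(uname -r)" && echo "running a Xen kernel"
#   /etc/init.d/xend status
```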
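3.7 The creation/modification time of ifcfg-peth0 matched the time of the colleague's accidental paste, which strengthened the suspicion that this file was connected to the fault: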
# find . -type f -mtime 2|xargs ls -l
-rw-r--r-- 1 root root 303 Mar 21 09:46 ./etc/modprobe.conf
-rw-r--r-- 1 root root 23116 Mar 21 09:46 ./etc/sysconfig/hwconf
-rw-r--r-- 1 root root 122 Mar 21 09:46 ./etc/sysconfig/network-scripts/ifcfg-peth0
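The -mtime 2 test above matches files modified between 48 and 72 hours before the command is run, so it only works on the right day. A time-window search that does not depend on when it is run could look like the sketch below, which uses two reference files and find's -newer test. The function name files_modified_between is hypothetical, and the timestamps in the usage line are illustrative (the year is not stated in the original).

```shell
# List regular files under a directory modified within a time window.
# files_modified_between is a hypothetical helper name.
# start and end use the touch -t format: [[CC]YY]MMDDhhmm[.SS]
files_modified_between() {
    dir=$1; start=$2; end=$3
    ref=$(mktemp -d) || return 1
    # Reference files carrying the window boundaries as their mtimes.
    touch -t "$start" "$ref/start"
    touch -t "$end" "$ref/end"
    # Newer than the start marker, but not newer than the end marker.
    find "$dir" -type f -newer "$ref/start" ! -newer "$ref/end" -print
    rm -rf "$ref"
}

# Usage with an illustrative window around 09:46 on 21 March:
#   files_modified_between /etc 03210940 03210950
```

Reference files are used instead of newer find options because this approach only needs -newer, which older find versions also provide.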
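3.8 To investigate and reproduce the fault, we built a test environment matching the production system: RHEL 5.6 with Xen virtualization installed.
3.8.1 As on the production system, we made a backup copy of the interface file, named ifcfg-eth0.bak;
3.8.2 We executed the entries from the colleague's history one at a time. On reaching "kudzu 0:off 1:off 2:off 3:on 4:on 5:on 6:off", the problem reproduced: ifcfg-eth0 disappeared, ifcfg-peth0 was created, and the network went down, exactly as on the production system. See the figure:
3.9 We built a second test environment: RHEL 5.6 without Xen virtualization. Running the same commands as in 3.8.2 did not reproduce the problem. See the figure: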
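The shell treats the first word of each pasted line as a command name, so only lines whose first word resolves to something runnable can have had any effect. The replay in 3.8.2 could therefore be narrowed beforehand with a filter like the sketch below. The function name runnable_first_words is hypothetical, and it assumes the input looks like the history listing in 3.1, with an optional leading history number.

```shell
# Read pasted lines on stdin and print the first words that would run.
# runnable_first_words is a hypothetical helper name.
runnable_first_words() {
    while read -r w1 w2 rest; do
        # If the line starts with a history number, the candidate command
        # is the second word; otherwise it is the first.
        case $w1 in
            ''|*[!0-9]*) cand=$w1 ;;
            *)           cand=$w2 ;;
        esac
        [ -n "$cand" ] || continue
        # command -v succeeds for programs on PATH, and also for shell
        # builtins, functions and aliases.
        if command -v "$cand" >/dev/null 2>&1; then
            echo "$cand"
        fi
    done
}

# Usage, with the pasted lines saved to a file (file name is illustrative):
#   runnable_first_words < pasted-lines.txt
```

The result depends on PATH, so the filter should be run as the same user, and on a system with the same packages, as the session in which the lines were pasted.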
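4. Cause of the fault
Reproducing the problem led to this conclusion: the system had the Xen virtualization environment installed, and the colleague's accidental paste executed the line "kudzu 0:off 1:off 2:off 3:on 4:on 5:on 6:off" as a command. With both conditions met, ifcfg-eth0 was deleted and the network went down.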
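5. Background
Based on information found online, our current understanding of why the kudzu command deletes the interface configuration file is that it is either a bug triggered under specific conditions (Xen virtualization installed) or a consequence of how kudzu is designed to work.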
Appendix:
1. Introduction to kudzu: http://blog.csdn.net/huyangg/article/details/7189743
2. kudzu-related bugs: https://bugzilla.redhat.com/show_bug.cgi?id=206910, https://bugzilla.redhat.com/show_bug.cgi?id=229579, http://linux.bigresource.com/Red-Hat-Prevent-kudzu-from-changing-ifcfg-ethX-file--wi29JYmpf.html