References:
http://kaldi-asr.org/doc/queue.html (official Kaldi site)
http://www.softpanorama.org/HPC/Grid_engine/Installation/installation_of_execution_host.shtml (Installation of the Grid Engine Execution Host)
http://gridscheduler.sourceforge.net/CompileGridEngineSource.html
http://blog.csdn.net/leijunan/article/details/39608849 (cluster environment configuration)
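Compiling SGE
Environment: CentOS 7, 64-bit
On this system SGE has to be compiled from source, which is fairly tedious. Only after finishing did I learn that on some other Linux distributions the packages can simply be downloaded: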
gridengine-master gridengine-client
gridengine-client gridengine-exec
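1. Download GE2011.11p1.tar.gz, which corresponds to version 6.2u5
Go to http://gridscheduler.sourceforge.net/ and download it from the "Download Grid Engine/Grid Scheduler" section.
2. Extract the archive
tar -zxvf GE2011.11p1.tar.gz
3. Run the following commands to compile GE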
cd GE2011.11p1/source
./aimk -no-java -no-jni -no-secure -spool-classic -no-dump -only-depend
./scripts/zerodepend
./aimk -no-java -no-jni -no-secure -spool-classic -no-dump depend
./aimk -no-java -no-jni -no-secure -spool-classic -no-dump
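The four build commands share the same flags, and extra flags such as -no-qmon only belong on the final step. A minimal sketch of a dry-run helper that prints the commands in order; the function name sge_build_steps and the AIMK_FLAGS variable are my own and are not part of the Grid Engine sources:

```shell
#!/bin/sh
# Common flags, taken from the commands in this section.
AIMK_FLAGS="-no-java -no-jni -no-secure -spool-classic -no-dump"

# sge_build_steps [EXTRA_FLAGS_FOR_FINAL_STEP]
# Prints the four build commands in order (dry run; nothing is executed).
# Extra flags are appended to the final compile step only.
sge_build_steps() {
    echo "./aimk $AIMK_FLAGS -only-depend"
    echo "./scripts/zerodepend"
    echo "./aimk $AIMK_FLAGS depend"
    echo "./aimk $AIMK_FLAGS${1:+ $1}"
}

sge_build_steps -no-qmon
```

To actually run the steps, the output could be piped to `sh -e` from inside GE2011.11p1/source, so that the first failing command stops the build.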
If the build fails, it is usually because the build options do not match the software installed on the system. These are the errors I ran into:
Error 1:
When running ./aimk -no-java -no-jni -no-secure -spool-classic -no-dump, the build reports:
../utilbin/authuser.c:72:31: fatal error: security/pam_appl.h: No such file or directory
#include <security/pam_appl.h>
Solution:
There is no pam_appl.h in the security include directory because PAM is not properly installed. Download openpam-20130907.tar.gz and build it:
cd openpam-20130907
./configure
sudo make install
Then rerun the SGE build commands.
Error 2:
In file included from ../Xmt310/Xmt/All.c:23:0:
../Xmt310/Xmt/Xmt.h:56:19: fatal error: Xm/Xm.h: No such file or directory
#include <Xm/Xm.h>
Solution:
cd GE2011.11p1/source
./aimk -no-java -no-jni -no-secure -spool-classic -no-dump -only-depend
./scripts/zerodepend
./aimk -no-java -no-jni -no-secure -spool-classic -no-dump depend
./aimk -no-java -no-jni -no-secure -spool-classic -no-dump -no-qmon
The -no-qmon option skips building qmon, because X11 is not installed on this system. For the other options, see: http://gridscheduler.sourceforge.net/CompileGridEngineSource.html
One thing to watch: adding -no-qmon to the ./aimk -no-java -no-jni -no-secure -spool-classic -no-dump depend command also seems to cause an error. It has to go on the last command only; I do not know exactly why.
Error 3:
rm -f gethost
gcc -o gethost -DSGE_ARCH_STRING=\"linux-x64\" -O3 -Wall -Wstrict-prototypes -DUSE_POLL -DLINUX -DLINUXX64 -DLINUXX64 -D_GNU_SOURCE -DGETHOSTBYNAME_R6 -DGETHOSTBYADDR_R8 -DHAS_VSNPRINTF -DHAS_IN_PORT_T -I/build/berkeleydb/include/ -DTARGET_64BIT -DSPOOLING_classic -Wno-strict-aliasing -DNO_JNI -DCOMPILE_DC -D__SGE_COMPILE_WITH_GETTEXT__ -D__SGE_NO_USERMAPPING__ -DTHREADBINDING -DHWLOC -Wno-error -DPROG_NAME='"qtcsh"' -DLINUXX64 -I. -I.. -D_PATH_TCSHELL='"/usr/local/bin/tcsh"' -I../../../libs/gdi -I../../../libs/gdi ../gethost.c -lncurses -lcrypt -L../../../LINUXX64 -R/lib/linux-x64 -L/build/berkeleydb/lib/ -L. -Wl,-rpath,\$ORIGIN/../../lib/linux-x64 -lsge -lpthread -ldl
gcc: error: unrecognized command line option '-R'
This error came up with GE2011.11.tar.gz. On the surface it is a problem with the '-R' option, but gcc does not normally fail like this. My guess is that it was related to where -no-qmon was placed relative to the ./aimk -no-java -no-jni -no-secure -spool-classic -no-dump depend command (see error 2). Because GE2011.11p1.tar.gz later compiled successfully, I did not test this any further.
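Both missing-header errors could be caught before starting the long build. A minimal sketch, assuming the headers would normally live under /usr/include; the function name check_headers and that default directory are my own assumptions, not something the SGE build provides:

```shell
#!/bin/sh
# check_headers INCLUDE_DIR HEADER...
# Prints each header that is missing under INCLUDE_DIR.
# Returns 1 if at least one header is missing, 0 otherwise.
check_headers() {
    inc_dir=$1
    shift
    missing=0
    for hdr in "$@"; do
        if [ ! -f "$inc_dir/$hdr" ]; then
            echo "missing: $hdr"
            missing=1
        fi
    done
    return $missing
}

# Headers needed by authuser.c (PAM) and by qmon (Motif).
check_headers /usr/include security/pam_appl.h Xm/Xm.h \
    || echo "install the missing headers, or build with -no-qmon if only Xm/Xm.h is absent"
```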
4. Set environment variables
mkdir /opt/ge2011
export SGE_ROOT=/opt/ge2011
export SGE_CELL=default
5. Run: scripts/distinst -all -local -noexit
This command installs install_qmaster, install_execd and the other files under $SGE_ROOT.
Error 1:
Installing: sge_qmaster sge_execd sge_shadowd sge_shepherd sge_coshepherd qstat qsub qalter qconf qdel qacct qmod qsh utilbin jobs qmon qhost qmake qtcsh qping qloadsensor.exe sgepasswd qquota qrsub qrstat qrdel common
Architectures: –noexit
Base directory: /opt/ge2011
OK [Y/N][Y]: OK [Y/N][Y]: y
Installing "3rd_party/" directory tree
cp: cannot stat 'dist/3rd_party': No such file or directory
This command failed: cp -r dist/3rd_party /opt/ge2011
Installation failed. Exiting.
Solution:
This is a path problem. I had gone into the directory that contains distinst and run ./distinst there, so the script's relative paths were wrong and it could not find the dist/3rd_party directory. Running it from the parent directory of scripts, as scripts/distinst -all -local -noexit, works.
Error 2:
Installing "3rd_party/" directory tree
Installing "inst_sge", "install_qmaster" and "install_execd"
Installing "util/" directory tree
chmod: cannot access '/opt/ge2011/util/DetectJvmLibrary.jar': No such file or directory
This command failed: chmod 644 /opt/ge2011/util/install_modules/backup_template.conf /opt/ge2011/util/install_modules/DB_CONFIG /opt/ge2011/util/install_modules/inst_berkeley.sh /opt/ge2011/util/install_modules/inst_common.sh /opt/ge2011/util/install_modules/inst_execd.sh /opt/ge2011/util/install_modules/inst_execd_uninst.sh /opt/ge2011/util/install_modules/inst_qmaster.sh /opt/ge2011/util/install_modules/inst_qmaster_uninst.sh /opt/ge2011/util/install_modules/inst_schedd_high.conf /opt/ge2011/util/install_modules/inst_schedd_max.conf /opt/ge2011/util/install_modules/inst_schedd_normal.conf /opt/ge2011/util/install_modules/inst_st.sh /opt/ge2011/util/install_modules/inst_template.conf /opt/ge2011/util/rctemplates/darwin_template /opt/ge2011/util/rctemplates/sgebdb_template /opt/ge2011/util/rctemplates/sgeexecd_template /opt/ge2011/util/rctemplates/sgemaster_template /opt/ge2011/util/sgeCA/sge_ca.cnf /opt/ge2011/util/sgeCA/sge_ssl.cnf /opt/ge2011/util/sgeCA/sge_ssl_template.cnf /opt/ge2011/util/sgeSMF/bdb_template.xml /opt/ge2011/util/sgeSMF/execd_template.xml /opt/ge2011/util/sgeSMF/qmaster_template.xml /opt/ge2011/util/sgeSMF/shadowd_template.xml /opt/ge2011/util/sgeSMF/sge_smf_support.sh /opt/ge2011/util/DetectJvmLibrary.jar /opt/ge2011/util/resources/calendars/day /opt/ge2011/util/resources/calendars/day_s /opt/ge2011/util/resources/calendars/night /opt/ge2011/util/resources/calendars/night_s /opt/ge2011/util/resources/centry/arch /opt/ge2011/util/resources/centry/calendar /opt/ge2011/util/resources/centry/cpu /opt/ge2011/util/resources/centry/display_win_gui /opt/ge2011/util/resources/centry/h_core /opt/ge2011/util/resources/centry/h_cpu /opt/ge2011/util/resources/centry/h_data /opt/ge2011/util/resources/centry/h_fsize /opt/ge2011/util/resources/centry/hostname /opt/ge2011/util/resources/centry/h_rss /opt/ge2011/util/resources/centry/h_rt /opt/ge2011/util/resources/centry/h_stack /opt/ge2011/util/resources/centry/h_vmem 
/opt/ge2011/util/resources/centry/load_avg /opt/ge2011/util/resources/centry/load_long /opt/ge2011/util/resources/centry/load_medium /opt/ge2011/util/resources/centry/load_short /opt/ge2011/util/resources/centry/m_core /opt/ge2011/util/resources/centry/mem_free /opt/ge2011/util/resources/centry/mem_total /opt/ge2011/util/resources/centry/mem_used /opt/ge2011/util/resources/centry/min_cpu_interval /opt/ge2011/util/resources/centry/m_socket /opt/ge2011/util/resources/centry/m_topology /opt/ge2011/util/resources/centry/m_topology_inuse /opt/ge2011/util/resources/centry/np_load_avg /opt/ge2011/util/resources/centry/np_load_long /opt/ge2011/util/resources/centry/np_load_medium /opt/ge2011/util/resources/centry/np_load_short /opt/ge2011/util/resources/centry/num_proc /opt/ge2011/util/resources/centry/qname /opt/ge2011/util/resources/centry/rerun /opt/ge2011/util/resources/centry/s_core /opt/ge2011/util/resources/centry/s_cpu /opt/ge2011/util/resources/centry/s_data /opt/ge2011/util/resources/centry/seq_no /opt/ge2011/util/resources/centry/s_fsize /opt/ge2011/util/resources/centry/slots /opt/ge2011/util/resources/centry/s_rss /opt/ge2011/util/resources/centry/s_rt /opt/ge2011/util/resources/centry/s_stack /opt/ge2011/util/resources/centry/s_vmem /opt/ge2011/util/resources/centry/swap_free /opt/ge2011/util/resources/centry/swap_rate /opt/ge2011/util/resources/centry/swap_rsvd /opt/ge2011/util/resources/centry/swap_total /opt/ge2011/util/resources/centry/swap_used /opt/ge2011/util/resources/centry/tmpdir /opt/ge2011/util/resources/centry/virtual_free /opt/ge2011/util/resources/centry/virtual_total /opt/ge2011/util/resources/centry/virtual_used /opt/ge2011/util/resources/pe/make /opt/ge2011/util/resources/pe/make.sge_pqs_api /opt/ge2011/util/resources/schemas/qhost/qhost.xsd /opt/ge2011/util/resources/schemas/qquota/qquota.xsd /opt/ge2011/util/resources/schemas/qrstat/qrstat.xsd /opt/ge2011/util/resources/schemas/qstat/detailed_job_info_cb.xsd 
/opt/ge2011/util/resources/schemas/qstat/detailed_job_info.xsd /opt/ge2011/util/resources/schemas/qstat/message.xsd /opt/ge2011/util/resources/schemas/qstat/qstat_cb.xsd /opt/ge2011/util/resources/schemas/qstat/qstat.xsd /opt/ge2011/util/resources/usersets/arusers /opt/ge2011/util/resources/usersets/deadlineusers /opt/ge2011/util/resources/usersets/defaultdepartment
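Installation failed. Exiting.
Solution:
Wrong: scripts/distinst -all -local –noexit
Right: scripts/distinst -all -local -noexit
The error says that DetectJvmLibrary.jar does not exist. We built with -no-java, so its absence is expected, but the chmod step does not check for it, and without -noexit the script aborts. The dash in front of noexit was the odd part: it had been replaced by a different, longer dash character (it apparently changed when I pressed Enter), so the option was not recognised. Luckily I noticed. Someone else reporting the same problem: https://sourceforge.net/p/gridscheduler/mailman/message/35610855/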
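A substituted dash like the one in front of noexit is hard to see by eye, so it can help to test a pasted command before running it. A minimal sketch; the function name has_unicode_dash is my own, and it only looks for the en dash (U+2013) and em dash (U+2014), which is an assumption about which characters editors tend to substitute:

```shell
#!/bin/sh
# has_unicode_dash STRING
# Returns 0 if STRING contains an en dash (U+2013) or em dash (U+2014),
# both of which are easy to mistake for an ASCII "-".
has_unicode_dash() {
    en_dash=$(printf '\342\200\223')   # UTF-8 bytes of U+2013
    em_dash=$(printf '\342\200\224')   # UTF-8 bytes of U+2014
    case $1 in
        *"$en_dash"*|*"$em_dash"*) return 0 ;;
        *) return 1 ;;
    esac
}

cmd='scripts/distinst -all -local -noexit'
if has_unicode_dash "$cmd"; then
    echo "command contains a non-ASCII dash; retype it by hand"
else
    echo "command looks clean"
fi
```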
At this point SGE is installed. What follows is the configuration.
6. Change the cluster port numbers
/etc/services
The cluster needs two unused ports. The defaults are:
sge_qmaster 6444/tcp sge-qmaster # Grid Engine Qmaster Service
sge_qmaster 6444/udp sge-qmaster # Grid Engine Qmaster Service
sge_execd 6445/tcp sge-execd # Grid Engine Execution Service
sge_execd 6445/udp sge-execd # Grid Engine Execution Service
Change them to less commonly used port numbers:
sge_qmaster 27100/tcp
sge_qmaster 27100/udp
sge_execd 27101/tcp
sge_execd 27101/udp
Note: the ports must be set on every machine that will be part of the cluster.
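Since the same entries have to be present on every machine, it is worth listing what each host actually has. A minimal sketch, assuming the entries use the standard /etc/services layout (name, then port/protocol); the function name sge_ports is my own:

```shell
#!/bin/sh
# sge_ports SERVICES_FILE
# Prints "name port/proto" for every sge_qmaster / sge_execd entry,
# which makes duplicate or leftover default entries easy to spot.
sge_ports() {
    awk '$1 == "sge_qmaster" || $1 == "sge_execd" { print $1, $2 }' "$1"
}

if [ -r /etc/services ]; then
    sge_ports /etc/services
fi
```

If the output still shows 6444 or 6445, or shows the same service with two different ports, the edit did not take effect as intended.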
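7. Configure the network file system (NFS)
NFS is a network file system, used here to share files between the master host and the execution hosts. Transfers within a LAN are very fast.
7.1 Configure host names
On all hosts, open the hosts file with:
vim /etc/hosts
Add the name of every machine that will be an execution host, one per line, in this format: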
192.168.0.21 hostname1
192.168.0.22 hostname2
….
IP hostnameN
Note: IP is each host's IP address, which can be checked with the ifconfig command;
hostname can be checked with the hostname command.
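Before installing, it can be checked that every planned execution host really has an entry in the hosts file. A minimal sketch; the function name missing_hosts is my own, and it only skips whole-line comments (trailing comments on an entry are not handled):

```shell
#!/bin/sh
# missing_hosts HOSTS_FILE NAME...
# Prints every NAME that has no entry in HOSTS_FILE
# (same format as /etc/hosts: IP followed by one or more names).
missing_hosts() {
    hosts_file=$1
    shift
    for name in "$@"; do
        awk -v h="$name" '
            /^[ \t]*#/ { next }      # skip comment lines
            { for (i = 2; i <= NF; i++) if ($i == h) found = 1 }
            END { exit !found }      # exit status 0 only if the name was seen
        ' "$hosts_file" || echo "$name"
    done
}

# Example with the placeholder names used above.
missing_hosts /etc/hosts hostname1 hostname2
```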
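7.2 Configure the shared directories
On the machine that will be the master host, open the configuration file with:
vim /etc/exports
and edit it in the following format:
(version that can only be mounted locally)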
/opt/ge2011 192.168.1.216(rw,insecure,no_all_squash,no_root_squash,sync)
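/usr/wxf/kaldi 192.168.1.216(rw,insecure,no_all_squash,no_root_squash,sync)
Note: the first column is the path to share, the second is the IP allowed to mount it, and the export options are in parentheses. /opt/ge2011 is the cluster root directory and /usr/wxf/kaldi is the Kaldi installation path.
Note: with 192.168.1.216 as the client, the share can only be mounted on that machine itself. Mounting from any other host fails with:
mount.nfs: access denied by server while mounting 192.168.1.216:/opt/ge2011
So the correct configuration is:
(version that other hosts can mount)
/opt/ge2011 *(rw,insecure,no_all_squash,no_root_squash,sync)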
/usr/wxf/kaldi *(rw,insecure,no_all_squash,no_root_squash,sync)
Then apply the configuration with:
exportfs -av
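A missing space between the path and the client list is easy to overlook in /etc/exports. A minimal sketch of a check that flags lines consisting of a single field; the function name bad_export_lines is my own, and the check is only a heuristic for this particular typo, not a full validation of the exports syntax:

```shell
#!/bin/sh
# bad_export_lines EXPORTS_FILE
# Prints non-comment lines that consist of a single field, i.e. where the
# path and the client list were probably run together without a space.
bad_export_lines() {
    awk '!/^[ \t]*#/ && NF == 1 { print }' "$1"
}

if [ -r /etc/exports ]; then
    bad_export_lines /etc/exports
fi
```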
7.3 Mount the master host's directories on every execution host
Create the mount points:
mkdir /opt/ge2011 /usr/wxf /usr/wxf/kaldi
mount 192.168.1.216:/opt/ge2011 /opt/ge2011
mount 192.168.1.216:/usr/wxf/kaldi /usr/wxf/kaldi
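Note: the part before the colon is the master host's IP or hostname, and the last argument is the mount point.
How to verify that the mount succeeded:
1) The commands finish without errors.
2) On the execution host, the mount command lists the mounted paths.
3) On every execution host, cd into /opt/ge2011 and /usr/wxf/kaldi and check that all of the master host's files under those paths are visible. If so, the mount succeeded.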
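The second check can be scripted by reading the kernel's mount table. A minimal sketch, assuming a Linux host where /proc/mounts lists device, mount point and file system type in that order; the function name is_nfs_mounted is my own:

```shell
#!/bin/sh
# is_nfs_mounted MOUNT_POINT [MOUNTS_FILE]
# Returns 0 if MOUNT_POINT appears as an NFS mount (type nfs or nfs4)
# in MOUNTS_FILE, which defaults to /proc/mounts.
is_nfs_mounted() {
    awk -v mp="$1" '
        $2 == mp && $3 ~ /^nfs/ { found = 1 }
        END { exit !found }
    ' "${2:-/proc/mounts}"
}

for dir in /opt/ge2011 /usr/wxf/kaldi; do
    if is_nfs_mounted "$dir"; then
        echo "$dir: mounted"
    else
        echo "$dir: NOT mounted"
    fi
done
```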
Possible causes when mount fails:
1) The NFS service is not running. Start it on the master host and the execution hosts with:
service rpcbind restart
service nfs restart
After restarting, reconfigure the firewall's port filtering, or configure NFS to use fixed ports.
2) Firewall problems: open the firewall configuration file, add the rules for the NFS-related ports (2049, 111, 32803, 892, 875 and 662), and save.
2.1) iptables version
vim /etc/sysconfig/iptables
# Firewall configuration written by system-config-firewall
# Manual customization of this file is not recommended.
*filter
:INPUT ACCEPT [0:0]
:FORWARD ACCEPT [0:0]
:OUTPUT ACCEPT [0:0]
# Added: NFS-related ports
-A INPUT -m state --state NEW -m tcp -p tcp --dport 2049 -j ACCEPT
-A INPUT -m state --state NEW -m tcp -p tcp --dport 111 -j ACCEPT
-A INPUT -m state --state NEW -m tcp -p tcp --dport 32803 -j ACCEPT
-A INPUT -m state --state NEW -m tcp -p tcp --dport 892 -j ACCEPT
-A INPUT -m state --state NEW -m tcp -p tcp --dport 875 -j ACCEPT
-A INPUT -m state --state NEW -m tcp -p tcp --dport 662 -j ACCEPT
# End of added rules
-A INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT
-A INPUT -p icmp -j ACCEPT
-A INPUT -i lo -j ACCEPT
-A INPUT -m state --state NEW -m tcp -p tcp --dport 22 -j ACCEPT
-A INPUT -j REJECT --reject-with icmp-host-prohibited
-A FORWARD -j REJECT --reject-with icmp-host-prohibited
COMMIT
service iptables restart
2.2) firewalld version
firewall-cmd --permanent --add-service=nfs
firewall-cmd --reload
If none of the above helps, the firewall may have to be turned off:
chkconfig iptables off
or:
systemctl stop firewalld.service
systemctl disable firewalld.service    # do not start firewalld at boot
firewall-cmd --state    # check the firewall status
Note: NFS reference:
http://www.unixmen.com/nfs-server-installation-and-configuration-in-centos-6-3-rhel-6-3-and-scientific-linux-6-3/
Error 1:
Mounting locally works, but mounting on an execution host fails with:
mount.nfs: access denied by server while mounting 192.168.1.216:/opt/ge2011
What I checked:
1) The server responds to ping.
2) rpcbind and nfs are running:
[root@hadoop-0 wxf]# service rpcbind restart
Redirecting to /bin/systemctl restart rpcbind.service
[root@hadoop-0 wxf]# service nfs restart
Redirecting to /bin/systemctl restart nfs.service
rpcinfo -p localhost
3) The firewall is turned off:
systemctl stop firewalld.service
systemctl disable firewalld.service    # do not start firewalld at boot
firewall-cmd --state
Solution:
The cause is the client restriction on the exported directory in the NFS configuration. The entry
/opt/ge2011 192.168.1.216(rw,insecure,no_all_squash,no_root_squash,sync)
should be:
/opt/ge2011 *(rw,insecure,no_all_squash,no_root_squash,sync)
Master host installation
1.
As root, change to the SGE directory:
cd $SGE_ROOT
2.
Create a file named hostlist and enter the execution host names, one per line:
hostname1
hostname2
…
hostnameN
3.
Run install_qmaster. The full procedure is in the document "主節點安裝.docx" (master node installation).
Master node installation:
Points that need attention:
under an user id other than >root< (y/n) [y] >> y
Please enter a valid user name >> sgeadmin
Are you going to install Windows Execution Hosts? (y/n) [n] >> (press Enter)
Do you want to enable the JMX MBean server (y/n) [n] >> (press Enter)
Please enter a range [20000-20100] >> 2000-21000
Do you want to use a file which contains the list of hosts (y/n) [n] >> y
Please enter the file name which contains the host list: hostlist
Do you want to add your shadow host(s) now? (y/n) [y] >> n
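Execution host installation
The full procedure is in the document "執行節點安裝.docx" (execution node installation).
Execution node:
1. Create the user:
sudo adduser sgeadmin
2. Set the SGE ports:
Edit /etc/services (vim /etc/services) and change the ports to:
sge_qmaster 27100/tcp sge-qmaster # Grid Engine Qmaster Service
sge_qmaster 27100/udp sge-qmaster # Grid Engine Qmaster Service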
sge_execd 27101/tcp sge-execd # Grid Engine Execution Service
sge_execd 27101/udp sge-execd # Grid Engine Execution Service
Note: 1) Use the same settings on all hosts.
2) Watch out for duplicate entries, which can leave the port change without effect.
3) If the master node cannot be found, open the ports above in the firewall on the master node:
[root@hadoop-0 /]# firewall-cmd --zone=public --add-port=27100/tcp --permanent
[root@hadoop-0 /]# firewall-cmd --zone=public --add-port=27100/udp --permanent
[root@hadoop-0 /]# firewall-cmd --zone=public --add-port=27101/udp --permanent
[root@hadoop-0 /]# firewall-cmd --zone=public --add-port=27101/tcp --permanent
[root@hadoop-0 /]# firewall-cmd --reload
3. Source /opt/ge2011/default/common/settings.sh to set the environment variables (. /opt/ge2011/default/common/settings.sh); otherwise later steps will run into problems.
4. Run /opt/ge2011/install_execd. Points that need attention during the installation:
Do you want to configure a different spool directory
for this host (y/n) [n] >> y
Enter the spool directory now! >> /home/sgeadmin/hadoop-0
Finally
Start at boot
1. Add /etc/init.d/sgemaster.p27100 and /etc/init.d/sgeexecd.p27100 to /etc/rc.local
2. Add . /opt/ge2011/default/common/settings.sh to /etc/profile
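The leading dot matters: settings.sh only affects the current shell when it is sourced, not when it is executed as a separate program. A minimal sketch that shows the difference using a stand-in file; the temporary file is a hypothetical substitute for settings.sh, and only the SGE_ROOT value is taken from this guide:

```shell
#!/bin/sh
# Stand-in for settings.sh: a file that exports one variable.
demo=$(mktemp)
echo 'SGE_ROOT=/opt/ge2011; export SGE_ROOT' > "$demo"

unset SGE_ROOT
sh "$demo"                        # executed: runs in a child process
after_exec=${SGE_ROOT:-unset}     # the variable never reaches this shell

. "$demo"                         # sourced: runs in the current shell
after_source=${SGE_ROOT:-unset}   # now the variable is set here

echo "after executing: $after_exec"
echo "after sourcing:  $after_source"
rm -f "$demo"
```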
Testing
The tools are located in this path:
/opt/ge2011/bin/linux-x64
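Configuring execution hosts
/opt/ge2011/bin/linux-x64/qconf -sel
qconf -ae hostname       add an execution host
qconf -de hostname       delete an execution host
qconf -sel               show the list of execution hosts
Configuring administrative hosts
/opt/ge2011/bin/linux-x64/qconf -sh
qconf -ah hostname       add an administrative host
qconf -dh hostname       delete an administrative host
qconf -sh                show the list of administrative hosts
Configuring submit hosts
/opt/ge2011/bin/linux-x64/qconf -ss
qconf -as hostname       add a submit host
qconf -ds hostname       delete a submit host
qconf -ss                show the list of submit hosts
Configuring queues
qconf -aq queuename      add a cluster queue
qconf -dq queuename      delete a cluster queue
qconf -mq queuename      modify a cluster queue's configuration
qconf -sq queuename      show a cluster queue's configuration
qconf -sql               show the list of cluster queues
Configuring host groups
qconf -ahgrp groupname   add a host group
qconf -mhgrp groupname   modify the members of a host group
qconf -shgrp groupname   show the members of a host group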
Host status
/opt/ge2011/bin/linux-x64/qhost
HOSTNAME ARCH NCPU LOAD MEMTOT MEMUSE SWAPTO SWAPUS
-------------------------------------------------------------------------------
global - - - - - - -
hadoop-0 linux-x64 4 0.18 15.4G 5.3G 7.8G 0.0
Cluster status
/opt/ge2011/bin/linux-x64/qstat -f
The cluster status looks like this:
queuename qtype resv/used/tot. load_avg arch states
-------------------------------------------------------------------------------
all.q@hadoop-0 BIP 0/0/4 0.17 linux-x64
-------------------------------------------------------------------------------
all.q@hadoop-2 BIP 0/0/4 0.58 linux-x64
-------------------------------------------------------------------------------
all.q@hadoop-5 BIP 0/0/4 0.01 linux-x64
-------------------------------------------------------------------------------
all.q@hadoop-8 BIP 0/0/4 0.01 linux-x64
There are 4 execution nodes. tot is the number of slots (cores) each node contributes; here every node has 4 cores.
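The per-queue figures can be added up to get a cluster-wide view. A minimal sketch that reads `qstat -f` output on standard input and sums the used and total slots; the function name total_slots is my own, and it assumes the resv/used/tot. value is the third whitespace-separated column, as in the listing above:

```shell
#!/bin/sh
# total_slots
# Reads `qstat -f` output on stdin and prints "<used>/<total> slots in use".
# Only lines whose first field looks like queue@host are counted.
total_slots() {
    awk '$1 ~ /@/ && $3 ~ /^[0-9]+\/[0-9]+\/[0-9]+$/ {
             split($3, s, "/")      # s[1]=resv, s[2]=used, s[3]=tot
             used += s[2]
             tot  += s[3]
         }
         END { printf "%d/%d slots in use\n", used, tot }'
}

# Only call qstat when it is actually on the PATH.
if command -v qstat >/dev/null 2>&1; then
    qstat -f | total_slots
fi
```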