Installing Slurm on CentOS

Control node: node16
Compute nodes: node16, node18


Remove a previously failed Slurm installation

yum remove slurm  -y
cat /etc/passwd | grep slurm
userdel -r slurm

Create the slurm user

export SLURMUSER=412 
groupadd -g $SLURMUSER slurm 
useradd -m -c "SLURM workload manager" -d /var/lib/slurm -u $SLURMUSER -g slurm -s /bin/bash slurm

Check the slurm user's UID and GID. They must be identical on the control node and on every compute node:

id slurm
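
To compare the IDs across nodes without reading each output by eye, something like the following could be used. This is a minimal sketch: `check_ids` is a hypothetical helper, not part of Slurm, and it assumes you feed it one line per node in the form `host uid gid` (the commented example collects these over ssh, which assumes passwordless root login).

```shell
# check_ids: hypothetical helper, not part of Slurm.
# Reads lines of the form "host uid gid" on stdin and succeeds only
# if every line carries the same uid and the same gid as the first.
check_ids() {
    awk '
        NR == 1 { uid = $2; gid = $3 }
        $2 != uid || $3 != gid {
            printf "mismatch on %s: uid=%s gid=%s (expected %s/%s)\n", $1, $2, $3, uid, gid
            bad = 1
        }
        END { exit bad + 0 }
    '
}

# Example of collecting the input (assumes passwordless ssh as root):
# for h in node16 node18; do
#     echo "$h $(ssh "$h" id -u slurm) $(ssh "$h" id -g slurm)"
# done | check_ids
```

The function returns a non-zero status and prints the offending host when any node disagrees with the first line.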

Install Slurm

First install the EPEL repository:

yum install epel-release

Install the packages Slurm depends on:

yum install openssl openssl-devel pam-devel numactl numactl-devel hwloc hwloc-devel lua lua-devel readline-devel rrdtool-devel ncurses-devel man2html libibmad libibumad -y

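If the command fails with a package conflict (the error screenshot from the original post is not available), remove the conflicting packages and rerun the command above: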

yum -y remove ibacm-1.2.0-1.el7.x86_64
yum -y remove libipathverbs-1.3-2.el7.x86_64   

Install rpm-build:

yum install rpm-build

Download Slurm and build the RPMs:

wget https://www.schedmd.com/archives.php/downloads/archive/slurm-17.02.4.tar.bz2
rpmbuild -ta slurm-17.02.4.tar.bz2

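On both the control node and the compute nodes (the file names below come from a 15.08.7 build; substitute the version that rpmbuild actually produced, e.g. 17.02.4 for the tarball above):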

cd /root/rpmbuild/RPMS/x86_64
mkdir slurm-rpms
cp slurm-15.08.7-1.el7.centos.x86_64.rpm \
   slurm-devel-15.08.7-1.el7.centos.x86_64.rpm \
   slurm-munge-15.08.7-1.el7.centos.x86_64.rpm \
   slurm-pam_slurm-15.08.7-1.el7.centos.x86_64.rpm \
   slurm-perlapi-15.08.7-1.el7.centos.x86_64.rpm \
   slurm-plugins-15.08.7-1.el7.centos.x86_64.rpm \
   slurm-sjobexit-15.08.7-1.el7.centos.x86_64.rpm \
   slurm-sjstat-15.08.7-1.el7.centos.x86_64.rpm \
   slurm-slurmdbd-15.08.7-1.el7.centos.x86_64.rpm \
   slurm-slurmdb-direct-15.08.7-1.el7.centos.x86_64.rpm \
   slurm-sql-15.08.7-1.el7.centos.x86_64.rpm \
   slurm-torque-15.08.7-1.el7.centos.x86_64.rpm slurm-rpms/
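
Because the RPM file names change with every Slurm version, a glob can avoid hard-coding them. The following is a sketch only: `collect_rpms` is a hypothetical helper name, and it assumes every package you want starts with `slurm` and ends in `.rpm`.

```shell
# collect_rpms: hypothetical helper that copies every slurm*.rpm
# from a build directory into a target directory, whatever the version.
collect_rpms() {
    src=$1
    dest=$2
    mkdir -p "$dest" || return 1
    # The glob matches slurm-<version>..., slurm-devel-<version>..., etc.
    # It does not match the slurm-rpms directory, which has no .rpm suffix.
    cp "$src"/slurm*.rpm "$dest"/
}

# Example, using the paths from this guide:
# collect_rpms /root/rpmbuild/RPMS/x86_64 /root/rpmbuild/RPMS/x86_64/slurm-rpms
```

If no file matches, `cp` fails and the function returns a non-zero status, which is a useful signal that the build did not produce any packages.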

On the compute nodes:

yum --nogpgcheck localinstall slurm-15.08.7-1.el7.centos.x86_64.rpm slurm-devel-15.08.7-1.el7.centos.x86_64.rpm slurm-munge-15.08.7-1.el7.centos.x86_64.rpm slurm-perlapi-15.08.7-1.el7.centos.x86_64.rpm slurm-plugins-15.08.7-1.el7.centos.x86_64.rpm slurm-sjobexit-15.08.7-1.el7.centos.x86_64.rpm slurm-sjstat-15.08.7-1.el7.centos.x86_64.rpm slurm-torque-15.08.7-1.el7.centos.x86_64.rpm

Configure slurm.conf:

cd /etc/slurm 
cp slurm.conf.example slurm.conf
vim slurm.conf

The finished slurm.conf looks like this:

ControlMachine=node16
ControlAddr=10.192.168.116

#MailProg=/bin/mail
MpiDefault=none
#MpiParams=ports=#-#
ProctrackType=proctrack/pgid
ReturnToService=1
SlurmctldPidFile=/var/run/slurmctld.pid
#SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
#SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=slurm
#SlurmdUser=root
StateSaveLocation=/var/spool/slurmctld
SwitchType=switch/none
TaskPlugin=task/none
#
#
# TIMERS
#KillWait=30
#MinJobAge=300
#SlurmctldTimeout=120
#SlurmdTimeout=300
#
#
# SCHEDULING
FastSchedule=1
SchedulerType=sched/backfill
#SchedulerPort=7321
SelectType=select/linear
#
#
# LOGGING AND ACCOUNTING
AccountingStorageType=accounting_storage/none
ClusterName=node
#JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/none
#SlurmctldDebug=3
SlurmctldLogFile=/var/log/slurmctld.log
#SlurmdDebug=3
SlurmdLogFile=/var/log/slurmd.log
#
#
# COMPUTE NODES
#NodeName=node16 CPUs=20 RealMemory=63989 Sockets=2 CoresPerSocket=10 ThreadsPerCore=1 State=IDLE
#NodeName=node11 CPUs=20 RealMemory=62138 Sockets=4 CoresPerSocket=5 ThreadsPerCore=1 State=IDLE
#NodeName=node18 CPUs=20 RealMemory=64306 Sockets=2 CoresPerSocket=10 ThreadsPerCore=1 State=IDLE
#PartitionName=control Nodes=node16 Default=NO MaxTime=INFINITE State=UP
#PartitionName=compute Nodes=node16,node11,node18 Default=YES MaxTime=INFINITE State=UP

NodeName=node16 NodeAddr=10.192.168.116 CPUs=20 State=UNKNOWN
#NodeName=node11 NodeAddr=10.192.168.111 CPUs=20 State=UNKNOWN
NodeName=node18 NodeAddr=10.192.168.118 CPUs=20 State=UNKNOWN
#PartitionName=debug Nodes=node16,node11,node18 Default=YES MaxTime=INFINITE State=UP
PartitionName=control Nodes=node16 Default=NO MaxTime=INFINITE State=UP
PartitionName=compute Nodes=node16,node18 Default=YES MaxTime=INFINITE State=UP
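
A common mistake when commenting nodes in and out, as done above with node11, is to leave a node in a partition's `Nodes=` list after its `NodeName` line has been disabled. A rough check could look like the sketch below. `undefined_partition_nodes` is a hypothetical helper, not a Slurm tool, and it assumes plain comma-separated node names (no bracket ranges such as `node[16-18]`).

```shell
# undefined_partition_nodes: hypothetical helper, not a Slurm tool.
# Prints every node that a PartitionName line lists in Nodes= but that
# has no active NodeName line. Only plain comma-separated names are
# handled, not bracket ranges.
undefined_partition_nodes() {
    awk '
        /^[ \t]*#/ { next }    # skip commented-out lines
        {
            for (i = 1; i <= NF; i++) {
                if ($i ~ /^NodeName=/) {
                    defined[substr($i, 10)] = 1
                } else if ($i ~ /^Nodes=/) {
                    n = split(substr($i, 7), names, ",")
                    for (j = 1; j <= n; j++) used[names[j]] = 1
                }
            }
        }
        END {
            for (name in used) if (!(name in defined)) print name
        }
    ' "$1" | sort
}

# Example: undefined_partition_nodes /etc/slurm/slurm.conf
```

An empty result means every node referenced by a partition has a matching `NodeName` definition.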

View the active configuration:

scontrol show config

Copy the finished conf file to the compute nodes:

scp slurm.conf root@node18:/etc/slurm/slurm.conf
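
With more than one compute node, the copy can be wrapped in a loop. The sketch below is an illustration, not part of the original procedure: `push_conf` and the `DRY_RUN` variable are hypothetical names, and the real copy assumes root ssh access to each host.

```shell
# push_conf: hypothetical helper that copies a slurm.conf to each
# listed host. With DRY_RUN=1 it only prints the scp commands it
# would run, so the host list can be reviewed first.
push_conf() {
    conf=$1
    shift
    for host in "$@"; do
        if [ "${DRY_RUN:-0}" = 1 ]; then
            echo "scp $conf root@$host:/etc/slurm/slurm.conf"
        else
            scp "$conf" "root@$host:/etc/slurm/slurm.conf" || return 1
        fi
    done
}

# Example: push_conf /etc/slurm/slurm.conf node18
```

Every node must end up with the same slurm.conf, so stopping at the first failed copy (the `return 1`) makes a partial rollout easy to notice.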

On the control node, create the state directory and log files:

mkdir /var/spool/slurmctld 
chown slurm: /var/spool/slurmctld 
chmod 755 /var/spool/slurmctld 
touch /var/log/slurmctld.log 
chown slurm: /var/log/slurmctld.log 
touch /var/log/slurm_jobacct.log /var/log/slurm_jobcomp.log 
chown slurm: /var/log/slurm_jobacct.log /var/log/slurm_jobcomp.log

On the compute nodes, create the spool directory and log files:

mkdir /var/spool/slurmd
chown slurm: /var/spool/slurmd
chmod 755 /var/spool/slurmd
touch /var/log/slurmd.log
chown slurm: /var/log/slurmd.log
touch /var/log/slurm_jobacct.log /var/log/slurm_jobcomp.log 
chown slurm: /var/log/slurm_jobacct.log /var/log/slurm_jobcomp.log

The firewall must be turned off on the compute nodes:

systemctl stop firewalld 
systemctl disable firewalld

Test whether Slurm was installed successfully:

If you get an error saying the control node cannot be contacted, restart Slurm.

slurmd -C
/etc/init.d/slurm start

The output looks like this:

ClusterName=(null) NodeName=node16 CPUs=20 Boards=1 SocketsPerBoard=2 CoresPerSocket=10 ThreadsPerCore=1 RealMemory=63989 TmpDisk=51175
UpTime=18-18:49:53
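
The hardware values reported by `slurmd -C` can be reused for the `NodeName` lines in slurm.conf. The sketch below shows one way to do that. `node_line` is a hypothetical helper, and it assumes the output format shown above, where a `ClusterName=` field (not a node parameter) precedes `NodeName=`.

```shell
# node_line: hypothetical helper. Reads `slurmd -C` output on stdin and
# prints a NodeName line for slurm.conf: it keeps the line containing
# NodeName=, drops the ClusterName= field, and appends State=UNKNOWN.
# The UpTime line is ignored because it has no NodeName= field.
node_line() {
    awk '
        /NodeName=/ {
            out = ""
            for (i = 1; i <= NF; i++) {
                if ($i ~ /^ClusterName=/) continue
                out = out $i " "
            }
            print out "State=UNKNOWN"
        }
    '
}

# Example: slurmd -C | node_line
```

Review the generated line before pasting it into slurm.conf, and add a `NodeAddr=` entry if the configuration uses one, as the file in this guide does.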
