OpenMPI, LSF, InfiniBand, Mellanox OFED and Intel MPI Benchmark: what is going on?

I usually use the Intel MPI Benchmark (IMB) as a quick test to check that everything is OK from a network point of view (connectivity and performance).

 
My test cluster for today:
 
  • OpenMPI version: 1.6.3
  • Mellanox OFED version: MLNX_OFED_LINUX-1.5.3-4.0.8 (OFED-1.5.3-4.0.8)
  • Clustering suite: IBM Platform HPC 3.2 cluster (RHEL 6.2)
  • Nodes:
    • 1 head node
    • 6 compute nodes (5 package-based installs + 1 diskless)

Note: the diskless node is not used here
 

 1. Mellanox OFED install


The OS_OFED components need to be disabled for the installer nodegroup and for all compute nodegroups in use (compute-rhel and compute-diskless-rhel in my case).
You can then just mount the ISO image and run the install script, as sketched below. The HCA firmware will be flashed during the process if needed.
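A minimal sketch of that step (the mount point and ISO file name are illustrative; mlnxofedinstall is the installer script shipped on the MLNX_OFED image):

mkdir -p /mnt/mlnx_ofed
mount -o ro,loop MLNX_OFED_LINUX-1.5.3-4.0.8-rhel6.2-x86_64.iso /mnt/mlnx_ofed
/mnt/mlnx_ofed/mlnxofedinstall
umount /mnt/mlnx_ofed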

 

2. Open MPI install and configuration
2.1 download OpenMPI and uncompress the tarball

mkdir /shared/compile_temp/openmpi/
cd /shared/compile_temp/openmpi/
wget http://www.open-mpi.org/software/ompi/v1.6/downloads/openmpi-1.6.3.tar.bz2
tar -xjf openmpi-1.6.3.tar.bz2

 
2.2 Configure, compile and install OpenMPI

Note: if LSF is in the PATH, support for it will be enabled automatically. If you are not sure, you can force it:


cd openmpi-1.6.3
./configure --prefix=/shared/ompi-1.6.3-lsf --enable-orterun-prefix-by-default --with-lsf=/opt/lsf/8.3 --with-lsf-libdir=/opt/lsf/8.3/linux2.6-glibc2.3-x86_64/lib



Note: the path used for the option --with-lsf-libdir is $LSF_LIBDIR.
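If the LSF environment is loaded (a sketch; the profile path assumes a default install under /opt/lsf), the value can be checked directly:

source /opt/lsf/conf/profile.lsf    # usually already done by the login environment
echo $LSF_LIBDIR
# on this cluster: /opt/lsf/8.3/linux2.6-glibc2.3-x86_64/lib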


make

make check
make install

2.3 create an environment module for OpenMPI

2.3.1 add the modules directory to MODULEPATH


I added the following to my .bashrc: export MODULEPATH=$MODULEPATH:/home/mehdi/modules
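Alternatively, the same directory can be added just for the current session with module use (only an option, not what I did here):

module use /home/mehdi/modules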
 

2.3.2 create the module file (/home/mehdi/modules/ompilsf):



#%Module1.0
##
## OpenMPI modulefile
##
proc ModulesHelp { } {
  global openmpiversion

  puts stderr "\tAdds OpenMPI $openmpiversion to your environment variables."
}

module-whatis "adds OpenMPI to your environment variables"

set openmpiversion 1.6.3
set root /shared/ompi-1.6.3-lsf/

prepend-path PATH $root/bin
prepend-path MANPATH $root/man
setenv MPI_HOME $root
setenv MPI_RUN $root/bin/mpirun

prepend-path LD_RUN_PATH $root/lib
prepend-path LD_LIBRARY_PATH $root/lib




2.3.3 check the module works as expected

The first step is to check which modules are currently loaded:


[mehdi@atsplat1 ~]$ module list
Currently Loaded Modulefiles:
  1) null


Then we can check for the available modules:


[mehdi@atsplat1 ~]$ module avail

----------------------------------------------------- /usr/share/Modules/modulefiles -----------------------------------------------------
PMPI/modulefile dot             module-cvs      module-info     modules         null            use.own

---------------------------------------------------------- /home/mehdi/modules -----------------------------------------------------------
ompi    ompilsf


And finally, load the module we just created and double-check that the environment is properly defined:


[mehdi@atsplat1 ~]$ module load ompilsf
[mehdi@atsplat1 ~]$ module list
Currently Loaded Modulefiles:
  1) null      2) ompilsf
[mehdi@atsplat1 ~]$ echo $MPI_HOME
/shared/ompi-1.6.3-lsf/
[mehdi@atsplat1 ~]$ which mpicc
/shared/ompi-1.6.3-lsf/bin/mpicc
[mehdi@atsplat1 ~]$


 


2.4 check OpenMPI


2.4.1 verify that support for openib and lsf is there:

 


[mehdi@atsplat1 ~]$ ompi_info | grep openib
                 MCA btl: openib (MCA v2.0, API v2.0, Component v1.6.3)
[mehdi@atsplat1 ~]$ ompi_info | grep lsf
                  Prefix: /shared/ompi-1.6.3-lsf
                 MCA ras: lsf (MCA v2.0, API v2.0, Component v1.6.3)
                 MCA plm: lsf (MCA v2.0, API v2.0, Component v1.6.3)
                 MCA ess: lsf (MCA v2.0, API v2.0, Component v1.6.3)
[mehdi@atsplat1 ~]$


 

2.4.2 copy over an example from the OpenMPI distribution and modify it so the output contains the hostname of the node running each MPI process:


cp /shared/compile_temp/openmpi/openmpi-1.6.3/examples/hello_c.c .
cp hello_c.c hello_c.orig
vim hello_c.c



The diff is below:


[mehdi@atsplat1 ~]$ diff -U4  hello_c.orig hello_c.c
--- hello_c.orig    2013-02-13 13:04:42.074961444 -0600
+++ hello_c.c    2013-02-13 13:06:02.216989733 -0600
@@ -12,13 +12,15 @@
 
 int main(int argc, char* argv[])
 {
     int rank, size;
+    char hostname[256];
 
     MPI_Init(&argc, &argv);
     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
     MPI_Comm_size(MPI_COMM_WORLD, &size);
-    printf("Hello, world, I am %d of %d\n", rank, size);
+    gethostname(hostname,255);
+    printf("Hello, world, I am %d of %d on host %s\n", rank, size, hostname);
     MPI_Finalize();
 
     return 0;
 }
[mehdi@atsplat1 ~]$



2.4.3 compile the modified example:

 


mpicc -o hello hello_c.c


 

2.4.4 run the example manually, using a host file and outside of LSF
 

My hostfile is:
 


[mehdi@atsplat1 ~]$ cat hosts
compute002
compute004
[mehdi@atsplat1 ~]$




2.4.4.1 run with no specific options and fix any problems


[mehdi@atsplat1 ~]$ mpirun -np 2 --hostfile ./hosts ./hello
--------------------------------------------------------------------------
WARNING: It appears that your OpenFabrics subsystem is configured to only
allow registering part of your physical memory.  This can cause MPI jobs to
run with erratic performance, hang, and/or crash.

This may be caused by your OpenFabrics vendor limiting the amount of
physical memory that can be registered.  You should investigate the
relevant Linux kernel module parameters that control how much physical
memory can be registered, and increase them to allow registering all
physical memory on your machine.

See this Open MPI FAQ item for more information on these Linux kernel module
parameters:

    http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages

  Local host:              compute002
  Registerable memory:     4096 MiB
  Total memory:            32745 MiB

Your MPI job will continue, but may be behave poorly and/or hang.
--------------------------------------------------------------------------
Hello, world, I am 0 of 2 on host compute002
Hello, world, I am 1 of 2 on host compute004
[atsplat1:24675] 1 more process has sent help message help-mpi-btl-openib.txt / reg mem limit low
[atsplat1:24675] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages



As you can see, the first try is not really great. Let's see if everything is fine when using the TCP fabric:
 


[mehdi@atsplat1 ~]$ mpirun -np 2 --hostfile ./hosts --mca btl self,tcp ./hello
Hello, world, I am 1 of 2 on host compute004
Hello, world, I am 0 of 2 on host compute002
[mehdi@atsplat1 ~]$



So the problem really comes from my IB setup. Looking at the page mentioned in the output of the first job (the one that used the IB fabric), there are two possible causes for the warning: one related to the Linux kernel module parameters and one related to the locked memory limits.

The maximum locked memory should not be an issue, as /etc/security/limits.conf on my compute nodes ends with:

* soft memlock unlimited
* hard memlock unlimited
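To confirm the limit is actually effective on a compute node (a quick sketch; ulimit -l reports the locked-memory limit of the shell that runs it):

ssh compute002 'ulimit -l'
# expected output: unlimited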


So my issue must be at the kernel module level. There is an article on IBM developerWorks that explains very well how to solve this kind of problem.

Looking at the mlx4_core module parameters tells me that I need to change them:
 


[root@compute004 ~]# cat /sys/module/mlx4_core/parameters/log_num_mtt
0
[root@compute004 ~]# cat /sys/module/mlx4_core/parameters/log_mtts_per_seg
0
[root@compute004 ~]#




And the PAGE_SIZE value is
 


[root@compute004 ~]# getconf PAGE_SIZE
4096
[root@compute004 ~]#




My nodes have 32 GB of memory, and the usual recommendation is to be able to register at least twice the physical memory. The registerable memory is given by:

max_reg_mem = (2^log_num_mtt) * (2^log_mtts_per_seg) * PAGE_SIZE

so with the 4 KiB page size above:

max_reg_mem = (2^23) * (2^1) * 4 KiB = 64 GiB

which means:

log_num_mtt = 23
log_mtts_per_seg = 1

To make these values permanent, add the following line to /etc/modprobe.d/mlx4_en.conf:
 


options mlx4_core log_num_mtt=23 log_mtts_per_seg=1


 

Then we need to restart openibd on each compute node so that the mlx4 modules are unloaded and reloaded.
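The restart is roughly the following (a sketch, assuming the stock openibd init script shipped with MLNX_OFED; run as root):

service openibd restart

After that, check the new values of log_num_mtt and log_mtts_per_seg: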
 


[root@compute004 ~]# cat /sys/module/mlx4_core/parameters/log_num_mtt
23
[root@compute004 ~]# cat /sys/module/mlx4_core/parameters/log_mtts_per_seg
1
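As an optional sanity check (a sketch; it only makes sense once both parameters report non-zero values), the registerable memory implied by the parameters can be recomputed on the node:

log_num_mtt=$(cat /sys/module/mlx4_core/parameters/log_num_mtt)
log_mtts_per_seg=$(cat /sys/module/mlx4_core/parameters/log_mtts_per_seg)
page_size=$(getconf PAGE_SIZE)
echo "max_reg_mem = $(( (1 << log_num_mtt) * (1 << log_mtts_per_seg) * page_size / 1024**3 )) GiB"
# prints: max_reg_mem = 64 GiB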



Perfect, now it is time to run the example again:
 


[mehdi@atsplat1 ~]$ mpirun -np 2 --hostfile ./hosts ./hello
Hello, world, I am 1 of 2 on host compute004
Hello, world, I am 0 of 2 on host compute002
[mehdi@atsplat1 ~]$



So now we can force the use of the IB fabric only:
 


[mehdi@atsplat1 ~]$ mpirun -np 2 --hostfile ./hosts --mca btl self,openib ./hello
Hello, world, I am 1 of 2 on host compute004
Hello, world, I am 0 of 2 on host compute002
[mehdi@atsplat1 ~]$



2.4.4.2 Final fix in the IBM Platform HPC framework

We use the CFM framework to append the options to the /etc/modprobe.d/mlx4_en.conf file on the compute nodes. In the CFM directory for the compute nodegroup in use, in my case:

/etc/cfm/compute-rhel-6.2-x86_64eth0/

create the appropriate subdirectory and file:

etc/modprobe.d/mlx4_en.conf.append

The mlx4_en.conf.append file contains only two lines.

Then run cfmsync -f -n compute-rhel-6.2-x86_64eth0 to push the change to the nodes.
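Put together, the CFM steps look roughly like this (a sketch; the content written into the .append file is the mlx4_core options line discussed above):

cd /etc/cfm/compute-rhel-6.2-x86_64eth0/
mkdir -p etc/modprobe.d
vi etc/modprobe.d/mlx4_en.conf.append    # add the mlx4_core options line
cfmsync -f -n compute-rhel-6.2-x86_64eth0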


2.5 compile IMB


mkdir /shared/compile_temp/IMB/
cd /shared/compile_temp/IMB/



Assuming the tarball is already there:
 


tar -xzf IMB_3.2.3.tgz



Then we need to go into the src directory and create a makefile for OpenMPI:
 


[root@atsplat1 IMB]# cd imb_3.2.3/src/

[root@atsplat1 src]# cp make_mpich make_ompi



The MPI_HOME setting is commented out, as the module file already defines it in the environment.

The diff:
 


[root@atsplat1 src]# diff -U4 make_mpich make_ompi
--- make_mpich    2011-11-07 16:09:42.000000000 -0600
+++ make_ompi    2013-02-13 09:36:45.616993735 -0600
@@ -1,6 +1,6 @@
 # Enter root directory of mpich install
-MPI_HOME=
+#MPI_HOME=
 
 MPICC=$(shell find ${MPI_HOME} -name mpicc -print)
 
 NULL_STRING :=
[root@atsplat1 src]#



Time to compile:
 


[root@atsplat1 src]# make -f make_ompi





2.6 run IMB to get the bandwidth associated with each fabric

One of my nodes showed errors on its InfiniBand card, so from now on I'll be using a different host file:
 


[mehdi@atsplat1 ~]$ cat hosts
compute005
compute002
[mehdi@atsplat1 ~]$



2.6.1 run using native InfiniBand:


[mehdi@atsplat1 ~]$ mpirun -np 2 --mca btl self,openib  --hostfile ./hosts ./IMB-MPI1 pingpong
 benchmarks to run pingpong
#---------------------------------------------------
#    Intel (R) MPI Benchmark Suite V3.2.3, MPI-1 part   
#---------------------------------------------------
# Date                  : Thu Feb 14 08:45:12 2013
# Machine               : x86_64
# System                : Linux
# Release               : 2.6.32-220.el6.x86_64
# Version               : #1 SMP Wed Nov 9 08:03:13 EST 2011
# MPI Version           : 2.1
# MPI Thread Environment:

# New default behavior from Version 3.2 on:

# the number of iterations per message size is cut down
# dynamically when a certain run time (per message size sample)
# is expected to be exceeded. Time limit is defined by variable
# "SECS_PER_SAMPLE" (=> IMB_settings.h)
# or through the flag => -time
 


# Calling sequence was:

# ./IMB-MPI1 pingpong

# Minimum message length in bytes:   0
# Maximum message length in bytes:   4194304
#
# MPI_Datatype                   :   MPI_BYTE
# MPI_Datatype for reductions    :   MPI_FLOAT
# MPI_Op                         :   MPI_SUM 
#
#

# List of Benchmarks to run:

# PingPong

#---------------------------------------------------
# Benchmarking PingPong
# #processes = 2
#---------------------------------------------------
       #bytes #repetitions      t[usec]   Mbytes/sec
            0         1000         1.50         0.00
            1         1000         1.49         0.64
            2         1000         1.26         1.52
            4         1000         1.25         3.05
            8         1000         1.30         5.89
           16         1000         1.30        11.69
           32         1000         1.33        22.87
           64         1000         1.37        44.60
          128         1000         1.98        61.61
          256         1000         2.12       115.41
          512         1000         2.29       213.54
         1024         1000         2.63       370.96
         2048         1000         3.46       563.82
         4096         1000         4.14       942.99
         8192         1000         5.63      1387.80
        16384         1000         8.12      1924.25
        32768         1000        11.63      2686.07
        65536          640        18.67      3347.84
       131072          320        32.46      3851.17
       262144          160        60.21      4152.16
       524288           80       115.53      4327.82
      1048576           40       226.39      4417.21
      2097152           20       447.33      4471.00
      4194304           10       890.95      4489.61


# All processes entering MPI_Finalize

[mehdi@atsplat1 ~]$



The bandwidth and latency are consistent with the hardware capability:


[root@compute004 ~]# ibstatus
Infiniband device 'mlx4_0' port 1 status:
    default gid:     fe80:0000:0000:0000:5cf3:fc00:0004:ec2b
    base lid:     0x0
    sm lid:         0x0
    state:         1: DOWN
    phys state:     3: Disabled
    rate:         40 Gb/sec (4X QDR)
    link_layer:     InfiniBand

Infiniband device 'mlx4_0' port 2 status:
    default gid:     fe80:0000:0000:0000:5cf3:fc00:0004:ec2c
    base lid:     0x10
    sm lid:         0x2
    state:         4: ACTIVE
    phys state:     5: LinkUp
    rate:         40 Gb/sec (4X FDR10)
    link_layer:     InfiniBand

[root@compute004 ~]#


 

2.6.2 run using the TCP fabric: surprise!


[mehdi@atsplat1 ~]$ mpirun -np 2 --mca btl self,tcp  --hostfile ./hosts ./IMB-MPI1 pingpong

...


#---------------------------------------------------
# Benchmarking PingPong
# #processes = 2
#---------------------------------------------------
       #bytes #repetitions      t[usec]   Mbytes/sec
            0         1000        17.15         0.00
            1         1000        17.35         0.05
            2         1000        17.34         0.11
            4         1000        17.36         0.22
            8         1000        17.41         0.44
           16         1000        17.43         0.88
           32         1000        17.74         1.72
           64         1000        18.55         3.29
          128         1000        19.08         6.40
          256         1000        20.74        11.77
          512         1000        23.93        20.41
         1024         1000        29.15        33.50
         2048         1000       269.92         7.24
         4096         1000       271.10        14.41
         8192         1000       272.24        28.70
        16384         1000       273.65        57.10
        32768         1000       289.68       107.88
        65536          640       790.10        79.10
       131072          320      1054.85       118.50
       262144          160      1397.37       178.91
       524288           80      2552.95       195.85
      1048576           40      4664.89       214.37
      2097152           20      9089.75       220.03
      4194304           10     18022.91       221.94


# All processes entering MPI_Finalize

[mehdi@atsplat1 ~]$



The big surprise is the bandwidth, which is roughly twice what you would expect from a Gb Ethernet network. The explanation is that Open MPI aggregates links when they sit on compatible networks: the TCP BTL stripes traffic across all usable interfaces, here the Gb Ethernet interface and the IPoIB interface (the Open MPI FAQ covers this in more detail). Another noticeable effect of this aggregation is that you are limited by the slowest interface, which is why in this case we get about twice the bandwidth of a single Gb interface rather than something closer to the IPoIB numbers.
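If you want to see which interfaces the TCP BTL actually selects, raising the BTL verbosity should show it (a sketch; the verbosity level is arbitrary):

mpirun -np 2 --hostfile ./hosts --mca btl self,tcp --mca btl_base_verbose 30 ./hello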
 

2.6.3 run using only the eth0 interface


[mehdi@atsplat1 ~]$ mpirun -np 2 --mca btl self,tcp  --mca btl_tcp_if_include eth0 --hostfile ./hosts ./IMB-MPI1 pingpong
 
...

#---------------------------------------------------
# Benchmarking PingPong
# #processes = 2
#---------------------------------------------------
       #bytes #repetitions      t[usec]   Mbytes/sec
            0         1000        24.52         0.00
            1         1000        24.52         0.04
            2         1000        24.53         0.08
            4         1000        24.53         0.16
            8         1000        24.50         0.31
           16         1000        24.53         0.62
           32         1000        24.53         1.24
           64         1000        24.50         2.49
          128         1000        24.51         4.98
          256         1000        24.55         9.95
          512         1000        29.45        16.58
         1024         1000        47.34        20.63
         2048         1000       551.01         3.54
         4096         1000       547.24         7.14
         8192         1000       545.31        14.33
        16384         1000       551.50        28.33
        32768         1000       559.61        55.84
        65536          640      1056.22        59.17
       131072          320      1391.63        89.82
       262144          160      2544.80        98.24
       524288           80      4658.02       107.34
      1048576           40      9076.14       110.18
      2097152           20     18080.40       110.62
      4194304           10     35776.65       111.80


# All processes entering MPI_Finalize

[mehdi@atsplat1 ~]$



2.6.4 run using only the IPoIB interface


[mehdi@atsplat1 ~]$ mpirun -np 2 --mca btl self,tcp  --mca btl_tcp_if_include ib1 --hostfile ./hosts ./IMB-MPI1 pingpong
...
#---------------------------------------------------
# Benchmarking PingPong
# #processes = 2
#---------------------------------------------------
       #bytes #repetitions      t[usec]   Mbytes/sec
            0         1000         9.33         0.00
            1         1000         9.54         0.10
            2         1000         9.54         0.20
            4         1000         9.57         0.40
            8         1000         9.54         0.80
           16         1000         9.53         1.60
           32         1000         9.54         3.20
           64         1000         9.57         6.38
          128         1000         9.72        12.56
          256         1000        10.42        23.43
          512         1000        10.73        45.50
         1024         1000        11.32        86.25
         2048         1000        12.62       154.76
         4096         1000        14.84       263.19
         8192         1000        18.66       418.77
        16384         1000        26.97       579.31
        32768         1000        47.33       660.20
        65536          640        95.72       652.96
       131072          320       120.42      1038.04
       262144          160       198.02      1262.50
       524288           80       342.29      1460.73
      1048576           40       656.27      1523.77
      2097152           20      1273.80      1570.10
      4194304           10      2526.25      1583.38


# All processes entering MPI_Finalize

[mehdi@atsplat1 ~]$



2.7 Using LSF to submit jobs

2.7.1 basic submission and how to get information about the LSF job

As mentioned earlier, OpenMPI was compiled with support for LSF, which means we can use mpirun natively in bsub scripts and invocations. For example, the following requests 2 job slots (by default equivalent to 2 cores) on 2 different nodes:
 


[mehdi@atsplat1 ~]$ bsub -n 2 -R span[ptile=1] mpirun ./IMB-MPI1 pingpong
Job <3525> is submitted to default queue <medium_priority>.

We can see which nodes are running the job:

[mehdi@atsplat1 ~]$ bjobs
JOBID   USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
3525    mehdi   RUN   medium_pri atsplat1    compute002  * pingpong Feb 14 09:03
                                             compute003

And also get detailed information about the job itself, once done:
[mehdi@atsplat1 ~]$ bhist -l 3525

Job <3525>, User <mehdi>, Project <default>, Command <mpirun ./IMB-MPI1 pingpon
                     g>
Thu Feb 14 09:03:49: Submitted from host <atsplat1>, to Queue <medium_priority>
                     , CWD <$HOME>, 2 Processors Requested, Requested Resources
                      <span[ptile=1]>;
Thu Feb 14 09:03:51: Dispatched to 2 Hosts/Processors <compute002> <compute003>
                     ;
Thu Feb 14 09:03:51: Starting (Pid 28169);
Thu Feb 14 09:03:51: Running with execution home </home/mehdi>, Execution CWD <
                     /home/mehdi>, Execution Pid <28169>;
Thu Feb 14 09:03:53: Done successfully. The CPU time used is 1.3 seconds;
Thu Feb 14 09:04:01: Post job process done successfully;

Summary of time in seconds spent in various states by  Thu Feb 14 09:04:01
  PEND     PSUSP    RUN      USUSP    SSUSP    UNKWN    TOTAL
  2        0        2        0        0        0        4          

[mehdi@atsplat1 ~]$
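The submissions above pass mpirun directly on the bsub command line; the same thing also works from a small job script using standard #BSUB directives (a sketch; the file name is arbitrary):

cat > imb_pingpong.lsf << 'EOF'
#!/bin/sh
#BSUB -n 2
#BSUB -R "span[ptile=1]"
mpirun ./IMB-MPI1 pingpong
EOF
bsub < imb_pingpong.lsf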




2.7.2 how to get the output and verify everything is fine

The best way to double-check that the information reported by LSF is accurate is to run our hello job again and capture the output in a file. In this case we'll tag the output and error files with the job number (-o%J.out and -e%J.err):
 


[mehdi@atsplat1 ~]$ bsub -o%J.out -e%J.err -n 2 -R span[ptile=1] mpirun ./hello
Job <3527> is submitted to default queue <medium_priority>.
[mehdi@atsplat1 ~]$
[mehdi@atsplat1 ~]$ bjobs
JOBID   USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
3527    mehdi   RUN   medium_pri atsplat1    compute003  *n ./hello Feb 14 09:10
                                             compute004
[mehdi@atsplat1 ~]$ ls 3527*
3527.err  3527.out
[mehdi@atsplat1 ~]$



Good news: the error file is empty, and the output file shows a perfect match between the hostnames reported by the MPI processes and the nodes allocated by LSF:
 


[mehdi@atsplat1 ~]$ cat 3527.err
[mehdi@atsplat1 ~]$ cat 3527.out
Sender: LSF System <hpcadmin@compute003>
Subject: Job 3527: <mpirun ./hello> Done

Job <mpirun ./hello> was submitted from host <atsplat1> by user <mehdi> in cluster <atsplat1_cluster1>.
Job was executed on host(s) <1*compute003>, in queue <medium_priority>, as user <mehdi> in cluster <atsplat1_cluster1>.
                            <1*compute004>
</home/mehdi> was used as the home directory.
</home/mehdi> was used as the working directory.
Started at Thu Feb 14 09:10:20 2013
Results reported at Thu Feb 14 09:10:28 2013

Your job looked like:

------------------------------------------------------------
# LSBATCH: User input
mpirun ./hello
------------------------------------------------------------

Successfully completed.

Resource usage summary:

    CPU time   :      0.19 sec.
    Max Memory :         1 MB
    Max Swap   :        30 MB

    Max Processes  :         1
    Max Threads    :         1

The output (if any) follows:

Hello, world, I am 0 of 2 on host compute003
Hello, world, I am 1 of 2 on host compute004


PS:

Read file <3527.err> for stderr output of this job.

[mehdi@atsplat1 ~]$


 

2.7.3 Submitting the IMB jobs through LSF

 

2.7.3.1 using the IB fabric


[mehdi@atsplat1 ~]$ bsub -o%J.out -e%J.err -n 2 -R span[ptile=1] mpirun --mca btl self,openib ./IMB-MPI1 pingpong
Job <3529> is submitted to default queue <medium_priority>.



And the output:
 


[mehdi@atsplat1 ~]$ cat 3529.out
Sender: LSF System <hpcadmin@compute005>
Subject: Job 3529: <mpirun --mca btl self,openib ./IMB-MPI1 pingpong> Done

Job <mpirun --mca btl self,openib ./IMB-MPI1 pingpong> was submitted from host <atsplat1> by user <mehdi> in cluster <atsplat1_cluster1>.
Job was executed on host(s) <1*compute005>, in queue <medium_priority>, as user <mehdi> in cluster <atsplat1_cluster1>.
                            <1*compute006>
</home/mehdi> was used as the home directory.
</home/mehdi> was used as the working directory.
Started at Thu Feb 14 09:28:54 2013
Results reported at Thu Feb 14 09:28:56 2013

Your job looked like:

------------------------------------------------------------
# LSBATCH: User input
mpirun --mca btl self,openib ./IMB-MPI1 pingpong
------------------------------------------------------------

Successfully completed.

Resource usage summary:

    CPU time   :      1.25 sec.
    Max Memory :         1 MB
    Max Swap   :        30 MB

    Max Processes  :         1
    Max Threads    :         1

The output (if any) follows:

 benchmarks to run pingpong
#---------------------------------------------------
#    Intel (R) MPI Benchmark Suite V3.2.3, MPI-1 part   
#---------------------------------------------------
# Date                  : Thu Feb 14 09:28:55 2013
# Machine               : x86_64
# System                : Linux
# Release               : 2.6.32-220.el6.x86_64
# Version               : #1 SMP Wed Nov 9 08:03:13 EST 2011
# MPI Version           : 2.1
# MPI Thread Environment:

# New default behavior from Version 3.2 on:

# the number of iterations per message size is cut down
# dynamically when a certain run time (per message size sample)
# is expected to be exceeded. Time limit is defined by variable
# "SECS_PER_SAMPLE" (=> IMB_settings.h)
# or through the flag => -time
 


# Calling sequence was:

# ./IMB-MPI1 pingpong

# Minimum message length in bytes:   0
# Maximum message length in bytes:   4194304
#
# MPI_Datatype                   :   MPI_BYTE
# MPI_Datatype for reductions    :   MPI_FLOAT
# MPI_Op                         :   MPI_SUM 
#
#

# List of Benchmarks to run:

# PingPong

#---------------------------------------------------
# Benchmarking PingPong
# #processes = 2
#---------------------------------------------------
       #bytes #repetitions      t[usec]   Mbytes/sec
            0         1000         1.42         0.00
            1         1000         1.48         0.65
            2         1000         1.29         1.48
            4         1000         1.25         3.06
            8         1000         1.27         5.99
           16         1000         1.28        11.90
           32         1000         1.30        23.46
           64         1000         1.34        45.67
          128         1000         1.97        61.82
          256         1000         2.11       115.84
          512         1000         2.29       213.60
         1024         1000         2.64       370.53
         2048         1000         3.46       564.81
         4096         1000         4.14       943.10
         8192         1000         5.65      1383.23
        16384         1000         8.13      1922.95
        32768         1000        11.67      2678.25
        65536          640        18.59      3361.36
       131072          320        32.47      3849.49
       262144          160        60.17      4155.19
       524288           80       115.51      4328.54
      1048576           40       226.20      4420.87
      2097152           20       447.60      4468.26
      4194304           10       890.91      4489.79


# All processes entering MPI_Finalize



PS:

Read file <3529.err> for stderr output of this job.

[mehdi@atsplat1 ~]$

2.7.3.2 using the TCP fabric


 


[mehdi@atsplat1 ~]$ bsub -o%J.out -e%J.err -n 2 -R span[ptile=1] mpirun --mca btl self,tcp ./IMB-MPI1 pingpong
Job <3533> is submitted to default queue <medium_priority>.




The output:
 


[mehdi@atsplat1 ~]$ cat 3533.out
Sender: LSF System <hpcadmin@compute002>
Subject: Job 3533: <mpirun --mca btl self,tcp ./IMB-MPI1 pingpong> Done

Job <mpirun --mca btl self,tcp ./IMB-MPI1 pingpong> was submitted from host <atsplat1> by user <mehdi> in cluster <atsplat1_cluster1>.
Job was executed on host(s) <1*compute002>, in queue <medium_priority>, as user <mehdi> in cluster <atsplat1_cluster1>.
                            <1*compute005>
</home/mehdi> was used as the home directory.
</home/mehdi> was used as the working directory.
Started at Thu Feb 14 09:37:45 2013
Results reported at Thu Feb 14 09:37:58 2013

Your job looked like:

------------------------------------------------------------
# LSBATCH: User input
mpirun --mca btl self,tcp ./IMB-MPI1 pingpong
------------------------------------------------------------

Successfully completed.

Resource usage summary:

    CPU time   :     25.44 sec.
    Max Memory :         1 MB
    Max Swap   :        30 MB

    Max Processes  :         1
    Max Threads    :         1

The output (if any) follows:

 benchmarks to run pingpong
#---------------------------------------------------
#    Intel (R) MPI Benchmark Suite V3.2.3, MPI-1 part   
#---------------------------------------------------
# Date                  : Thu Feb 14 09:37:46 2013
# Machine               : x86_64
# System                : Linux
# Release               : 2.6.32-220.el6.x86_64
# Version               : #1 SMP Wed Nov 9 08:03:13 EST 2011
# MPI Version           : 2.1
# MPI Thread Environment:

# New default behavior from Version 3.2 on:

# the number of iterations per message size is cut down
# dynamically when a certain run time (per message size sample)
# is expected to be exceeded. Time limit is defined by variable
# "SECS_PER_SAMPLE" (=> IMB_settings.h)
# or through the flag => -time
 


# Calling sequence was:

# ./IMB-MPI1 pingpong

# Minimum message length in bytes:   0
# Maximum message length in bytes:   4194304
#
# MPI_Datatype                   :   MPI_BYTE
# MPI_Datatype for reductions    :   MPI_FLOAT
# MPI_Op                         :   MPI_SUM 
#
#

# List of Benchmarks to run:

# PingPong

#---------------------------------------------------
# Benchmarking PingPong
# #processes = 2
#---------------------------------------------------
       #bytes #repetitions      t[usec]   Mbytes/sec
            0         1000        15.44         0.00
            1         1000        15.64         0.06
            2         1000        15.63         0.12
            4         1000        15.67         0.24
            8         1000        15.68         0.49
           16         1000        15.75         0.97
           32         1000        15.98         1.91
           64         1000        16.89         3.61
          128         1000        17.40         7.01
          256         1000        18.98        12.87
          512         1000        22.07        22.12
         1024         1000        27.37        35.68
         2048         1000       151.39        12.90
         4096         1000       183.53        21.28
         8192         1000       272.62        28.66
        16384         1000       273.90        57.05
        32768         1000       284.92       109.68
        65536          640       787.86        79.33
       131072          320      1054.46       118.54
       262144          160      1398.88       178.71
       524288           80      2559.74       195.33
      1048576           40      4671.30       214.07
      2097152           20      9101.65       219.74
      4194304           10     17931.90       223.07


# All processes entering MPI_Finalize



PS:

Read file <3533.err> for stderr output of this job.

[mehdi@atsplat1 ~]$


2.7.3.3 using the interface eth0 only


 


[mehdi@atsplat1 ~]$ bsub -o%J.out -e%J.err -n 2 -R span[ptile=1] mpirun --mca btl self,tcp --mca btl_tcp_if_include eth0 ./IMB-MPI1 pingpong
Job <3535> is submitted to default queue <medium_priority>.



And the output:

 


[mehdi@atsplat1 ~]$ cat 3535.out
Sender: LSF System <hpcadmin@compute002>
Subject: Job 3535: <mpirun --mca btl self,tcp --mca btl_tcp_if_include eth0 ./IMB-MPI1 pingpong> Done

Job <mpirun --mca btl self,tcp --mca btl_tcp_if_include eth0 ./IMB-MPI1 pingpong> was submitted from host <atsplat1> by user <mehdi> in cluster <atsplat1_cluster1>.
Job was executed on host(s) <1*compute002>, in queue <medium_priority>, as user <mehdi> in cluster <atsplat1_cluster1>.
                            <1*compute005>
</home/mehdi> was used as the home directory.
</home/mehdi> was used as the working directory.
Started at Thu Feb 14 09:40:37 2013
Results reported at Thu Feb 14 09:40:58 2013

Your job looked like:

------------------------------------------------------------
# LSBATCH: User input
mpirun --mca btl self,tcp --mca btl_tcp_if_include eth0 ./IMB-MPI1 pingpong
------------------------------------------------------------

Successfully completed.

Resource usage summary:

    CPU time   :     40.69 sec.
    Max Memory :         1 MB
    Max Swap   :        30 MB

    Max Processes  :         1
    Max Threads    :         1

The output (if any) follows:

 benchmarks to run pingpong
#---------------------------------------------------
#    Intel (R) MPI Benchmark Suite V3.2.3, MPI-1 part   
#---------------------------------------------------
# Date                  : Thu Feb 14 09:40:38 2013
# Machine               : x86_64
# System                : Linux
# Release               : 2.6.32-220.el6.x86_64
# Version               : #1 SMP Wed Nov 9 08:03:13 EST 2011
# MPI Version           : 2.1
# MPI Thread Environment:

# New default behavior from Version 3.2 on:

# the number of iterations per message size is cut down
# dynamically when a certain run time (per message size sample)
# is expected to be exceeded. Time limit is defined by variable
# "SECS_PER_SAMPLE" (=> IMB_settings.h)
# or through the flag => -time
 


# Calling sequence was:

# ./IMB-MPI1 pingpong

# Minimum message length in bytes:   0
# Maximum message length in bytes:   4194304
#
# MPI_Datatype                   :   MPI_BYTE
# MPI_Datatype for reductions    :   MPI_FLOAT
# MPI_Op                         :   MPI_SUM 
#
#

# List of Benchmarks to run:

# PingPong

#---------------------------------------------------
# Benchmarking PingPong
# #processes = 2
#---------------------------------------------------
       #bytes #repetitions      t[usec]   Mbytes/sec
            0         1000        24.55         0.00
            1         1000        24.50         0.04
            2         1000        24.52         0.08
            4         1000        24.52         0.16
            8         1000        24.50         0.31
           16         1000        24.52         0.62
           32         1000        24.57         1.24
           64         1000        24.65         2.48
          128         1000        24.89         4.90
          256         1000        29.11         8.39
          512         1000        51.64         9.45
         1024         1000        47.34        20.63
         2048         1000       218.24         8.95
         4096         1000       240.50        16.24
         8192         1000       245.51        31.82
        16384         1000       324.48        48.15
        32768         1000       704.06        44.39
        65536          640      1338.30        46.70
       131072          320      1702.87        73.41
       262144          160      2638.45        94.75
       524288           80      4792.39       104.33
      1048576           40      9190.72       108.81
      2097152           20     18132.53       110.30
      4194304           10     35818.20       111.68


# All processes entering MPI_Finalize



PS:

Read file <3535.err> for stderr output of this job.

[mehdi@atsplat1 ~]$


2.7.3.4 using the interface ib1 only



 


[mehdi@atsplat1 ~]$ bsub -o%J.out -e%J.err -n 2 -R span[ptile=1] mpirun --mca btl self,tcp --mca btl_tcp_if_include ib1 ./IMB-MPI1 pingpong
Job <3534> is submitted to default queue <medium_priority>.

And the output:
[mehdi@atsplat1 ~]$ cat 3534.out
Sender: LSF System <hpcadmin@compute002>
Subject: Job 3534: <mpirun --mca btl self,tcp --mca btl_tcp_if_include ib1 ./IMB-MPI1 pingpong> Done

Job <mpirun --mca btl self,tcp --mca btl_tcp_if_include ib1 ./IMB-MPI1 pingpong> was submitted from host <atsplat1> by user <mehdi> in cluster <atsplat1_cluster1>.
Job was executed on host(s) <1*compute002>, in queue <medium_priority>, as user <mehdi> in cluster <atsplat1_cluster1>.
                            <1*compute005>
</home/mehdi> was used as the home directory.
</home/mehdi> was used as the working directory.
Started at Thu Feb 14 09:38:26 2013
Results reported at Thu Feb 14 09:38:29 2013

Your job looked like:

------------------------------------------------------------
# LSBATCH: User input
mpirun --mca btl self,tcp --mca btl_tcp_if_include ib1 ./IMB-MPI1 pingpong
------------------------------------------------------------

Successfully completed.

Resource usage summary:

    CPU time   :      4.05 sec.
    Max Memory :         1 MB
    Max Swap   :        30 MB

    Max Processes  :         1
    Max Threads    :         1

The output (if any) follows:

 benchmarks to run pingpong
#---------------------------------------------------
#    Intel (R) MPI Benchmark Suite V3.2.3, MPI-1 part   
#---------------------------------------------------
# Date                  : Thu Feb 14 09:38:27 2013
# Machine               : x86_64
# System                : Linux
# Release               : 2.6.32-220.el6.x86_64
# Version               : #1 SMP Wed Nov 9 08:03:13 EST 2011
# MPI Version           : 2.1
# MPI Thread Environment:

# New default behavior from Version 3.2 on:

# the number of iterations per message size is cut down
# dynamically when a certain run time (per message size sample)
# is expected to be exceeded. Time limit is defined by variable
# "SECS_PER_SAMPLE" (=> IMB_settings.h)
# or through the flag => -time
 


# Calling sequence was:

# ./IMB-MPI1 pingpong

# Minimum message length in bytes:   0
# Maximum message length in bytes:   4194304
#
# MPI_Datatype                   :   MPI_BYTE
# MPI_Datatype for reductions    :   MPI_FLOAT
# MPI_Op                         :   MPI_SUM 
#
#

# List of Benchmarks to run:

# PingPong

#---------------------------------------------------
# Benchmarking PingPong
# #processes = 2
#---------------------------------------------------
       #bytes #repetitions      t[usec]   Mbytes/sec
            0         1000         9.20         0.00
            1         1000         9.43         0.10
            2         1000         9.48         0.20
            4         1000         9.47         0.40
            8         1000         9.45         0.81
           16         1000         9.45         1.62
           32         1000         9.46         3.23
           64         1000         9.48         6.44
          128         1000         9.62        12.69
          256         1000        10.35        23.58
          512         1000        10.67        45.75
         1024         1000        11.37        85.87
         2048         1000        12.60       155.04
         4096         1000        15.05       259.50
         8192         1000        18.58       420.59
        16384         1000        26.89       581.07
        32768         1000        47.50       657.89
        65536          640        95.89       651.77
       131072          320       120.43      1037.97
       262144          160       197.98      1262.77
       524288           80       342.43      1460.15
      1048576           40       657.13      1521.78
      2097152           20      1275.20      1568.38
      4194304           10      2520.35      1587.08


# All processes entering MPI_Finalize



PS:

Read file <3534.err> for stderr output of this job.

[mehdi@atsplat1 ~]$

