MPI compiler wrappers



mpicc: Script to compile and link MPI programs in C.

mpigcc and mpiicc are Intel MPI specific (they select the GNU or Intel compiler, respectively).



mpicxx: Script to compile and link MPI programs in C++.

mpigxx and mpiicpc are Intel MPI specific (they select the GNU or Intel compiler, respectively).



mpif77 / mpif90: Script to compile and link MPI programs in Fortran.

mpiifort is Intel MPI specific.

mpichversion [MPICH] Display the MPICH version.
mpiname [MVAPICH] Display MPI & compiler information.

Use the -a option to print all information.

mpitune [IntelMPI] Automated tuning utility, used during Intel MPI library installation or after a cluster configuration change.
[IntelMPI] Set up Intel MPI related environmental variables (e.g. PATH, LD_LIBRARY_PATH).
logconverter [MPICH] Convert between different MPE log formats (if the program is compiled with the -mpilog option; see below). Some of the log formats are for Jumpshot visualization.

See here for more information.

parkill [MVAPICH] Kill all processes of a program (parallel kill).

MPI compiler wrapper specific switches

[MPICH] Compile the program so that it generates logs/traces. Only one of these switches can be used at a time.

-mpilog will generate an MPE log. It adds the -llmpe -lmpe linking options.
-mpitrace will generate an MPE trace. It adds the -ltmpe -lmpe linking options.
-mpianim will generate an MPE real-time animation. It adds the -lampe -lmpe linking options.

[IntelMPI] Compile the program so that it generates logs/traces. The resulting executable is linked against the Intel Trace Collector library.

The -t=log option will also trace the internals of the Intel MPI library.

-mpe=option [MVAPICH] Compile the program so that it generates logs/traces. option can be mpilog, mpitrace, mpianim, mpicheck, etc.

Use -mpe=help to see all available options and the libraries each will link against.

Link with the MPI profiling interface.
Use pgm to compile and link MPI programs.

This is the same as setting the environmental variables CC, CLINKER, CCC, CCLINKER, F77, F77LINKER, F90, and F90LINKER.

Show the compiler and command-line options that will be used to compile/link MPI programs.
-echo [MPICH, IntelMPI] Show exactly what the script is doing.
Basically it runs the mpicc script with the shell's -x option.
-fast [IntelMPI] Maximize speed across the entire program.
-mt_mpi [IntelMPI] Use the thread-safe version of the Intel MPI Library.
[MPICH] Whether to use shared libraries.

This is the same as setting the environmental variable MPICH_USE_SHLIB to yes.

[IntelMPI] Use static linking with the Intel MPI library.
-show [MPICH, IntelMPI, OpenMPI] Show the commands that would be used, without running them.
[OpenMPI] Show the commands and compiler flags that would be used for compiling and linking.

Running MPI programs


The old-school mpirun:
mpirun -v -machinefile <machine file name> -np <#processes> ./a.out
Open MPI's mpirun (and mpiexec) is a symbolic link to its own job launcher, orterun, so it accepts additional options, such as -npernode (specify the number of processes per node), -x ENV1 (pass the environmental variable ENV1 to the MPI job), and -bind-to-core (CPU-process pinning/affinity). In addition, Open MPI can obtain the host names of nodes from many batch job schedulers (PBS, LSF, etc.), so the -machinefile and -np flags are generally not needed.


The mpd (multi-purpose daemon) approach:
mpdboot -v -n <#mpd's> -f <machine file name> -r <ssh or rsh>
mpiexec -np <#processes> ./a.out
where <#mpd's> is usually the number of unique host names in <machine file name>, e.g. `sort machinefile|uniq|wc -l`

In many cases one might add -genvall flag to mpiexec ... command line to pass all environmental variables in the current environment to the MPI job, or use -genvlist=ENV1,ENV2,.. to pass only environmental variables ENV1, ENV2, ... to the MPI job. These options are especially useful when the MPI executable needs environmental variables such as LD_LIBRARY_PATH to function properly.

Having the ability to pass environmental variables to the MPI job is one of the major differences between the old-school mpirun and the newer mpiexec.
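For reference, here is a minimal MPI program in C (file and variable names are arbitrary) that can be launched with any of the methods above to verify that the launcher and machine file are working:

```c
/* hello_mpi.c
   Compile:  mpicc hello_mpi.c -o hello_mpi
   Run:      mpirun -np 4 ./hello_mpi   (or mpiexec -np 4 ./hello_mpi) */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, len;
    char host[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this process's rank */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total number of processes */
    MPI_Get_processor_name(host, &len);     /* which node we landed on */

    printf("rank %d of %d on %s\n", rank, size, host);

    MPI_Finalize();
    return 0;
}
```

Each rank should print one line; if ranks land on unexpected nodes, check the machine file and the scheduler integration.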

Intel MPI

Intel MPI is based on MPICH2 and mpd:
source $INTEL_MPI/bin/
mpdboot -v -n <#mpd's> -f <machine file name> -r <ssh or rsh>
mpiexec -ppn <#processes per machine> -np <#processes> ./a.out

Intel MPI can automatically choose the appropriate fabric to use. By default, it uses DAPL and falls back to TCP if DAPL is not available. However, in some cases one might want to override this setting: to force Intel MPI to use InfiniBand only, add -IB before the -n option; to use Myrinet MX only, use -MX.

If -MX is used, Intel MPI will use TMI (Tag Matching Interface) as opposed to DAPL. Intel MPI will then seek configuration info from /etc/tmi.conf or from whatever the environmental variable TMI_CONFIG (see below) points to.

Troubleshooting mpdboot

mpirun.ch_mx command-line options

mpirun.ch_mx accepts Myrinet-specific command-line options:

--mx-no-shmem: Disable the shared-memory-based intra-node communication.
--mx-recv mode: Set mode to blocking to let processes wait for incoming messages without polling.

MPICH2 mpiexec environmental variables

Some of mpiexec's command-line options can also be specified via environmental variables:

MPICH_DBG: Display debugging information.

To use this feature, MPICH2 must be configured and compiled with the --enable-g option.

MPICH_DBG_LEVEL: Debugging verbosity level; can be TERSE, TYPICAL, or VERBOSE.
MPICH_NO_LOCAL: Set to 1 to not use shared memory for intra-node communication.
MPICHD_DBG_PG: Set to YES to enable MPD to display debugging information.
MPIEXEC_BNR: Set to 1 to use the MPICH1 legacy BNR interface to spawn and manage processes.

Trivia: BNR stands for Bill, Brian, Nick, Ralph, and Rusty, who were the initial MPICH2 developers of the interface.

MPIEXEC_GDB: Set to 1 to run the MPI job in GDB.
MPIEXEC_MACHINEFILE: Set to the file name containing the list of hosts.
MPIEXEC_MERGE_OUTPUT: Set to 1 to merge output when running the MPI job in GDB.
MPIEXEC_RSH: rsh command to use; default is ssh -x.
MPIEXEC_SHOW_LINE_LABELS: Set to 1 to show the MPI rank in each line of the output.
MPIEXEC_TIMEOUT: Job timeout (in seconds).
MPIEXEC_TOTALVIEW: Set to 1 to run the MPI job in TotalView.

MVAPICH2 mpiexec environmental variables (version 1.5)

Some of mpiexec's command-line options can also be specified via environmental variables:

MV2_CPU_MAPPING: Provide more fine-grained control of CPU-process pinning/affinity.

Note that neither MV2_ENABLE_AFFINITY nor MV2_USE_SHARED_MEM should be set to 0 if MV2_CPU_MAPPING is used.

MV2_CPU_BINDING_POLICY: Provide fine-grained control of CPU-process pinning/affinity.

Its value is either bunch (default) or scatter.

MV2_DAPL_PROVIDER: If your MVAPICH2 is built with uDAPL-CH3, this variable specifies the uDAPL-CH3 library to use. It can take the following values:
  • ib0
  • ccil
  • gmg2
  • ibd0
MV2_ENABLE_AFFINITY: Set to 0 to disable CPU-process pinning/affinity. Default is 1.
MV2_USE_XRC: Set to 1 to enable eXtended Reliable Connection (XRC), a transport layer in Mellanox HCAs that improves scalability in multi-core environments. It allows a single receive QP to be shared by multiple SRQs across one or more processes.

(Your MVAPICH2 must be built with XRC support to use this feature.)

Intel MPI environmental variables

(See here for special features and known issues)

I_MPI_ADJUST_coll: Select the algorithm for collective operation coll, as well as the message-size ranges over which that algorithm takes effect.
I_MPI_PLATFORM: Select the set of algorithms and their parameters for collective operations. This environmental variable was added, undocumented, in version 4.0 and could change in the future (in 4.1 it is documented). To see its effect, set the I_MPI_DEBUG environmental variable to 6 and run the MPI program (recompiling the MPI program with -g yields even more verbose output); one then sees something like this:
[0] MPI startup(): Recognition level=0. Platform code=1. Device=5
and the algorithms for collective operations (check Intel MPI reference manual for their meanings):
[0] MPI startup(): Allreduce: 3: 2097152-2147483647 & 512-2147483647
[0] MPI startup(): Allreduce: 4: 524288-2147483647 & 32-2147483647
[0] MPI startup(): Allreduce: 5: 0-2147483647 & 0-2147483647
For version 4.0.0, the value of this variable can be
  • auto: Recognition level=2.
  • trust: Recognition level=1, Platform code=1
  • generic: Recognition level=0, Platform code=1
  • zero: Recognition level=0, Platform code=-1 (What this really means is regardless of message sizes or MPI ranks, the same baseline algorithm is used.)
  • htn: (Harpertown) The same as generic
  • nhm: (Nehalem) Recognition level=0, Platform code=3
For version 4.0.1, the value of this variable can be
  • auto: Recognition level=2. Use this value when running on heterogeneous nodes, e.g. Westmere and Harpertown nodes, as suggested here.
  • generic: Recognition level=0, Platform code=1
  • zero: Recognition level=0, Platform code=-1
  • harpertown: The same as generic
  • nehalem: Recognition level=0, Platform code=4
  • westmere: Recognition level=0, Platform code=2
For version 4.0.3, the value of this variable can be
  • generic
  • harpertown
  • hetero
  • homo
  • htn (Harpertown)
  • nehalem
  • nhm (Nehalem)
  • snb (Sandy Bridge)
  • westmere
  • wsm (Westmere)
For version 4.1.0, the value of this variable can be
  • auto: (Default)
  • auto:max: For the newest supported Intel architecture processors across all nodes.
  • auto:most: For the most numerous Intel architecture processors across all nodes.
  • generic: Select the algorithm optimized for Harpertown.
  • harpertown
  • hetero
  • homo
  • hsw (Haswell)
  • htn (Harpertown)
  • ivb (Ivy Bridge)
  • knc (Knight's Corner)
  • nehalem
  • nhm (Nehalem)
  • none: Like zero above.
  • sandybridge
  • snb (Sandy Bridge)
  • uniform: Select algorithm optimized for local processors. The behavior is unpredictable if nodes are heterogeneous.
  • westmere
  • wsm (Westmere)
I_MPI_DAPL_PROVIDER: Select the particular DAPL provider.

See here for details.

By default, Intel MPI will try to load DAPL provider information from /etc/dat.conf.

I_MPI_DEBUG: Set to 6 to display most debugging information; set to 9 to also display the values of many undocumented environmental variables (those with the I_MPI_INFO_ prefix).

Compiling the code with the -g compiler option enables even more detailed debugging information (because the Intel MPI compiler wrapper then links the user program against the debugging version of the Intel MPI library instead of the standard one).

I_MPI_PRINT_ENV: (Version 4.0.1) Set to 1 to display the values of Intel MPI environmental variables.
I_MPI_DEVICE: Select the particular network device to be used, e.g. shm, rdma, rdssm, sock, ssm.
I_MPI_DYNAMIC_CONNECTION: Set to 1 so that all connections are established at the time of the first communication between each pair of processes.

Default is 1 if the number of processes is greater than 64, and 0 otherwise.

Set to the number of bytes that marks the threshold between the eager and the rendezvous protocols.

Default is 262144 (256 KB)

Select the parallel file system to use for MPI-IO. Set I_MPI_EXTRA_FILESYSTEM to on and I_MPI_EXTRA_FILESYSTEM_LIST to one of the following values:
  • panfs (for Panasas)
  • pvfs2 (for PVFS2)
  • lustre (for Lustre)
I_MPI_FALLBACK: Set to 0 to not fall back to the first available network, i.e. if an attempt to initialize the specified network fails, the job terminates immediately. Default is 1.
I_MPI_FABRICS: Select the particular network fabrics to be used, e.g. shm, dapl, tcp, tmi (Tag Matching Interface, e.g. QLogic PSM and Myricom MX), ofa.
I_MPI_PIN: Set to 0 to disable CPU-process pinning/affinity. Default is 1.
Provide more fine-grained control of CPU-process pinning/affinity and interoperability with OpenMP.
I_MPI_SCALABLE_OPTIMIZATION: Set to 0 to disable scalable optimization.

Default is 1 if the number of processes is greater than 16, and 0 otherwise.

I_MPI_SHM_BYPASS: Set to 1 to disable the shared-memory-based intra-node communication.

Default is 0.

I_MPI_SPIN_COUNT: Set to the loop spin count used when polling the network.

Default value is 1 when more than one process runs per processor/core; otherwise the value is 250.

The polling loop spins this many times before releasing the processes if no incoming messages arrive for processing. Smaller values cause the MPI library to release the processor more frequently.

Use the built-in statistics-gathering facility to collect performance data.
I_MPI_TIMER_KIND: Select the timer routine for MPI_Wtime and MPI_Wtick calls, e.g. gettimeofday (default) or rdtsc.
I_MPI_WAIT_MODE: Set to 1 to enable process waiting for receiving messages without polling.

Default is 0 (i.e. use polling).

Caveat: Unless one has a good reason, do not change the default setting, especially on Myrinet MX, or MPI collective operations could get stuck. The fix is to use TMI as opposed to the (default) DAPL. See the TMI_CONFIG environmental variable below.

TMI_CONFIG: If one selects the Tag Matching Interface (e.g. by specifying the -MX or -PSM switch on the mpiexec command line), then by default Intel MPI seeks TMI configuration info from /etc/tmi.conf, and this environmental variable can be set to point to the TMI config file one wishes to use.

One can find an example TMI config file at $I_MPI_ROOT/etc64/tmi.conf, which can be used without modification in most cases. See here for details.

TMI_DEBUG: Set to a positive integer to display TMI debugging information.

Open MPI environmental variables

Open MPI has many MCA (Modular Component Architecture) parameters, which can be set in several different ways (e.g. on the mpirun command line, through environmental variables, or in configuration files), with each approach overriding the later ones:

For example, to use the Checksum component in PML (Point-To-Point Management Layer) framework, use:

mpirun --mca pml csum ...

To see the default MCA parameters:

ompi_info --param all all

Some useful environmental variables:

OPAL_PREFIX: Specify the Open MPI installation root directory.
OMPI_MCA_mca_verbose: Set to 1 for verbose mode.
OMPI_MCA_mpi_show_mca_params: Set to 1 to show all MCA parameter values.
OMPI_MCA_mpi_yield_when_idle: Set to any nonzero value for Degraded performance mode, or 0 for Aggressive performance mode.
OMPI_MCA_mpi_paffinity_alone: Set to any nonzero value to enable CPU-process pinning/affinity.
OMPI_MCA_mpi_leave_pinned: Set to 1 to enable aggressive pre-registration of user message buffers for RDMA-enabled network fabrics.

See here for details.

OMPI_MCA_mpi_leave_pinned_pipeline: Set to 1 to enable pre-registration of user message buffers for the RDMA pipeline protocol.

See the article High performance RDMA protocols in HPC by T.S. Woodall et al for details.

OMPI_MCA_btl: Open MPI can detect the network fabric and choose the best communication protocols for it. To override this, set OMPI_MCA_btl to the following values. Multiple values may be specified, separated by commas, e.g. export OMPI_MCA_btl='gm,sm,self'.
  • gm: Myrinet GM
  • mx: Myrinet Express (MX)
  • ofud: OpenFabrics unreliable datagrams
  • openib: OpenFabrics InfiniBand. See the FAQ for details.
  • portals: Cray Portals
  • self: Loop-back communication. This one should always be specified, or your Open MPI can behave erratically.
  • sm: Shared memory. See the FAQ for details.
  • tcp: TCP/IP. See the FAQ for details.
  • udapl: User level DAPL. To use this, make sure you have /etc/dat.conf ready. See the FAQ for details.

SGI MPI environmental variables

MPI_DISPLAY_SETTINGS: Set to any value to display the values of SGI MPI environmental variables.
MPI_SLAVE_DEBUG_ATTACH: Set to the MPI rank of the process to which you want to attach your debugger. When this environmental variable is set, the MPI job will wait 20 seconds for you to launch the debugger. You will see something like this:
MPI rank 1 sleeping for 20 seconds while you attach the debugger.
You can use one of these debugger commands:
     gdb /proc/30766/exe 30766
     idb -pid 30766 /proc/30766/exe
MPI_DAEMON_DEBUG_ATTACH: Similar to the MPI_SLAVE_DEBUG_ATTACH environmental variable, except that the point of attachment is earlier. If MPI_SLAVE_DEBUG_ATTACH is set, the attachment occurs at:
#0  0x7f8b794fe060 in __nanosleep_nocancel () from /lib64/
#1  0x7f8b794fde9c in sleep () from /lib64/
#2  0x7f8b79c663df in slave_init () at adi.c:178
#3  fork_slaves () at adi.c:295
#4  MPI_SGI_create_slaves () at adi.c:419
#5  MPI_SGI_init () at adi.c:644
#6  0x7f8b7a165df8 in call_init () from /lib64/
On the other hand, for MPI_DAEMON_DEBUG_ATTACH, the attachment point occurs at an earlier moment:
#0  0x7fb97e8a1060 in __nanosleep_nocancel () from /lib64/
#1  0x7fb97e8a0e9c in sleep () from /lib64/
#2  0x7fb97f008a16 in MPI_SGI_init () at adi.c:450
#3  0x7fb97f508df8 in call_init () from /lib64/

MPI point-to-point communications

Thus, there are three levels of perspectives:

User program: Blocking? Deadlock? Buffer reuse?
MPI protocol: Matching receive posted (synchronous)? Matching receive must be posted first? Buffer reuse?
MPI implementation: Eager or rendezvous?

Send modes

For each mode: what it blocks until, whether it is synchronous, whether the buffer can be reused, whether a matching receive must be posted first, and a short analysis.

MPI_Send: Blocks until the buffer is safe to reuse (which does not mean the message has been received or that a matching receive has been posted). Synchronous: N/A. Buffer reuse: N/A. Matching receive first: No. Strikes a balance between MPI_Ssend and MPI_Rsend.
MPI_Ssend: Blocks until a matching receive is posted (S = synchronous). Synchronous: Yes. Buffer reuse: No. Matching receive first: No. Greatest overhead, but safe.
MPI_Bsend: Blocks until the data is copied to the buffer attached with MPI_Buffer_attach (B = buffered). Synchronous: N/A. Buffer reuse: Yes (must call MPI_Buffer_attach first). Matching receive first: No. Decouples sender and receiver.
MPI_Rsend: Blocks until the data transfer completes. Synchronous: Yes. Buffer reuse: N/A. Matching receive first: Yes (R = ready). Least overhead, but causes an error if the matching receive is not posted yet.
MPI_Isend: Never blocks (I = immediate). Synchronous: N/A. Buffer reuse: No. Matching receive first: No.
MPI_Issend: Never blocks. Synchronous: Yes. Buffer reuse: No. Matching receive first: No.
MPI_Ibsend: Never blocks. Synchronous: N/A. Buffer reuse: Not until tested by MPI_Wait/MPI_Test. Matching receive first: No.
MPI_Irsend: Never blocks. Synchronous: Yes. Buffer reuse: No. Matching receive first: Yes.
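As a sketch of the MPI_Bsend row above (message value, tag, and sizes are arbitrary; assumes at least two processes), buffered send requires attaching a user buffer first:

```c
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, size, data = 42;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (size >= 2 && rank == 0) {
        /* The attached buffer must hold the message plus MPI_BSEND_OVERHEAD. */
        int bufsize = sizeof(int) + MPI_BSEND_OVERHEAD;
        void *buf = malloc(bufsize);
        MPI_Buffer_attach(buf, bufsize);

        /* Returns as soon as the message is copied into the attached buffer;
           no matching receive needs to be posted yet. */
        MPI_Bsend(&data, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);

        /* Detaching blocks until all buffered messages have been delivered. */
        MPI_Buffer_detach(&buf, &bufsize);
        free(buf);
    } else if (size >= 2 && rank == 1) {
        MPI_Recv(&data, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", data);
    }

    MPI_Finalize();
    return 0;
}
```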

The receiver side is simple: there are only four APIs: MPI_Recv, MPI_Irecv, MPI_Probe, and MPI_Iprobe.

The purpose of MPI_Irecv is really to post a matching receive.

MPI_Probe can check if there is any receivable message that matches the specified source, tag, and comm.
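A common use of MPI_Probe is to receive a message whose size is unknown in advance. A sketch (source, tag, and message contents are arbitrary; assumes at least two processes):

```c
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (size >= 2 && rank == 0) {
        int msg[3] = {1, 2, 3};            /* size unknown to the receiver */
        MPI_Send(msg, 3, MPI_INT, 1, 7, MPI_COMM_WORLD);
    } else if (size >= 2 && rank == 1) {
        MPI_Status st;
        int count;

        /* Block until a message matching (source, tag, comm) is available,
           without actually receiving it. */
        MPI_Probe(0, 7, MPI_COMM_WORLD, &st);
        MPI_Get_count(&st, MPI_INT, &count);   /* learn the message size */

        int *buf = malloc(count * sizeof(int));
        MPI_Recv(buf, count, MPI_INT, 0, 7, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("received %d ints\n", count);
        free(buf);
    }

    MPI_Finalize();
    return 0;
}
```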

Test for communication completion

For all those MPI_I* APIs, one gets back an MPI_Request handle. Both the sender and the receiver can then use the following APIs on this handle to test for communication completion:
MPI_Test       MPI_Wait
MPI_Testsome   MPI_Waitsome
MPI_Testall    MPI_Waitall
MPI_Testany    MPI_Waitany
MPI_Test* are non-blocking while MPI_Wait* are blocking.

MPI_*some is between the two extremes MPI_{Test|Wait} and MPI_*all.

Caveat: For the sender, send-operation completion only means the buffer is ready for reuse; it does not mean the data has been received.
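A typical nonblocking exchange using MPI_Waitall, sketched here as a ring pattern (variable names and the tag are illustrative):

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, sendval, recvval;
    MPI_Request reqs[2];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int right = (rank + 1) % size;
    int left  = (rank - 1 + size) % size;
    sendval = rank;

    /* Post both operations, then wait for both to complete.
       Neither call blocks, so there is no deadlock regardless of order. */
    MPI_Irecv(&recvval, 1, MPI_INT, left,  0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(&sendval, 1, MPI_INT, right, 0, MPI_COMM_WORLD, &reqs[1]);
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);

    /* Per the caveat above: completion of the Isend only means sendval
       is reusable, not that the neighbor has received it. */
    printf("rank %d got %d from rank %d\n", rank, recvval, left);

    MPI_Finalize();
    return 0;
}
```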

Persistent communication mode

One can group many point-to-point communications into a batch and initiate the transfers in one shot. The relevant APIs are:
MPI_Send_init  MPI_Bsend_init MPI_Rsend_init MPI_Ssend_init
MPI_Start      MPI_Startall
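A sketch of persistent communication: the transfer is set up once and restarted each iteration. (MPI_Recv_init, the receive-side counterpart of the send inits listed above, is used here; all values are illustrative.)

```c
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size, iter, val = 0;
    MPI_Request req;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (size < 2) { MPI_Finalize(); return 0; }

    /* Set up the communication once... */
    if (rank == 0)
        MPI_Send_init(&val, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, &req);
    else if (rank == 1)
        MPI_Recv_init(&val, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &req);

    /* ...then restart it every iteration without re-specifying arguments. */
    for (iter = 0; iter < 10; iter++) {
        if (rank == 0) val = iter;
        if (rank <= 1) {
            MPI_Start(&req);
            MPI_Wait(&req, MPI_STATUS_IGNORE);
        }
    }

    if (rank <= 1) MPI_Request_free(&req);
    MPI_Finalize();
    return 0;
}
```

This avoids repeating the argument-processing overhead of MPI_Send/MPI_Recv in tight communication loops.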

Dead lock scenarios and avoidance

            Process 1              Process 2
Deadlock 1  Recv, then Send        Recv, then Send
Deadlock 2  Send, then Recv        Send, then Recv
Solution 1  Send, then Recv        Recv, then Send
Solution 2  MPI_Sendrecv           MPI_Sendrecv
Solution 3  Irecv/Send, then Wait  Irecv/Send, then Wait
Solution 4  Bsend, then Recv       Bsend, then Recv
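Solution 2 above can be sketched as a ring exchange with MPI_Sendrecv (neighbor computation and tag are illustrative):

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, sendbuf, recvbuf;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int right = (rank + 1) % size;
    int left  = (rank - 1 + size) % size;
    sendbuf = rank;

    /* Each process sends to its right neighbor and receives from its left.
       MPI_Sendrecv pairs the two internally, so the ordering deadlocks
       in the table above cannot occur. */
    MPI_Sendrecv(&sendbuf, 1, MPI_INT, right, 0,
                 &recvbuf, 1, MPI_INT, left,  0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    printf("rank %d received %d\n", rank, recvbuf);
    MPI_Finalize();
    return 0;
}
```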

MPI-2 one-sided point-to-point communications

The above point-to-point communications are two-sided, i.e. they require matching operations on both sender and receiver sides. MPI-2 introduces one-sided point-to-point communications to facilitate remote memory accesses (RMA), which are essential for partitioned global address space languages. Unfortunately, MPI-2's RMA semantics are unintuitive and difficult to use.


First, each process needs to designate a memory region (called a window in MPI-2 parlance) for RMA. This is achieved by calling the collective function MPI_Win_create.
This function associates a memory region with a communicator. The sizes of the memory regions can differ from process to process, but it is important that all participating processes call it, even a process that only offers memory for others to access and does not itself access other processes' memory.

The "destructor" of MPI_Win_create is MPI_Win_free.



Three RMA operations are supported: MPI_Put, MPI_Get, and MPI_Accumulate.
MPI_Accumulate is an atomic operation for remote update (get-modify-put). However, the Modify operations it supports are limited to the predefined reduction operations such as MPI_SUM, MPI_PROD, MPI_MAX, MPI_LAND, etc.; user-defined reduction operations are not supported.

It is important to note that these operations are non-blocking. It makes sense that MPI_Put and MPI_Accumulate are non-blocking, but to understand why MPI_Get is also non-blocking, read on.


Now comes the most criticized and confusing part of MPI-2 RMA. The inclusion of synchronization in MPI-2 RMA seems designed to facilitate MPI implementers, not MPI users.

As mentioned earlier, all RMA operations are non-blocking, which means that, as with the MPI_I* functions, there must be accompanying synchronization functions akin to MPI_Wait. And as in the MPI_I* case, RMA operations cannot appear just anywhere in the program; they must be surrounded by these synchronization functions. The span between a pair of consecutive synchronization calls is called an epoch in MPI-2 parlance.

Passive target

This synchronization mode is the easiest to understand and the most intuitive (closest to a shared-memory model). Passive means the target process (the one whose memory others will access) does not need to do anything; it makes no synchronization calls at all. Only the originating processes need to synchronize, using the following pair of calls:
MPI_Win_lock(lockType, targetRank, hint, assert, win)
MPI_Win_unlock(targetRank, win)
where lockType can be MPI_LOCK_SHARED or MPI_LOCK_EXCLUSIVE.

assert is usually 0; it can be set to MPI_MODE_NOCHECK, which asserts that no other process will hold or attempt to acquire a conflicting lock.

It is important to note that the lock granularity is the entire window.
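A sketch of a passive-target update (assumes at least two processes; only the origin makes synchronization calls, and the target rank and value are illustrative):

```c
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size;
    int window_mem = 0;            /* each process exposes one int */
    MPI_Win win;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Collective: every process designates its window, even pure targets. */
    MPI_Win_create(&window_mem, sizeof(int), sizeof(int),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    if (size >= 2 && rank == 0) {
        int val = 99;
        /* Rank 1 (the target) does nothing; the lock covers the whole window. */
        MPI_Win_lock(MPI_LOCK_EXCLUSIVE, 1 /*target*/, 0 /*assert*/, win);
        MPI_Put(&val, 1, MPI_INT, 1 /*target*/, 0 /*displacement*/,
                1, MPI_INT, win);
        MPI_Win_unlock(1, win);    /* the Put is guaranteed complete here */
    }

    MPI_Win_free(&win);            /* collective "destructor" */
    MPI_Finalize();
    return 0;
}
```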

Active target, coarse-grained

Active means the target process also needs to call synchronization functions, even if it does not itself access others' windows. In coarse-grained mode, all involved processes must make the same number of the following calls:
MPI_Win_fence(assert, win)
Recall that a window is associated with a communicator, so all processes in the communicator, whether they are target or originating processes, must perform the same number of MPI_Win_fence calls. The epochs are the periods between consecutive MPI_Win_fence calls, and RMA operations within the same epoch are synchronized.

assert is usually 0, but it can be set to MPI_MODE_NOSTORE, MPI_MODE_NOPUT, MPI_MODE_NOPRECEDE, or MPI_MODE_NOSUCCEED.
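A sketch of a fence-synchronized epoch, using MPI_Accumulate to build a shared counter on rank 0 (the counter placement and values are illustrative):

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;
    int counter = 0;                 /* each process exposes one int */
    MPI_Win win;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    MPI_Win_create(&counter, sizeof(int), sizeof(int),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    MPI_Win_fence(0, win);           /* open the epoch: everyone calls */
    int one = 1;
    /* Every process atomically adds 1 to rank 0's counter. */
    MPI_Accumulate(&one, 1, MPI_INT, 0 /*target*/, 0 /*disp*/,
                   1, MPI_INT, MPI_SUM, win);
    MPI_Win_fence(0, win);           /* close the epoch: RMA now complete */

    if (rank == 0)
        printf("counter = %d (one per process)\n", counter);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```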

Active target, fine-grained

This is the most complicated synchronization mode. The target process (only one) calls the following pair of functions to open/close an exposure epoch for the specified originating processes (plural):
MPI_Win_post(group, assert, win)
MPI_Win_wait(win)
where group is an MPI_Group object consisting of the ranks of the specified originating processes.

The originating processes (plural) then use the following pair to open/close the corresponding access epoch:

MPI_Win_start(group, assert, win)
MPI_Win_complete(win)
Within the epoch the originating processes can then perform RMA operations.
(assert can be 0, or MPI_MODE_NOSTORE, MPI_MODE_NOPUT, etc.)
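The post/wait and start/complete pairs can be sketched end to end (target rank, origin rank, and the transferred value are illustrative; MPI_Win_wait and MPI_Win_complete are the standard closing calls for the exposure and access epochs, respectively):

```c
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size, value = 0;
    MPI_Win win;
    MPI_Group world_grp, peer_grp;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_group(MPI_COMM_WORLD, &world_grp);

    MPI_Win_create(&value, sizeof(int), sizeof(int),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    if (size >= 2) {
        if (rank == 0) {                       /* target: exposure epoch */
            int origin = 1;
            MPI_Group_incl(world_grp, 1, &origin, &peer_grp);
            MPI_Win_post(peer_grp, 0, win);    /* open, to rank 1 only */
            MPI_Win_wait(win);                 /* block until origin completes */
            MPI_Group_free(&peer_grp);
        } else if (rank == 1) {                /* origin: access epoch */
            int target = 0, val = 7;
            MPI_Group_incl(world_grp, 1, &target, &peer_grp);
            MPI_Win_start(peer_grp, 0, win);   /* open, toward rank 0 only */
            MPI_Put(&val, 1, MPI_INT, 0, 0, 1, MPI_INT, win);
            MPI_Win_complete(win);             /* close; the Put is complete */
            MPI_Group_free(&peer_grp);
        }
    }

    MPI_Group_free(&world_grp);
    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```

Unlike fence, only the processes named in the groups participate, so unrelated ranks need not synchronize at all.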

Update semantics and erroneous behaviors

The following behaviors are considered erroneous (per the MPI-2 specification) and should be avoided: