ACM | Communication Management Assistant. It provides address and route (path) resolution services for InfiniBand fabrics, much like the Address Resolution Protocol (ARP). |
ADI | Abstract Device Interface (MPICH specific) |
AH | Address handle (InfiniBand specific) An object which describes the path to the remote side used in UD QP. |
AL | Access layer (InfiniBand specific) Low-level operating system infrastructure (plumbing) used for accessing the interconnect fabric (VPI, InfiniBand, Ethernet, FCoE). It includes all basic transport services needed to support upper-level network protocols, middleware, and management agents. |
ALPS | Application Level Placement Scheduler (Cray MPI specific) |
APM | Automatic Path Migration (InfiniBand specific) |
ARMCI | Aggregate Remote Memory Copy Interface |
BML | BTL Management Layer. It is a thin layer between PML and BTL in Open MPI and it provides multiplexing between multiple PMLs and multiple BTLs. |
BTL | Byte Transfer Layer. This is the fabric-specific layer in Open MPI and it is used by OB1 & DR PML. |
CH | Channel (MPICH specific) |
CM | Connection/Communication Manager (InfiniBand specific) An entity responsible for establishing, maintaining, and releasing communication for RC and UC QP service types. The Service ID Resolution Protocol enables users of UD service to locate QPs supporting their desired service. There is a CM in every IB port of the end nodes. In the context of Open MPI, CM is a PML implementation, named after Connor MacLeod from the movie Highlander. Unlike OB1, CM is designed to directly utilize the native, MPI-like tagged matching interface exposed by certain fabrics, e.g. Myrinet MX, QLogic PSM, and Cray Portals. |
CMA | Communication Manager Abstraction, which is another name of RDMA CM (OpenFabrics specific) |
CNO | Consumer Notification Object (DAPL specific) |
CQE | Completion queue entry (InfiniBand specific) An entry in the Completion Queue that describes the information about the completed WR (status, size etc.) |
CR | Connection Request |
CRCP | Checkpoint/Restart Coordination Protocol; it is a PML implementation in Open MPI. |
CXGB | Chelsio T3/T4 10 Gigabit Ethernet |
DAPL | Direct Access Programming Library |
DAT | Direct Access Transport |
DBL | Data Bypass Layer, a user-level, kernel-bypass, messaging software package for Myricom 10-Gigabit Ethernet. |
DDP | Direct Data Placement |
DDR | Double data rate (InfiniBand specific) |
DPM | Dynamic Process Management (Open MPI specific) |
DR | Data Reliability. It is one of the PML implementations in Open MPI; unlike OB1, which relies on error codes propagated from lower-level communication libraries, DR can better handle network failures. |
DTC | Data Transfer Completion (DAPL specific) |
DTO | Data Transfer Operation (DAPL specific) |
EE/EEC | End-to-end context (InfiniBand specific) |
eHCA | IBM eServer GX-based InfiniBand HCA |
EP | Endpoint |
EVD | Event Dispatcher (DAPL specific) |
FCA | Voltaire's (now Mellanox) Fabric Collective Accelerator |
FCoE | Fibre Channel over Ethernet |
FMS | Fabric Management System (Myrinet specific) |
GbE | Gigabit Ethernet |
GM | Myrinet Generic Messages protocol |
GRU | Global Reference Unit (SGI Altix specific) |
GSI | General Services Interface |
HCA | Host Channel Adapter (InfiniBand specific) |
IA | Interface adapter |
IB | InfiniBand |
IBAL | InfiniBand Access Layer |
IBTA | InfiniBand Trade Association |
IPoIB | IP over InfiniBand |
iPath | QLogic InfiniPath HCA |
iSER | iSCSI Extensions for RDMA |
ITAPI | Interconnect Transport API, one of the APIs for InfiniBand. |
iWARP | Internet Wide Area RDMA Protocol, a competing protocol of InfiniBand |
kDAPL | Kernel level DAPL |
LID | Local identifier (InfiniBand specific) A 16-bit address assigned to end nodes by the Subnet Manager. Each LID is unique within its subnet. |
LMR | Local Memory Region |
MAD | Management Datagram (InfiniBand specific) |
MCA | Modular Component Architecture (Open MPI specific) |
MCP | Myrinet Control Program |
MLX4 | Mellanox ConnectX adapter |
MPD | Multi-Purpose Daemon (MPI specific) |
MPID | A name space in MPICH source code tree. Functions in this name space are hardware-dependent (D for Device) |
MPIR | A name space in MPICH source code tree. Functions in this name space are hardware-independent (R for Runtime) |
MPT | Message Passing Toolkit |
MR | Memory region (InfiniBand specific) A set of memory buffers which have already been registered with access permissions. These buffers need to be registered in order for the network adapter to make use of them. During registration, an lkey and an rkey are created and associated with the memory region. |
MSI | Message signaled interrupt |
MTHCA | Mellanox InfiniHost device |
MTL | Matching Transport Layer (used by the "CM" PML implementation in Open MPI) |
MX | Myrinet Express protocol |
MXM | MellanoX Messaging |
NES | Intel-NetEffect Ethernet Cluster Server adapter |
OB1 | A PML implementation in Open MPI; its focus is high performance, and it will use RDMA if available. The name is inspired by Star Wars's Obi-Wan. Open MPI also has a BML implementation called R2. |
ODLS | ORTE Daemon Launch System (Open MPI specific) |
OFA | Open Fabrics Alliance (was OpenIB) |
OFED | Open Fabrics Enterprise Distribution |
OMPI | Open MPI |
OPA | Open Portable Atomics (MPICH specific) |
OPAL | Open Portability Access Layer (Open MPI specific) |
OpenPA | Open Portable Atomics (MPICH specific) |
ORTE | Open Run-Time Environment (Open MPI specific) |
OpenIB | Open Fabrics Alliance |
OSC | One Sided Communication (Open MPI specific) |
P4 | Portable Programs for Parallel Processors protocol (MPICH specific) |
PD | Protection domain (InfiniBand specific) An object whose components can interact only with each other. AHs interact with QPs, and MRs interact with WQs. |
PIO | Programmed I/O |
PLS | Process Launch System (Open MPI specific) |
PMA | Performance Management Agent (InfiniBand specific) |
PMI | Process Management Interface (MPICH2 specific) |
PML | Point-To-Point Management Layer. As opposed to BTL, PML is the fabric-independent layer in Open MPI. |
PP | Per peer. In Open MPI, each QP can be configured to use either an SRQ or per-peer (PP) receive buffers posted directly to the QP. |
PSM | QLogic Performance Scaled Messaging |
PSP | Public Service Point (DAPL specific) |
PTL | Portals, the lowest-level network transport layer on Cray platform |
PZ | Protection Zone (DAPL specific) |
QDR | Quad data rate (InfiniBand specific) |
QP | Queue pair (InfiniBand specific) The pair (Send Queue and Receive Queue) of independent WQs packed together in one object for the purpose of transferring data between nodes of a network. Posts are used to initiate the sending or receiving of data. There are three types of QP: UD, UC, and RC. |
RAS | Resource Allocation Subsystem, which queries the batch job system to determine what resources have been allocated to the MPI job; it is an MCA component in Open MPI |
RC | Reliable Connection (InfiniBand specific) A QP Transport service type based on a connection oriented protocol. A QP is associated with another single QP. The messages are sent in a reliable way (in terms of the correctness of order and the information.) |
RCache | Registration Cache, which allows memory pools to cache registered memory for later operations. |
RD | Reliable Datagram (InfiniBand specific) |
RDMA | Remote Direct Memory Access |
RDS | Reliable Datagram Socket |
RMA | Remote Memory Access |
RMAPS | Resource MAPping Subsystem, which maps MPI processes to specific nodes/cores; it is an MCA component in Open MPI |
RMC | Reliable Multicast (InfiniBand specific) |
RML | Runtime Messaging Layer communication interface, which provides basic point-to-point communication between ORTE processes; it is an MCA component in Open MPI |
RMR | Remote Memory Region |
RNR | Receiver not ready (InfiniBand specific) The flow in an RC QP where there is a connection between the sides but a Receive Request is not present on the receive side. |
RO | Remote Operation (InfiniBand specific) |
RoCEE | RDMA over Converged Enhanced Ethernet |
RQ | Request Queue |
RSP | Reserved Service Point (DAPL specific) |
RTR | Ready to Receive (InfiniBand specific) |
RTS | Ready to Send (InfiniBand specific) |
SA | Subnet Administrator (InfiniBand specific) |
SCI | Scalable Coherent Interface |
SCM | uDAPL socket based CM (DAPL/OpenFabrics specific) |
SDP | Sockets Direct Protocol, which is a protocol over an RDMA fabric (usually InfiniBand) to support stream sockets (SOCK_STREAM) |
SGE | Scatter/Gather Element (InfiniBand specific) An entry pointing to all or part of a local registered memory block. The element holds the start address of the block, its size, and the lkey (with its associated permissions). |
SHMEM | Shared Memory |
SM | Shared Memory (Intel MPI specific), or Subnet Manager (InfiniBand specific) |
SMA | Subnet Manager Agent (InfiniBand specific) |
SMI | Subnet Management Interface |
SNAPC | Snapshot Coordination; an MCA component in Open MPI |
SR | Send Request (InfiniBand specific) A WR which was posted to an SQ and which describes how much data is going to be transferred, its direction, and the way it is transferred (specified by the opcode) |
SRP | SCSI RDMA Protocol |
SRQ | Shared Receive Queue (InfiniBand specific) A queue which holds WQEs for incoming messages from any RC/UD QP associated with it. More than one QP can be associated with one SRQ. |
TCA | Target Channel Adapter (InfiniBand specific) A Channel Adapter that is not required to support Verbs, usually used in I/O devices |
TMI | Tag Matching Interface, e.g. QLogic PSM and Myricom MX |
TOE | TCP Offload Engine |
UC | Unreliable Connection (InfiniBand specific) A QP transport service type based on a connection oriented protocol, where a QP is associated with another single QP. The QPs do not execute a reliable protocol and messages can be lost. |
UCM | uDAPL unreliable datagram based CM (DAPL/OpenFabrics specific) |
UD | Unreliable Datagram (InfiniBand specific) A QP transport service type in which messages can be at most one packet long and every UD QP can send/receive messages from another UD QP in the subnet. Messages can be lost and their order is not guaranteed. UD QP is the only type which supports multicast messages. |
UDA | Mellanox's Unstructured Data Accelerator |
uDAPL | User level DAPL |
ULP | Upper Layer Protocol (InfiniBand specific) |
V | Message logging and replay protocol; it is a PML implementation in Open MPI and is based on MPICH-V (V for Volatility). |
VAPI | Mellanox's Verbs API for InfiniBand. |
VI | Virtual Interface |
VIA | Virtual Interface Architecture |
VIPL | VI Provider Library |
VMA | Voltaire Messaging Accelerator |
VPI | Virtual Protocol Interconnect. It allows the user to change the layer 2 protocol of the port. |
WQE | Work Queue Element (InfiniBand specific) |
WR | Work request (InfiniBand specific) A request which was posted by a user to a work queue |
XRC | eXtended Reliable Connection. It is a new transport layer introduced by Mellanox to improve scalability in multi-core environments. It allows one single receive QP to be shared by multiple SRQs across one or more processes. |
XPMEM | Cross Partition Memory (SGI Altix specific) |
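Several of the entries above (PD, MR, and the lkey/rkey pair) come together in libibverbs' memory registration call. The following is a minimal sketch, assuming libibverbs is installed and an active HCA is present; it will not run without InfiniBand hardware, and error handling is abbreviated:

```c
#include <stdio.h>
#include <stdlib.h>
#include <infiniband/verbs.h>

int main(void)
{
    /* Open the first available InfiniBand device. */
    struct ibv_device **dev_list = ibv_get_device_list(NULL);
    if (!dev_list || !dev_list[0]) {
        fprintf(stderr, "no InfiniBand device found\n");
        return 1;
    }
    struct ibv_context *ctx = ibv_open_device(dev_list[0]);

    /* A PD groups resources (QPs, MRs, AHs) that may interact. */
    struct ibv_pd *pd = ibv_alloc_pd(ctx);

    /* Register a buffer; the returned MR carries the lkey/rkey. */
    void *buf = malloc(4096);
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, 4096,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_READ |
                                   IBV_ACCESS_REMOTE_WRITE);
    printf("lkey=0x%x rkey=0x%x\n", mr->lkey, mr->rkey);

    ibv_dereg_mr(mr);
    free(buf);
    ibv_dealloc_pd(pd);
    ibv_close_device(ctx);
    ibv_free_device_list(dev_list);
    return 0;
}
```

Compile with -libverbs. The lkey is quoted in local SGEs, while the rkey is handed to remote peers for RDMA operations.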
MX_STATS | (libmyriexpress must be compiled with --enable-stats when running the configure script) Set to 1 to enable reporting of statistics when the endpoint is closed. If the MPI library does not close the endpoint, nothing will be printed out. |
MX_DEBUG_FILE | Output statistics and all debugging information to this file |
MX_VERBOSE | Set to a non-zero value to display debugging information. The bigger the value, the more verbose. |
MX_MONOTHREAD | Set to 1 to use a single thread. Default is 0 |
MX_ZOMBIE_SEND | Set to 0 to ensure a message is only reported as completed when it has safely reached its destination. |
MX_NO_MYRINET | Set to 1 to allow intra-node only communication |
MX_DISABLE_SELF | Set to 1 to disable self-communication channel |
MX_DISABLE_SHMEM | Set to 1 to disable shared-memory channel |
MX_RCACHE | Set to 0 to disable registration cache. This is related to bandwidth performance tuning. |
MX_CPUS | Set to a non-zero value to enable CPU-process affinity |
MX_IMM_ACK | Set to 1 to enable immediate ACK. Default is 4 |
MX_CSUM | Set to 1 to enable checksum |
To compile libmyriexpress, one needs to run the configure script as:
./configure --with-linux=/usr/src/kernels/2.6.xx --with-fms-run=/var/run/fms --with-fms-server=<hostname>
There are other features one can enable, such as --enable-stats, --enable-thread-safety, etc.
Both mx_isend and mx_issend use an internal function mx__post_send (in libmyriexpress/mx__post_send.c) to post the send request to the request queue. The messages are classified according to their sizes: MX__REQUEST_TYPE_SEND_TINY, MX__REQUEST_TYPE_SEND_SMALL, MX__REQUEST_TYPE_SEND_MEDIUM, and MX__REQUEST_TYPE_SEND_LARGE.
MX also has a function mx__luigi to enable progress of request queues. It has a corresponding API called mx_progress.
The actual send is done via memory mapped I/O. The message is copied to a memory region using function mx_copy_ureq and the device driver in the kernel will pick it up.
There is also an ioctl interface for libmyriexpress to communicate with the device driver. First, mx__open will try to open the following device files with O_RDWR access:
/dev/mxp* /dev/mx*
(mxp* are privileged devices)
Then MX uses mx__ioctl to send commands to the device driver. These low-level commands can be found in libmyriexpress/mx__driver_interface.c and common/mx_io_impl.h. The common commands used by Intel MPI are
MX_APP_WAIT, MX_ARM_TIMER, MX_GET_BOARD_TYPE, MX_GET_COPYBLOCKS, MX_GET_INSTANCE_COUNT, MX_GET_MAX_ENDPOINTS, MX_GET_MAX_PEERS, MX_GET_MAX_RDMA_WINDOWS, MX_GET_MAX_SEND_HANDLES, MX_GET_MEDIUM_MESSAGE_THRESHOLD, MX_GET_NIC_ID, MX_GET_SMALL_MESSAGE_THRESHOLD, MX_GET_VERSION, MX_NIC_ID_TO_PEER_INDEX, MX_PEER_INDEX_TO_NIC_ID, MX_SET_ENDPOINT, MX_WAKE
In libmyriexpress/mx__lib_types.h there is a description of "life of a request". It is copied here:
4,8: mx__post_send calls mx__ptr_stack_pop to get room to write the data.
5,9: mx__process_events calls mx__ptr_stack_push to release room.
Skip 1, do 2,3
4: mx_post_send calls mx__memory_pool_alloc to get room in the send copyblock
11: mx__send_acked_and_mcp_complete calls mx__release_send_medium which calls mx__memory_pool_free
If unexpected (no match found), mx__create_unexp_for_evt is called to generate an unexpected request and store it in the unexp queue
If expected (match found), the corresponding receive request is filled with the message info, and MX__REQUEST_STATE_RECV_MATCHED is set
For tiny messages, mx__process_recv_tiny copies the data from the event and calls mx__recv_complete if the message is expected
For small messages, mx__process_recv_small copies the data from the copy block and calls mx__recv_complete if the message is expected
For medium messages, mx__process_recv_medium moves the request into the multifrag_recvq if the message is expected, inserts it into the partner partialq if it is not the last fragment, and calls mx__process_recv_copy_frag
1: mx__process_recv_copy_frag copies the fragment from the copy block and calls mx__received_last_frag if it was the last fragment
2: mx__received_last_frag removes the partner partialq if there's more than one fragment, removes from the multifrag_recvq if expected, and calls mx__recv_complete
For large message, mx__process_recv_large calls mx__queue_large_recv if the message is expected (see Recv Large below)
If large, mx__queue_large_recv is called (see Recv Large below)
The warning message in question pops up when mx__regcache_works returns 0. On Linux, this means that when calling a malloc/free pair, the variable mx__hook_triggered is not set.
Registration cache checks can be disabled by setting the environmental variable MX_RCACHE to 2.
Registration cache can sometimes cause weird errors. It can be disabled by setting the environmental variable MX_RCACHE to 0.
/usr/bin/ib* /usr/sbin/ib*
ibstatus, ibstat, and ibv_devinfo should print out the device info (all of them query /sys/class/infiniband/). ibv_devinfo is particularly useful to see the health of the InfiniBand card.
ibnetdiscover should print out the whole InfiniBand fabric.
perfquery should print out the performance data and numbers of errors of the InfiniBand card.
To test point-to-point RDMA bandwidth, use ib_read_bw or ib_send_bw. First, login to node A and run ib_read_bw, which will start a server and wait for connection. Then login to node B and run
$ ib_read_bw A
and it will connect to the server at node A and report the bandwidth.
Similarly, use ib_read_lat or ib_send_lat to measure point-to-point RDMA latency.
The infiniband-diags package contains many other InfiniBand diagnostic utilities.
IBV_QPT_RC // Reliable Connection
IBV_QPT_UC // Unreliable Connection
IBV_QPT_UD // Unreliable Datagram, which supports multicast
InfiniBand actually has four Transport Service types XY, where X can be R (Reliable, meaning "acknowledged") or U (Unreliable) and Y can be C (Connection, meaning "connection-oriented") or D (Datagram).
To transmit a message, the sender places a work request (WR) on the send queue of the QP via the ibv_post_send function. Currently the supported WR types (as defined in infiniband/verbs.h) depend on QP's Transport service type, and they can be
IBV_WR_SEND
IBV_WR_SEND_WITH_IMM
IBV_WR_RDMA_READ
IBV_WR_RDMA_WRITE
IBV_WR_RDMA_WRITE_WITH_IMM
IBV_WR_ATOMIC_CMP_AND_SWP
IBV_WR_ATOMIC_FETCH_AND_ADD
and only the IBV_QPT_RC service type supports all of the above.
In addition to QPs, the participating processes must also create completion queues (CQs) so they can poll the status of WRs.
One of the idiosyncrasies of InfiniBand is that for the IBV_QPT_RC service, the receiver must post a receive request before the sender posts its send request, or the incoming message will be rejected and the sender will get a Receiver-Not-Ready (RNR) error.
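To make the ordering constraint concrete, here is a hedged fragment (not a complete program: it assumes qp, cq, mr, and buf were created earlier during the usual setup, and omits error handling) showing the receiver pre-posting a buffer and then polling the CQ:

```c
/* Sketch: pre-post a receive before the peer sends (RC QP).
 * Assumes struct ibv_qp *qp, struct ibv_cq *cq, struct ibv_mr *mr,
 * and a registered buffer buf already exist; not runnable as-is. */
struct ibv_sge sge = {
    .addr   = (uintptr_t) buf,
    .length = 4096,
    .lkey   = mr->lkey,              /* lkey of the registered region */
};
struct ibv_recv_wr rwr = { .wr_id = 1, .sg_list = &sge, .num_sge = 1 };
struct ibv_recv_wr *bad_rwr;
ibv_post_recv(qp, &rwr, &bad_rwr);   /* must happen before the send */

/* ... the sender now calls ibv_post_send() on its side ... */

/* Poll the CQ until the receive completes. */
struct ibv_wc wc;
while (ibv_poll_cq(cq, 1, &wc) == 0)
    ;                                /* busy-poll; real code may use events */
if (wc.status != IBV_WC_SUCCESS)
    fprintf(stderr, "completion error: %s\n", ibv_wc_status_str(wc.status));
```

Real applications usually avoid RNR errors by pre-posting a pool of receive buffers, or by enabling RNR retries when transitioning the QP to the RTR state.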
The first approach is the low-level, InfiniBand-specific InfiniBand Communication Manager (IBCM). IBCM is based on the InfiniBand connection state machine as defined by the IB Architecture Specification, and handles both connection establishment and service ID resolution. The receiver first needs to open an InfiniBand device via the ib_cm_open_device function, which will open the device file
/dev/infiniband/ucm*
and associate it with a service ID via the ib_cm_create_id function. Then the receiver can listen on this service ID with the ib_cm_listen function.
One can look at ibcm_component_query function in btl_openib_connect_ibcm.c in Open MPI's source tree for a complete example.
The userspace IBCM source code is libibcm.
The second approach is the RDMA Communication Manager (RDMACM), also referred to as the CMA (Communication Manager Abstraction). It is a higher-level CM that operates on IP addresses. RDMACM provides an interface that is closer to the one defined for TCP/IP sockets, and it is transport independent, allowing user access to multiple RDMA devices, such as InfiniBand HCAs and iWARP RNICs. For most developers, the RDMA CM provides a simpler interface that is sufficient for most applications. As in IBCM's case, the receiver process will access the device file
/dev/infiniband/rdma_cm
during the connection establishment.
Man pages are available for the RDMACM APIs. The userspace RDMACM source code is librdmacm.
One can check the rdmacm_component_query function in btl_openib_connect_rdmacm.c in Open MPI's source tree for a complete example.
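The client side of an RDMACM exchange can be sketched as follows (a fragment, not a runnable program: it needs <rdma/rdma_cma.h> and <netdb.h>, omits all error handling and the event-loop details, and the address 10.0.0.1 is just an example):

```c
/* Sketch of the RDMACM client side (cf. rdma_cm(7)); the passive
 * (server) side and error handling are omitted. */
struct rdma_event_channel *ec = rdma_create_event_channel();
struct rdma_cm_id *id;
rdma_create_id(ec, &id, NULL, RDMA_PS_TCP);

/* Resolve the server's IP address to an RDMA address and route. */
struct addrinfo *res;
getaddrinfo("10.0.0.1", NULL, NULL, &res);    /* example address */
rdma_resolve_addr(id, NULL, res->ai_addr, 2000 /* ms timeout */);
/* ... wait for RDMA_CM_EVENT_ADDR_RESOLVED on ec ... */
rdma_resolve_route(id, 2000);
/* ... wait for RDMA_CM_EVENT_ROUTE_RESOLVED ... */

/* Create PD/CQ/QP resources bound to the id, then connect. */
struct rdma_conn_param param = { .retry_count = 7 };
rdma_connect(id, &param);
/* ... wait for RDMA_CM_EVENT_ESTABLISHED ... */
```

Note how the calls mirror the TCP socket sequence (resolve, connect) rather than the raw InfiniBand CM state machine.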
After the connection has been established, the user program calls ibv_get_device_list which is implemented internally as __ibv_get_device_list in device.c. __ibv_get_device_list will do the following:
/sys/class/infiniband_verbs/abi_version
/etc/libibverbs.d/
Note that the /etc prefix is configurable when one runs the configure script.
Each configuration file usually contains only one line, e.g. driver mlx4. Here mlx4 is the deviceName, which will be used later.
/sys/class/infiniband_verbs/uverbs*/ibdev /sys/class/infiniband_verbs/uverbs*/abi_version
The try_driver function will then examine the node type by looking at
/sys/class/infiniband/driverName/node_type
file. In this file, CA means InfiniBand Channel Adapter and RNIC means iWARP adapter.
/sys/class/infiniband_verbs/uverbs*/device/vendor /sys/class/infiniband_verbs/uverbs*/device/device
/dev/infiniband/uverb*
for read/write access.
static struct ibv_device_ops mlx4_dev_ops = {
    .alloc_context = mlx4_alloc_context,
    .free_context  = mlx4_free_context
};
and the following assignment in the mlx4_alloc_context function
context->ibv_ctx.ops = mlx4_ctx_ops;
For example, the ibv_post_send Verbs API is simply defined as
static inline int ibv_post_send(struct ibv_qp *qp, struct ibv_send_wr *wr,
                                struct ibv_send_wr **bad_wr)
{
    return qp->context->ops.post_send(qp, wr, bad_wr);
}
One can also grep for context->ops. in verbs.c of libibverbs to see how they are used.
So how is ibv_context.ops populated? In MLX4's case, ibv_context.ops is populated by simply copying mlx4_ctx_ops, which is fixed at compile time. In the source file mlx4.c one can find the following code snippet:
static struct ibv_context_ops mlx4_ctx_ops = {
    .query_device = mlx4_query_device,
    .query_port   = mlx4_query_port,
    . . . .
    .attach_mcast = mlx4_attach_mcast,
    .detach_mcast = mlx4_detach_mcast
};
Documentation of the InfiniBand Verbs API and the RDMACM API can be found in the RDMA Aware Networks Programming User Manual from Mellanox.