
7 MPI commands and UNIX system calls in ParGAP

Sections

  1. Tutorial introduction to the MPI C library
  2. Other low level commands

This chapter can be safely ignored on a first reading, and maybe permanently. It is for application programmers who wish to develop their own low-level message-based parallel application. The additional UNIX system calls in ParGAP may also be useful in some applications.

7.1 Tutorial introduction to the MPI C library

This section lists some of the more common message passing commands, followed by a short MPI example. The next section (Other low level commands) contains more (but by no means all) of the MPI commands and some UNIX system calls. The ParGAP binding provides a simplified form that makes interactive usage easier. This section describes the original MPI binding in C, with some comments about the interactive versions provided in ParGAP. (The MPI standard includes a binding both to C and to FORTRAN.)

Even if your ultimate goal is a standalone C-based application, it is useful to prototype your application with equivalent commands executed interactively within ParGAP. Note that this distribution includes a subdirectory mpinu, which provides a subset MPI implementation in C with a small footprint. It consists of a C library, libmpi.a, and an include file mpi.h. The library is approximately 150 KB. The subdirectory can be consulted for further details.

We first briefly explain some MPI concepts.

rank:
The rank of an MPI process is a unique ID number associated with the process. By convention, the console (master) process has rank 0. The ranks of the processes are guaranteed by MPI to form a consecutive, ascending sequence of integers, starting with 0.

tag:
Each message has associated with it a non-negative integer tag specified by the application. Our interface allows you to ignore tags by letting them take on default values. Typical application uses for tags are either to choose consecutive integers, in order to guarantee that all messages can be re-assembled in sequence, or to choose a fixed set of constant tags, each constant associated with a different type of message. In the latter case, one might have integer constants for a QUIT_TAG, an INTEGER_ARRAY_TAG, a START_TASK2_TAG, etc. In fact, our implementation of the Slave Listener and MasterSlave() uses certain tags of value 1000 and higher for exactly these purposes. Hence, application routines that do use tags should restrict themselves to tags in the range [0..999].
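
For instance, in C one might define a fixed set of tag constants and dispatch on the tag of a probed message. The sketch below is purely illustrative (the tag names and values are not part of ParGAP or MPI); note that the values stay within the application range [0..999]:

    #include <mpi.h>

    /* Illustrative application tags; ParGAP reserves 1000 and above. */
    #define QUIT_TAG           1
    #define INTEGER_ARRAY_TAG  2
    #define START_TASK2_TAG    3

    void dispatch_one_message( void )
    {
      MPI_Status status;
      MPI_Probe( MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &status );
      switch ( status.MPI_TAG ) {
        case QUIT_TAG:          /* ... receive it and shut down ... */      break;
        case INTEGER_ARRAY_TAG: /* ... receive into an int array ... */    break;
        case START_TASK2_TAG:   /* ... receive it and start task 2 ... */  break;
        default:                /* ... unexpected tag: report an error ... */ break;
      }
    }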

communicator:
A communicator in MPI serves the purpose of a namespace. Most MPI commands require a communicator argument to specify the namespace. MPI starts up with a default namespace, MPI_COMM_WORLD, and the ParGAP implementation always assumes that single namespace. Namespaces are important in MPI for building modules and library routines: they allow a thread to distinguish messages meant for itself, and to catch errors of cross-communication between two modules.
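
In a full MPI implementation (this goes beyond ParGAP's single-namespace subset), a library typically obtains its own namespace by duplicating MPI_COMM_WORLD. A minimal sketch, assuming MPI_Comm_dup() is available:

    #include <mpi.h>

    /* Give library traffic its own namespace: messages sent on lib_comm
       can never match an application's receives on MPI_COMM_WORLD. */
    void lib_init( MPI_Comm *lib_comm )
    {
      MPI_Comm_dup( MPI_COMM_WORLD, lib_comm );
    }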

message:
Each message in MPI is typically implemented to include fields for the source rank, destination rank (optional), tag, communicator, count, and an array of data. The count field specifies the length of the array. MPI guarantees that messages are non-overtaking, in the sense that if two messages are sent from a single source process to the same destination process, then the first message sent is guaranteed to be the first to arrive, and the first to be received or probed from the queue.
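
A minimal C sketch of the non-overtaking guarantee (ranks 0 and 1 and tag 0 are chosen arbitrarily):

    #include <mpi.h>

    /* On rank 1: two messages to rank 0 with the same tag. */
    void send_two( int first, int second )
    {
      MPI_Send( &first,  1, MPI_INT, 0, 0, MPI_COMM_WORLD );
      MPI_Send( &second, 1, MPI_INT, 0, 0, MPI_COMM_WORLD );
    }

    /* On rank 0: non-overtaking guarantees *a receives the first
       value sent and *b the second. */
    void recv_two( int *a, int *b )
    {
      MPI_Status status;
      MPI_Recv( a, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, &status );
      MPI_Recv( b, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, &status );
    }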

other:
MPI also has concepts of datatype, derived datatype, group, topology, etc. This implementation fixes default values for these: the datatype is always a character (hence the use of strings in ParGAP), no derived datatypes are implemented, the group is always the one consistent with MPI_COMM_WORLD, and the topology is the fully connected topology.
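
Thus a GAP-level call such as MPI_Send(str, dest) corresponds roughly to the following C call. This is only a sketch (whether the terminating NUL byte is transmitted is an implementation detail):

    #include <string.h>
    #include <mpi.h>

    /* ParGAP fixes the datatype to characters, so a GAP string is
       sent as an array of MPI_CHAR. */
    void send_string( const char *str, int dest, int tag )
    {
      MPI_Send( (void *)str, strlen( str ), MPI_CHAR, dest, tag, MPI_COMM_WORLD );
    }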

communication:
This implementation supports only point-to-point communication: receives are always blocking (except for MPI_Iprobe), and sends may or may not block, depending on the default behavior of the underlying sockets.
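
In C, MPI_Iprobe() sets a flag instead of blocking, which allows a process to poll for pending messages while doing other work. A minimal sketch:

    #include <mpi.h>

    /* Returns nonzero if a matching message is pending; never blocks.
       If a message is pending, *status describes it. */
    int message_pending( MPI_Status *status )
    {
      int flag;
      MPI_Iprobe( MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &flag, status );
      return flag;
    }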

collective communication:
The MPI standard also provides for collective communication, which sets up a barrier in which all processes within the named communicator must participate. One process is distinguished as the root process in cases of asymmetric usage. ParGAP does not implement any collective communication (although you can easily emulate it using a sequence of point-to-point commands; see the sketch below). The MPI subset distribution (in ParGAP's mpinu directory) does provide some commands for collective communication. Examples of MPI collective communication commands are MPI_Bcast (broadcast), MPI_Gather (place an entry from each process in an array residing on the root process), MPI_Scatter (the inverse of gather), and MPI_Reduce (apply a commutative, associative function to an entry from each process and store the result on the root; example functions are sum, and, xor, etc.).
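
For example, a broadcast can be emulated by a loop of sends on the root and a matching receive on every other process. A minimal C sketch (tag 0 chosen arbitrarily, no error handling):

    #include <mpi.h>

    /* Point-to-point emulation of
       MPI_Bcast( buf, count, MPI_INT, root, MPI_COMM_WORLD ). */
    void naive_bcast( int *buf, int count, int root )
    {
      int rank, size, dest;
      MPI_Status status;
      MPI_Comm_rank( MPI_COMM_WORLD, &rank );
      MPI_Comm_size( MPI_COMM_WORLD, &size );
      if ( rank == root ) {
        for ( dest = 0; dest < size; dest++ )
          if ( dest != root )
            MPI_Send( buf, count, MPI_INT, dest, 0, MPI_COMM_WORLD );
      } else {
        MPI_Recv( buf, count, MPI_INT, root, 0, MPI_COMM_WORLD, &status );
      }
    }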

dynamic processes:
The newer MPI-2 standard allows for the dynamic creation of new processes on new processors in an ongoing MPI computation. The standard is silent on whether an MPI session should be aborted if one of its member processes dies, and the MPI standard provides no mechanism to recognize such a dead process. Part of the reason for this silence is that much of the ancestry of MPI lies in dedicated parallel computers for which it would be unusual for one process or processor to die.

Here is a short extract of MPI code to illustrate its flavor. It shows the C equivalents of the following ParGAP commands. Note that the ParGAP versions take fewer parameters than their C-based cousins, since ParGAP supplies defaults for some optional parameters.

  • MPI_Init() [ called for you automatically when ParGAP is loaded ] F
  • MPI_Finalize() [ called for you automatically when GAP quits ] F
  • MPI_Comm_rank() F
  • MPI_Get_count() F
  • MPI_Get_source() F
  • MPI_Get_tag() F
  • MPI_Comm_size() F
  • MPI_Send( string buf, int dest[, int tag = 0 ] ) F
  • MPI_Recv( string buf [, int source = MPI_ANY_SOURCE[, int tag = MPI_ANY_TAG ] ] ) F
  • MPI_Probe( [ int source = MPI_ANY_SOURCE[, int tag = MPI_ANY_TAG ] ] ) F

Many of the above commands have higher-level analogues in section Slave Listener Commands: GetLastMsgSource(), GetLastMsgTag(), MPI_Comm_size() (= TOPCnumSlaves + 1), SendMsg(), RecvMsg() and ProbeMsg().

    #include <stdio.h>
    #include <stdlib.h>
    #include <mpi.h>
    
    #define MYCOUNT 5
    #define INT_TAG 1
    
    int main( int argc, char *argv[] )
    {
      int myrank;
      MPI_Init( &argc, &argv );
      MPI_Comm_rank( MPI_COMM_WORLD, &myrank );
    
      if ( myrank == 0 ) {
        int mysize, dest, i;
        int buf[MYCOUNT];
        printf("My rank (master):  %d\n", myrank);
        for ( i=0; i<MYCOUNT; i++ )
          buf[i] = 5;
        MPI_Comm_size( MPI_COMM_WORLD, &mysize );
        printf("Size:  %d\n", mysize);
        for ( dest=1; dest<mysize; dest++ )
          MPI_Send( buf, MYCOUNT, MPI_INT, dest, INT_TAG, MPI_COMM_WORLD );
      } else {
        int i;
        MPI_Status status;
        int source;
        int count;
        int *buf;
        printf("My rank (slave):  %d\n", myrank);
    
        MPI_Probe( MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &status );
        printf( "Message pending with rank %d and tag %d.\n",
                status.MPI_SOURCE, status.MPI_TAG );
        if ( status.MPI_TAG != INT_TAG ) {
          printf("Error: Bad tag.\n");
          exit(1);
        }
        MPI_Get_count( &status, MPI_INT, &count );
        printf( "The count of how many data units (MPI_INT) is:  %d.\n", count );
        buf = (int *)malloc( count * sizeof(int) );
    
        source = status.MPI_SOURCE;
        MPI_Recv( buf, count, MPI_INT, source, INT_TAG, MPI_COMM_WORLD, &status );
        for ( i=0; i<count; i++ )
          if ( buf[i] != 5 ) printf("error:  buf[%d] != 5\n", i);
        printf("slave %d done.\n", myrank);
      }
      MPI_Finalize();
      exit(0);
    }
    

Even in this simplistic example, it was important to specify

    MPI_Recv( buf, count, MPI_INT, source, INT_TAG, MPI_COMM_WORLD, &status );
    

and not to use MPI_ANY_SOURCE instead of the known source. Although this alternative would often work, there is a danger that a second incoming message from a different source arrives between the calls to MPI_Probe() and MPI_Recv(). In such an event, MPI would be free to receive the second message in MPI_Recv(), even though the count of the second message is likely to be different, thus risking an overflow of the buf buffer.

Other typical bugs in MPI programs are:

  • Incorrectly matching corresponding sends and receives, or having more or fewer sends than receives, due to the logic of multiple sends and receives within distinct loops.

  • Reaching deadlock because all active processes have blocking calls to MPI_Recv() while no process has yet reached code that executes MPI_Send() (a minimal sketch of this situation appears after this list).

  • Incorrect use of barriers in collective communication, whereby one process might execute:

    MPI_Send( buf, count, datatype, dest, tag, COMM_1 );
    MPI_Bcast( buffer, count, datatype, root, COMM_2 );
    

    and a second executes

    MPI_Bcast( buffer, count, datatype, root, COMM_2 );
    MPI_Recv( buf, count, datatype, dest, tag, COMM_1, status );
    

    If the call to MPI_Send() is blocking (as is the case for long messages in many implementations), then the first process will block at MPI_Send() while the second blocks at MPI_Bcast(). This happens even though they use distinct communicators, and the send-receive communication would not normally interact with the broadcast communication.
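
The deadlock bug above can be reproduced in a few lines. In the following sketch, if every process calls this routine to exchange a value with a partner, all of them block in MPI_Recv() and no process ever reaches MPI_Send(); reversing the order of the two calls on one side of each pair breaks the cycle:

    #include <mpi.h>

    /* DEADLOCKS if both partners execute it: each blocks in MPI_Recv()
       waiting for a message that the other has not yet sent. */
    void deadlocking_exchange( int *in, int *out, int partner )
    {
      MPI_Status status;
      MPI_Recv( in,  1, MPI_INT, partner, 0, MPI_COMM_WORLD, &status );
      MPI_Send( out, 1, MPI_INT, partner, 0, MPI_COMM_WORLD );
    }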

Much of the TOP-C method in ParGAP (see chapters Basic Concepts for the TOP-C model (MasterSlave) and MasterSlave Tutorial) was developed precisely to make errors like those above syntactically impossible. The slave listener layer also does some additional work to keep track of the status that was last received and other bookkeeping. Additionally, the TOP-C method was designed to provide a higher level, task-oriented ``language'', which would naturally lead the application programmer into designing an efficient high level algorithm.

7.2 Other low level commands

Here is a complete listing of the low level commands available in ParGAP. Some of these commands were documented elsewhere; the remaining ones are not recommended for most users. Nevertheless, they may be useful in more sophisticated applications.

For most of these commands, the source code is the ultimate documentation. However, you may be able to guess at the meaning of many of them based on their names and their similarity to UNIX system calls (in the case of UNIX_...) or MPI commands (in the case of MPI_...). Some of the commands will also show you their calling parameters if called with the wrong number of arguments. Many of the MPI commands have simplified calling parameters, with certain arguments optional or set to defaults, making them easier to use interactively.

  • UNIX_MakeString( len ) F
  • UNIX_DirectoryCurrent() [ Defined in `pkg/pargap/lib/slavelist.g' ] F
  • UNIX_Chdir( string ) F
  • UNIX_FflushStdout() F
  • UNIX_Catch( function, return_val ) F
  • UNIX_Throw() F
  • UNIX_Getpid() F
  • UNIX_Hostname() F
  • UNIX_Alarm( seconds ) F
  • UNIX_Realtime() F
  • UNIX_Nice( priority ) F
  • UNIX_LimitRss( bytes_of_ram ) [ = setrlimit(RLIMIT_RSS, ...) ] F

  • MPI_Init() F
  • MPI_Initialized() F
  • MPI_Finalize() F
  • MPI_Comm_rank() F
  • MPI_Get_count() F
  • MPI_Get_source() F
  • MPI_Get_tag() F
  • MPI_Comm_size() F
  • MPI_World_size() F
  • MPI_Error_string( errorcode ) F
  • MPI_Get_processor_name() F
  • MPI_Attr_get( keyval ) F
  • MPI_Abort( errorcode ) F
  • MPI_Send( string buf, int dest[, int tag = 0 ] ) F
  • MPI_Recv( string buf [, int source = MPI_ANY_SOURCE[, int tag = MPI_ANY_TAG ] ] ) F
  • MPI_Probe( [ int source = MPI_ANY_SOURCE[, int tag = MPI_ANY_TAG ] ] ) F
  • MPI_Iprobe( [ int source = MPI_ANY_SOURCE[, int tag = MPI_ANY_TAG ] ] ) F

  • MPI_ANY_SOURCE V
  • MPI_ANY_TAG V
  • MPI_COMM_WORLD V
  • MPI_TAG_UB V
  • MPI_HOST V
  • MPI_IO V
