1 Writing Parallel Programs in GAP Easily

The ParGAP (Parallel GAP) package provides a way of writing parallel programs using the GAP language. Former names of the package were ParGAP/MPI and GAP/MPI; the word MPI refers to Message Passing Interface, a well-known standard for parallelism. ParGAP is based on the MPI standard, and this distribution includes a subset implementation of MPI, to provide a portable layer with a high level interface to BSD sockets. Since knowledge of MPI is not required for use of this software, we now refer to the package as simply ParGAP. For more information visit the author's ParGAP home page at: http://www.ccs.neu.edu/home/gene/pargap.html

For some background reading, see Coo95 and Coo97.

This first chapter is intended to help a new user set up ParGAP and run through some quick examples: see

Section Overview of ParGAP for an overview of the features of ParGAP and a general discussion of how it's implemented;
Section Installing ParGAP for how to install ParGAP;
Section Running ParGAP for how to run ParGAP (not by using RequirePackage); and
Section Extended Example for some introductory ParGAP examples.

The later chapters present detailed explanations of the facilities of ParGAP. Because parallel programming is sufficiently different from sequential programming, this author recommends printing out at least Chapters 1 through MasterSlave Tutorial, and skimming through those chapters for areas of interest, before returning to the terminal to try out some of the ideas. This document can be found in .../pkg/pargap/doc/manual.dvi of the software distribution. You may also want to print the index at the end of manual.dvi. In particular, the heading example in the index, or ??example from within GAP, should be useful. If you prefer postscript, the UNIX command dvips will convert that file to postscript form.

The development of ParGAP was partially supported by National Science Foundation grants CCR-9509783 and CCR-9732330.

1.1 Overview of ParGAP

ParGAP is currently functional only on UNIX installations. (Cygwin for Windows is also an option, if you would like to port it.) ParGAP can be installed on top of an existing GAP installation. See Section Installing ParGAP for instructions on installation of ParGAP. At the time that ParGAP is invoked, a special ``procgroup'' file must be available to tell ParGAP which processors to use for slave processors. See sections Installing ParGAP and Extended Example for instructions on invoking ParGAP. If there are questions or bugs concerning ParGAP, please write to: gene@ccs.neu.edu

If you wish only to try out the parallel features, the first five pages of this manual (through the section on the slave listener) will suffice for installing and using ParGAP. More advanced users who wish to design new parallel algorithms or port old sequential code to a parallel environment are strongly encouraged to also read the sections following on from Section Basic Concepts for the TOP-C model (MasterSlave).

ParGAP should be invoked via the script bin/pargap.sh, created by the installation process, which in turn invokes GAP_ROOT_DIR/bin/ARCH/pargapmpi. Here ARCH depends on your system, but it is the same directory in which the gap binary is found. MPI and the higher layers will not be available if the binary is invoked in the standard way as gap. This is a feature, since a single binary and source distribution serves both for the standard GAP and for ParGAP.

ParGAP is implemented in three layers: 1) MPI, 2) Slave Listener, and 3) Master Slave (TOP-C abstraction). Most users will find that the two highest layers (Slave Listener and Master Slave) meet all their needs.

1) MPI:

The lowest layer is MPI. Most users can ignore this layer. MPI is a standard for message-based parallel computation. A subset of the original MPI commands is provided. The syntax is modified from the original C binding to make a GAP binding in an interpreted environment more convenient. This includes default arguments, useful return values, and Error break in the presence of errors. MPI_Init() (see MPI_Init) and MPI_Finalize() (see MPI_Finalize) are invoked automatically by ParGAP.

The MPI layer is not documented, since most users will not be using it. From the GAP level, you can type MPI_ followed by two tab characters to see all implemented MPI functions and variables. Typing a symbol name alone (e.g.: MPI_Send; ) will cause it to display the calling syntax. The same information is displayed after an incorrect call. The return value is typically obvious. MPI is implemented in src/pargap.c. The standard distribution uses a simple, subset implementation of MPI in pkg/gapmpi/mpinu/, which is implemented on top of a standard sockets interface. It is possible to substitute other implementations of MPI.

For those who wish to directly use the MPI interface, the meanings of the MPI calls are best found from the standard MPI documentation:

MPI Forum: http://www.mpi-forum.org/

MPI Standard (version 1.1): http://www.mpi-forum.org/docs/mpi-11-html/mpi-report.html

UNIX style man pages: http://www-c.mcs.anl.gov/mpi/www/

2) Slave Listener:

This layer provides basic message passing facilities for communication among multiple ParGAP processes in a form that is more convenient for programming than the lower MPI layer. This will be the most useful entry point to ParGAP for most users. This is the default mode for ParGAP. Each remote (slave) process is in a receive-eval-send loop, in which the slave receives a GAP command from the local or master, the slave evaluates the GAP command, and the slave then sends the result back to the master as a GAP object.

Almost all commands in the slave listener are of the form *Msg* e.g. SendMsg() (see SendMsg), RecvMsg() (see RecvMsg), ProbeMsg() (see ProbeMsg). Since the slave is in a receive-eval-send loop, every SendMsg(cmd) on the master must be balanced by a later RecvMsg(). SendRecvMsg() (see SendRecvMsg) is provided to combine these steps. A few parallel utilities are also included, such as ParRead() (ParRead), ParList() (ParList), ParEval() (ParEval), etc.

Messages are arbitrary GAP objects. Note that arguments to any GAP function are evaluated before being passed to the function. Hence, any argument to SendMsg() or ParEval() would be evaluated locally before being sent across the network. For this reason, arguments can also be given as strings, to delay evaluation until reaching the destination process. Consequently, the quotes of any real string inside such a command must be escaped: ParEval("x:=\"abc\";"); Additionally, multiple commands are valid, and the final ``;'' of the string is optional. So, one can write:

BroadcastMsg("x:=\"abc\"; Print(Length(x), \"\\n\")");;

A full description is contained in Chapter Slave Listener.

3) Master Slave:

The Master Slave facility is provided both for writing complex parallel software, and as an easier way to parallelize previous or ``legacy'' sequential code. While the Slave Listener may be sufficient for simple parallel requirements, more complex software requires a higher level abstraction. The fundamental abstractions of the master slave layer are the task and the shared data.

1): The task typically corresponds to the procedure or inner body of a loop in a sequential program. This is the part that must be repetitively computed in parallel.
2): The shared data typically corresponds to the data of a sequential program that is not within the local scope of the task. Often this is a global data structure. In the case that the task is the inner body of a loop, the shared data may be a local data structure that is outside the local scope of the loop.

It is usually quite easy to identify the task and the shared data of a sequential program or algorithm, which is the first step in parallelizing an algorithm.

The Master Slave parallel model described here has also been successfully used in C and in LISP. It has been used both in distributed memory and shared memory environments, although this version in GAP currently works only in a distributed environment. In the C language, this parallel model is known as TOP-C (Task Oriented Parallel C). For examples of the use of the TOP-C model see Coo98, CFTY94, CH97, CHLM97, CLMW96, and CT96.

While no parallel software can eliminate the problem of designing an algorithm that is efficient in a parallel environment, the TOP-C abstraction eases the job by eliminating programmer concerns about lower level details, such as message passing, migration and replication of data, load balancing, etc. This leaves the programmer to concentrate on the primary goal: maximizing the concurrency or parallelism.

1.2 Installing ParGAP

Installing ParGAP should be relatively simple. However, since there are many interactions both with the GAP kernel and with the UNIX operating system, in a minority of cases, manual intervention will be necessary. If you are part of this minority, please see the section Problems with Installation. The most common problem is the local security policy; ParGAP is more pleasant to use when you don't have to manually provide the password for each slave. See section Problems with Passwords (Getting Around Security) for suggestions in this respect.

To install the ParGAP package, move the file pargap-XXX.zoo or pargap-XXX.tar.gz (for some version number XXX of ParGAP) into the pkg directory in which you plan to install ParGAP. Usually, this will be the directory pkg in the hierarchy of your version of GAP 4. (In fact, it is currently not possible to have the pkg directory separate from GAP's pkg directory. We hope to remedy this in future versions of ParGAP, so that it will also be possible to keep an additional pkg directory in your private directories; section Installing GAP Packages of the GAP 4 reference manual gives details on how to do this, when it's possible.)

Now change into the pkg directory in which you plan to install ParGAP. If you got a .zoo file, unpack it with:

unzoo -x pargap-XXX

If you got a .tar.gz file and your tar command supports the z option, unpack it with:

tar zxf pargap-XXX.tar.gz

or otherwise unpack in two steps with:

gunzip pargap-XXX.tar.gz
tar xvf pargap-XXX.tar

Whether you got the .zoo or .tar.gz archive you should now have a new directory pargap. As for a generic GAP package, do:

cd pargap
./configure ../..
make

If your version of GAP is earlier than GAP 4.3 you will first need to adjust GAP's lib/init.g file; see item 0. of Section Problems with Installation.

Your ParGAP should now be ready to use. Now read the next section, which describes how to run ParGAP (if you are reading this from GAP's on-line help, type: ?>).

1.3 Running ParGAP

After doing the configure and make steps of ParGAP's installation process (see Section Installing ParGAP), you should find in ParGAP's bin subdirectory a script

pargap.sh

which you should use to start ParGAP. (ParGAP can not be started by starting GAP 4 in the usual way, and using RequirePackage; doing so will result in Info-ed advice to read this section.) Edit the pargap.sh script if necessary, copy it to a standard path and rename it according to how you intend to call ParGAP (e.g. rename it: pargap). Also, in the bin subdirectory is a sample procgroup file which defines the master and slave processes that will be used by ParGAP. When ParGAP is started it looks for a file called procgroup in the current directory, unless the -p4pg option is used. Thus if you renamed your shell script pargap, the following are valid ways of starting ParGAP:

pargap

(if current directory contains the file: procgroup), or

pargap -p4pg myprocgroupfile

(where myprocgroupfile is the complete path of your procgroup file -- there is no restriction on how you name it).
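The lookup rule just described can be sketched in a few lines of shell. This is only an illustration of the rule as stated in this section, not the code ParGAP itself uses; the function name resolve_procgroup is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper (not part of ParGAP): report which procgroup file
# a given command line would select, following the rule described above.
resolve_procgroup() {
    pg="./procgroup"                 # default: file in the current directory
    while [ "$#" -gt 0 ]; do
        if [ "$1" = "-p4pg" ] && [ "$#" -ge 2 ]; then
            pg="$2"                  # an explicit path overrides the default
            shift
        fi
        shift
    done
    echo "$pg"
}

resolve_procgroup                               # prints ./procgroup
resolve_procgroup -p4pg /home/joe/procgroup     # prints /home/joe/procgroup
```

A check such as `[ -r "$(resolve_procgroup "$@")" ]` in a wrapper script would catch a missing procgroup file before ParGAP is started.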

If you had trouble installing ParGAP, see the section Problems with Installation. Otherwise continue onto Section Extended Example and try out ParGAP.

Note: The script pargap.sh defines the program that runs ParGAP as pargapmpi. In fact, after installation pargapmpi is a symbolic link to the GAP binary named gap. The same binary runs both GAP and ParGAP; when the binary is invoked as gap GAP runs in the usual way without any parallel features; only when the binary is invoked as pargapmpi are the parallel features incorporated. See Section Modifying the GAP kernel for more details.
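The idea of one file behaving differently depending on the name it was started under can be tried out with a small stand-in. The sketch below does not touch GAP at all: the file called gap here is a hypothetical stub script, and the messages it prints are invented for the demonstration.

```shell
#!/bin/sh
# Illustration only: a stand-in script that reports the name it was started under.
# The real GAP kernel makes this decision internally; this stub is hypothetical.
demo_dir=$(mktemp -d)

cat > "$demo_dir/gap" <<'EOF'
case "$(basename "$0")" in
    pargapmpi) echo "parallel features enabled" ;;
    *)         echo "plain sequential GAP" ;;
esac
EOF

# Same file, second name: this mirrors the pargapmpi -> gap symbolic link.
ln -s gap "$demo_dir/pargapmpi"

# Running each name through sh keeps the demo independent of execute permissions.
as_gap=$(sh "$demo_dir/gap")
as_pargap=$(sh "$demo_dir/pargapmpi")
echo "invoked as gap:       $as_gap"
echo "invoked as pargapmpi: $as_pargap"

rm -rf "$demo_dir"
```

Because the link and its target are the same file, upgrading or rebuilding the target updates both names at once.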

Now you are ready to test your installation: try the example in the following section (if you are reading this from GAP's on-line help, type: ?>).

1.4 Extended Example

After installation, try it out. Invoke ParGAP as described in Section Running ParGAP and try the example below (but substitute your own program where you see "/home/gene/myprogram.g"). The commands in this first example are also found in the README file. So, you may wish to copy text from the README file and paste it into a ParGAP session. If you are using the unmodified procgroup file, your remote slaves will be other processes on your local machine. It is a good idea to run only on your local machine for your first experiments and while you are debugging parallel programs. When you wish to experiment with using remote machines, you can then proceed to the following section, Invoking ParGAP with Remote Slaves.

gap> # This assumes your procgroup file includes two slave processes.
gap> PingSlave(1); #a `true' response indicates Slave 1 is alive
true
gap> # Print() on slave appears on standard output 
gap> # i.e. after the master's prompt.
gap> SendMsg( "Print(3+4)" );
gap> 7
gap> # A <return> was input above to get a fresh prompt.
gap> #
gap> # To get special characters (including newline: `\n')
gap> # into a string, escape them with a `\'.
gap> SendMsg( "Print(3+4,\"\\n\")" );
gap> 7

gap> # Again, a <return> was input above after the 7 and new-line
gap> # were printed to get a fresh prompt.
gap> #
gap> # Each SendMsg() is normally balanced by a RecvMsg().
gap> SendMsg( "3+4", 2);
gap> RecvMsg( 2 );
7
gap> # The following is equivalent to the two previous commands.
gap> SendRecvMsg( "3+4", 2);
7
gap> # Flush any messages that are pending. The response is
gap> # the number of messages flushed. (Above, the two
gap> # SendMsg("Print...") (to the default slave: 1) did not
gap> # have a corresponding RecvMsg() command.)
gap> FlushAllMsgs();
2
gap> # As with Print() the result of Exec() appears on standard
gap> # output. Print() and Exec() are each `no-value' functions,
gap> # and so the result of a RecvMsg() in these cases
gap> # is "<no_return_val>".
gap> SendRecvMsg( "Exec(\"pwd\")" ); # Your pwd will differ :-)
/home/gene
"<no_return_val>"
gap> # Put default slave into an infinite loop.
gap> SendMsg("while true do od");
gap> # Default slave can't execute the next command until it's 
gap> # finished with the previous command.
gap> SendMsg("Print(\"WAKE UP\\n\")");
gap> # Check to see if a message is waiting to be collected but
gap> # return immediately (i.e. don't get blocked by waiting for
gap> # a message to appear). A `false' response indicates the
gap> # infinite loop hasn't terminated and produced a value yet!
gap> ProbeMsgNonBlocking();
false
gap> # Send an interrupt to each slave, slave 1 will see the
gap> # following command and print `WAKE UP', and then all
gap> # pending messages are flushed.
gap> ParReset();
... resetting ...
WAKE UP
0
gap> # The return value, 0, from ParReset() indicates there
gap> # were 0 pending messages flushed, confirming correctness
gap> # of ProbeMsgNonBlocking() when it returned "false"
gap> SendRecvMsg( "a:=45; 3+4", 1 );
7
gap> # Note "a" is defined on slave 1, not slave 2.
gap> SendMsg( "a", 2 ); # Slave prints error, output on master
gap>  Variable: 'a' must have a value
gap> # <return> entered to get fresh prompt.
gap> RecvMsg( 2 ); # No value for last SendMsg() command
"<no_return_val>"
gap> RecvMsg( 1 );
45
gap> myfnc := function() return 42; end;;
gap> # Use PrintToString() to define myfnc on all slave processes
gap> BroadcastMsg( PrintToString( "myfnc := ", myfnc ) );
gap> SendRecvMsg( "myfnc()", 1 );
42
gap> FlushAllMsgs(); # There are no messages pending.
0
gap> # Execute analogue of GAP's List() in parallel on slaves.
gap> squares := ParList( [1..100], x->x^2 );
[ 1, 4, 9, 16, 25, 36, 49, 64, 81, 100, 121, 144, 169, 196, 225, 256, 
  289, 324, 361, 400, 441, 484, 529, 576, 625, 676, 729, 784, 841, 
  900, 961, 1024, 1089, 1156, 1225, 1296, 1369, 1444, 1521, 1600, 
  1681, 1764, 1849, 1936, 2025, 2116, 2209, 2304, 2401, 2500, 2601, 
  2704, 2809, 2916, 3025, 3136, 3249, 3364, 3481, 3600, 3721, 3844, 
  3969, 4096, 4225, 4356, 4489, 4624, 4761, 4900, 5041, 5184, 5329, 
  5476, 5625, 5776, 5929, 6084, 6241, 6400, 6561, 6724, 6889, 7056, 
  7225, 7396, 7569, 7744, 7921, 8100, 8281, 8464, 8649, 8836, 9025, 
  9216, 9409, 9604, 9801, 10000 ]
gap> # Ensure problem shared data is read into master and slaves.
gap> # Try one of your GAP program files instead.
gap> ParRead( "/home/gene/myprogram.g");

Now that you have done a fairly rudimentary test of ParGAP you should be ready to do something a little bit more interesting:

gap> ParInstallTOPCGlobalFunction( "MyParList",
> function( list, fnc )
>   local result, iter;
>   result := [];
>   iter := Iterator(list);
>   MasterSlave( function() if IsDoneIterator(iter) then return NOTASK;
>                           else return NextIterator(iter); fi; end,
>                fnc,
>                function(input,output) result[input] := output;
>                                       return NO_ACTION; end,
>                Error
>              );
>   return result;
> end );
gap> MyParList( [1..25], x->x^3 );
master -> 1:  1
master -> 2:  2
2 -> master: 8
1 -> master: 1
master -> 1:  3
master -> 2:  4
2 -> master: 64
1 -> master: 27
master -> 1:  5
master -> 2:  6
2 -> master: 216
1 -> master: 125
master -> 1:  7
master -> 2:  8
2 -> master: 512
1 -> master: 343
master -> 1:  9
master -> 2:  10
2 -> master: 1000
1 -> master: 729
master -> 1:  11
master -> 2:  12
2 -> master: 1728
1 -> master: 1331
master -> 1:  13
master -> 2:  14
2 -> master: 2744
1 -> master: 2197
master -> 1:  15
master -> 2:  16
2 -> master: 4096
1 -> master: 3375
master -> 1:  17
master -> 2:  18
2 -> master: 5832
1 -> master: 4913
master -> 1:  19
master -> 2:  20
2 -> master: 8000
1 -> master: 6859
master -> 1:  21
master -> 2:  22
2 -> master: 10648
1 -> master: 9261
master -> 1:  23
master -> 2:  24
2 -> master: 13824
1 -> master: 12167
master -> 1:  25
1 -> master: 15625
[ 1, 8, 27, 64, 125, 216, 343, 512, 729, 1000, 1331, 1728, 2197, 2744, 3375, 
  4096, 4913, 5832, 6859, 8000, 9261, 10648, 12167, 13824, 15625 ]
gap> ParInstallTOPCGlobalFunction( "MyParListWithAglom",
> function( list, fnc, aglomCount )
>   local result, iter;
>   result := [];
>   iter := Iterator(list);
>   MasterSlave( function() if IsDoneIterator(iter) then return NOTASK;
>                           else return NextIterator(iter); fi; end,
>                fnc,
>                function(input,output)
>                  local i;
>                  for i in [1..Length(input)] do
>                    result[input[i]] := output[i];
>                  od;
>                  return NO_ACTION;
>                end,
>                Error,  # Never called, can specify anything
>                aglomCount
>              );
>   return result;
> end );
gap> MyParListWithAglom( [1..25], x->x^3, 4 );
master -> 1: (AGGLOM_TASK): [ 1, 2, 3, 4 ]
master -> 2: (AGGLOM_TASK): [ 5, 6, 7, 8 ]
1 -> master: [ 1, 8, 27, 64 ]
2 -> master: [ 125, 216, 343, 512 ]
master -> 1: (AGGLOM_TASK): [ 9, 10, 11, 12 ]
master -> 2: (AGGLOM_TASK): [ 13, 14, 15, 16 ]
1 -> master: [ 729, 1000, 1331, 1728 ]
2 -> master: [ 2197, 2744, 3375, 4096 ]
master -> 1: (AGGLOM_TASK): [ 17, 18, 19, 20 ]
master -> 2: (AGGLOM_TASK): [ 21, 22, 23, 24 ]
1 -> master: [ 4913, 5832, 6859, 8000 ]
2 -> master: [ 9261, 10648, 12167, 13824 ]
master -> 1: (AGGLOM_TASK): [ 25 ]
1 -> master: [ 15625 ]
[ 1, 8, 27, 64, 125, 216, 343, 512, 729, 1000, 1331, 1728, 2197, 2744, 3375, 
  4096, 4913, 5832, 6859, 8000, 9261, 10648, 12167, 13824, 15625 ]

If you wish an accelerated introduction to the models of parallel programming provided here, you might wish to read the beginning of Chapter Slave Listener through section Slave Listener Commands, and then proceed immediately to Chapter Basic Concepts for the TOP-C model (MasterSlave).

1.5 Author

The ParGAP package was designed and written by Gene Cooperman, College of Computer Science, Northeastern University, Boston, MA, U.S.A.

If you use ParGAP to solve a problem then please send a short email to gene@ccs.neu.edu about it, and cite the ParGAP package as follows:

\bibitem[Coo99]{Coo99}
      Cooperman, Gene,
      {\sl Parallel GAP/MPI (ParGAP/MPI)}, Version 1,
      College of Computer Science, Northeastern University, 1999,
      \verb+http://www.ccs.neu.edu/home/gene/pargap.html+.

1.6 Invoking ParGAP with Remote Slaves

ParGAP, unlike GAP, must be invoked under a separate name. After ParGAP has been installed, a script bin/pargap.sh will have been created which (after any changes you needed to make; see Section Installing ParGAP) you should use to invoke ParGAP. This is similar to GAP_ROOT_DIR/bin/gap.sh that is used to invoke the non-parallel GAP. Installers are encouraged to treat pargap.sh in analogy to gap.sh. For example, if your site has copied gap.sh to /usr/local/bin/gap, then you should also look for the pargap.sh script as /usr/local/bin/pargap.

In addition, when pargap (we'll assume that's how ParGAP is invoked at your site) is called, there must be a file named procgroup in the current directory. Alternatively, if you wish to use a single procgroup file for all jobs, and that procgroup file is in /home/joe, then you can alias pargap to pargap -p4pg /home/joe/procgroup.

The procgroup file has a simple syntax, taken from the MPICH implementation of MPI (inherited from P4). A # in column 1 introduces a comment line. The first non-comment line should be local 0, verbatim. This line declares the master process as the local process. Other lines are of the form:

host-machine 1 pargap-script

e.g.

regulus.ccs.neu.edu 1 /usr/local/bin/pargap

The first field is the hostname for a remote process. The second field specifies one thread per process. (ParGAP recognizes only the value 1 for the second field.) The third field is an absolute pathname for ParGAP, as it would be called on the remote process. Note that you can repeat the same line twice if you want two remote ParGAP processes on the same processor. The default procgroup provided in the distribution will have lines of form:

localhost 1 path-of-provided-pargap.sh

If you change path-of-provided-pargap.sh to just, say, pargap, this will work only if pargap is in your path on the remote machine shell (localhost in this case), using your default shell. On most machines, localhost is an alias for the local processor. This is a good default for debugging, so that you don't disturb users on other machines.
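Since a malformed procgroup file is an easy mistake to make, it can help to check the file against the rules above before starting ParGAP. The following is a minimal sketch of such a check; check_procgroup is a hypothetical helper that is not shipped with ParGAP, and its treatment of blank lines (they are skipped) is an assumption.

```shell
#!/bin/sh
# Hypothetical checker (not shipped with ParGAP) for the procgroup rules above.
check_procgroup() {
    awk '
        /^#/        { next }                 # "#" in column 1: comment line
        NF == 0     { next }                 # blank lines: skipped here (an assumption)
        !seen_local {                        # first non-comment line
            seen_local = 1
            if ($0 != "local 0") { print "line " NR ": expected \"local 0\""; bad = 1 }
            next
        }
        NF != 3 || $2 != "1" {               # hostname, the value 1, script path
            print "line " NR ": expected \"host-machine 1 pargap-script\""; bad = 1
        }
        END {
            if (!seen_local) { print "no \"local 0\" line found"; bad = 1 }
            exit bad
        }
    ' "$1"
}

# Sample file: a master plus two slaves on the local machine.
pg=$(mktemp)
cat > "$pg" <<'EOF'
# master process
local 0
# slave processes
localhost 1 /usr/local/bin/pargap
localhost 1 /usr/local/bin/pargap
EOF
check_procgroup "$pg" && echo "procgroup looks well-formed"
rm -f "$pg"
```

The function prints one message per offending line and returns a non-zero status if any rule is violated, so it can be used as a guard in a wrapper script.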

MPI will use a line

host-machine 1 pargap-script

to create a UNIX subprocess executing:

rsh host-machine pargap-script

Suppose host-machine is regulus.ccs.neu.edu and pargap-script is /usr/local/bin/pargap, as in the above example. If you have trouble invoking ParGAP, it is a good idea to try invoking rsh regulus.ccs.neu.edu from a UNIX prompt and, if that succeeds, to then try executing the full rsh command.
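To see which commands to try by hand, the slave lines of a procgroup file can be turned into the corresponding rsh commands. The sketch below only prints the commands and never runs them; list_slave_commands is a hypothetical helper, and it assumes the three-field line layout described above.

```shell
#!/bin/sh
# Dry run: print, but do not execute, one rsh command per slave line.
# Assumes the three-field procgroup layout described above.
list_slave_commands() {
    awk '
        /^#/ || NF == 0 || $1 == "local" { next }   # skip comments, blanks, master line
        { print "rsh", $1, $3 }                     # hostname and script path
    ' "$1"
}

# Sample file using the host from the example above.
pg=$(mktemp)
printf '%s\n' 'local 0' 'regulus.ccs.neu.edu 1 /usr/local/bin/pargap' > "$pg"
list_slave_commands "$pg"    # prints: rsh regulus.ccs.neu.edu /usr/local/bin/pargap
rm -f "$pg"
```

Each printed line can then be pasted at a UNIX prompt to check whether that particular slave can be reached without a password.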

A typical problem is that the remote processor requires a password to login. MPI requires a login without passwords. Typically, /etc/hosts.equiv has not been set up to remove the password requirement for your remote host. Sometimes this can be solved by an appropriate .rhosts file in your home directory on the remote host. Sometimes, PAM is also used for user authentication (see /etc/pam.conf). man in.rshd also has helpful information. Consult your system staff for further analysis. In these days of hyper-security, rsh may be disabled at your site and you may have to use ssh instead; if so, there is a solution here: add the lines

#############################################################################
##
##  RSH . . . .. . . . . . . . . . . . . . . . .  remote shell used by ParGAP
##
##
RSH=ssh
export RSH

before the GAP block with the exec line. (Of course, the # lines are not needed; they are comments.)

Note that the remote ParGAP process will not read from standard input, although signals such as SIGINT (^C) may be received by the remote process. However, the remote ParGAP process will write to standard output, which is relayed to the local process. So,

gap> SendMsg("Exec(\"hostname\")", 2);

will execute and print from the remote process.

1.7 Problems with Installation

If you still have problems, here is a list of things to check.

0.

In versions of GAP earlier than GAP 4.3 some ParGAP ``hooks'' need to be added to GAP's lib/init.g file. Please add:

PAR_GAP_SLAVE_START := fail;

before the line:

       READ(GAP_RC_FILE);

and add:

if PAR_GAP_SLAVE_START <> fail then PAR_GAP_SLAVE_START(); fi;

at the end of the file.

1.

Do you have enough swap space to support multiple GAP processes? A simple way to check this is with the UNIX command, top. The Linux version of top sorts by memory usage if you type M.

2.

make tries to automatically create:

pkg/pargap/bin/pargap.sh

and copy the parameters from GAP_ROOT/bin/gap.sh. GAP_ROOT was specified when you executed ./configure GAP_ROOT to install ParGAP. This can be error-prone if your site has an unusual setup. If you execute GAP_ROOT/bin/gap.sh, does gap come up? If so, compare it with pargap.sh and check for correct settings in .../pkg/pargap/bin/pargap.sh.

3.

Did ParGAP find your procgroup file? [It looks in the current directory for procgroup, or for:

... -p4pg PATH/procgroup

on the command line.]

4.

Were the remote slave processes able to start up? If so, could they connect back to the master? To test connectivity problems, try manually starting a remote slave by executing a line in the script. Try a simple rsh remote-hostname to see if the issue is with security. If your site uses ssh instead of rsh, then there is a security issue. Read Section Problems with Passwords (Getting Around Security), and possibly man sshd.

5.

If the previous step failed due to security issues, such as requesting a password, you have several options. man rshd tells you the security model at your site (or possibly man ssh if you use that). Then read Section Problems with Passwords (Getting Around Security).

6.

Is the procgroup file in your current directory set correctly? Test it. If you are calling it on a remote host, manually type:

rsh HOSTNAME ParGAP

where HOSTNAME and ParGAP appear exactly as in procgroup, e.g.

rsh denali.ccs.neu.edu /usr/local/gap4r3/bin/pargap.sh

In some cases, exec is used to save process overhead. Also try:

rsh HOSTNAME exec ParGAP

If you plan to call it on localhost, try just: ParGAP

Note that if not all the slave processes succeed in connecting to the master, then ParGAP writes out a file:

/tmp/pargapmpi--rsh.$$

where $$ is replaced by the process id of the ParGAP process.

7.

Is pargap listed in .../pkg/ALLPKG? [It's needed to autostart slaves.]

8.

Inside ParGAP, has MPI been successfully initialized? Try:

gap> MPI_Initialized();

9.

A remote (slave) ParGAP process starts in your home directory and tries to cd to a directory of the same name as your local directory. Check your assumptions about the remote machine. Try:

gap> SendRecvMsg("Exec(\"pwd\")"); SendRecvMsg("UNIX_Hostname()");
gap> SendRecvMsg("UNIX_Getpid()");

10.

If the connection dies at random, after some period of time: You can experiment with SO_KEEPALIVE and variants. (See man setsockopt.) This periodically sends null messages so the remote machine does not think that the originating machine is dead. However, if the remote machine fails to reply, the local process sends a SIGPIPE signal to notify current processes of a broken socket, even though there might have been only a temporary lapse in connectivity. ssh specifies KeepAlive yes by default, but setting KeepAlive no might get you through some transient lapses in connectivity due to high congestion. You may also want to experiment with: setenv RSH "rsh -n"

11.

Read the documentation for further possible problems.

1.8 Problems with Hosts on Multiple Networks

If a host is on multiple networks, it will have multiple IP addresses and usually multiple hostnames. In this case, the master process cannot always guess correctly which IP address (which internet address) should be passed to the slave process, so that the slave process can call back to the master. In such cases, you may need to tell ParGAP which hostname or IP address to use for the callback. This is done by setting the UNIX environment variable, CALLBACK_HOST, as in the example below.

# [ in sh/bash/... ]
CALLBACK_HOST=denali.ccs.neu.edu; export CALLBACK_HOST
# [ in csh/tcsh/... ]
setenv CALLBACK_HOST denali.ccs.neu.edu

The appropriate line for your shell can be placed in your shell initialization file. Alternatively, you can set this up for all users by placing the Bourne shell version (for sh) somewhere between the first and last line of .../pkg/pargap/bin/pargap.sh.
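If the variable is set for all users in this way, it may be desirable to leave an individual user's own setting untouched. One way to do that in Bourne shell is sketched below; default_callback_host is a hypothetical helper, not something pargap.sh provides.

```shell
#!/bin/sh
# Hypothetical helper: give CALLBACK_HOST a site-wide default without
# overriding a value the user has already set.
default_callback_host() {
    : "${CALLBACK_HOST:=$1}"     # assign only if unset or empty
    export CALLBACK_HOST
}

default_callback_host denali.ccs.neu.edu
echo "slaves will call back to: $CALLBACK_HOST"
```

With this form, a user who needs a different callback address can export CALLBACK_HOST before starting ParGAP, and the site-wide default applies to everyone else.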

1.9 Problems with Passwords (Getting Around Security)

There is a simple test to see if you need to read this section. Pick a remote machine, HOSTNAME, that you wish to execute on, and type: rsh HOSTNAME. If this did not work, also try ssh HOSTNAME. If you were asked for your password, then you and your system administrator may need to talk about security policy. If you were successful with ssh and not with rsh then set the environment variable, RSH, to the value ssh, as described in item 3 below.

(1)

Ask your systems administrator to put the machines in a hosts.equiv file, so that logging in from one to the other does not require a password. (man hosts.equiv)

(2)

Add a .rhosts file to your home directory (or .shosts for ssh).

(3)

Hack around the problem: By default, the startup script uses rsh to start remote processes. However, if the environment variable RSH was set, the script uses the value of the environment variable instead of rsh. This may be useful, if you have your own script, myrsh, that automatically gets around the security issues. Then just type:

RSH=myrsh; export RSH  # [ in sh/bash/... ]
setenv RSH myrsh       # [ in csh/tcsh/... ]

(4)

ssh: man ssh mentions some possibilities for giving the password the first time, and then having ssh remember that future logins to that machine are authorized for the duration of the session. Don't overlook the use of $HOME/.ssh/config to set special parameters, such as specifying a different login name on the remote machine. Some parameters of interest might be KeepAlive, RSAAuthentication, UseRsh. You may also find useful information in man sshd.

(5)

After starting ParGAP, manually call

/tmp/pargapmpi--rsh.$$

and repeatedly type in the password for each slave process. If you find yourself doing this, you may want to talk with your system administrator, since it actually hurts system security to have you repeatedly typing passwords with a concomitant risk that someone else will find out your password.
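The fallback rule of item (3), where the value of RSH is used when it is set and rsh otherwise, corresponds to a common shell idiom. The sketch below shows that idiom in isolation; choose_remote_shell is a hypothetical name, the actual startup script may be written differently, and this version also treats an empty RSH as unset.

```shell
#!/bin/sh
# Sketch of the fallback rule: use $RSH when it is set and non-empty, otherwise rsh.
# choose_remote_shell is a hypothetical name, not a function in pargap.sh.
choose_remote_shell() {
    echo "${RSH:-rsh}"
}

unset RSH
echo "default:  $(choose_remote_shell)"   # prints: default:  rsh
RSH=ssh; export RSH
echo "override: $(choose_remote_shell)"   # prints: override: ssh
```

The same mechanism covers both the RSH=ssh case and a private wrapper such as myrsh, since the startup script only needs the name of a program to run.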

1.10 Modifying the GAP kernel

Note that this package modifies the GAP src and bin files, and creates a new GAP kernel. This new GAP kernel can be shared by traditional users of the old, sequential GAP kernel, and by those doing parallel processing.

The GAP kernel will have identical behavior to the old GAP kernel when invoked through the gap.sh script or the bin/ARCH/gap binary (where ARCH depends on your system). The new ParGAP variables will appear to the end user ONLY if the GAP binary was invoked as pargapmpi: a symbolic link to the actual GAP binary. The script, pargap.sh, does this.

So, in a multi-user environment, traditional users can continue to use gap.sh without noticing any difference. Only an invocation of pargap.sh will add the new features.

In a future version of GAP, it is hoped that the GAP kernel will have enough ``hooks'', so that no modification of the GAP kernel is required. At that time, it will also be possible to speed up the startup time for ParGAP. Much of the startup time is caused by waiting for GAP to read its library files. It will be possible to use the GAP function, SaveWorkspace() to save a version with the GAP library pre-loaded. That saved version can then be used to start up ParGAP. This is not currently possible, because ParGAP needs to get at the command line of GAP before the GAP kernel sees it.

Comments and contributions to a ParGAP user library, or any other type of assistance, are gratefully accepted.

Gene Cooperman gene@ccs.neu.edu

ParGAP manual
May 2002