LAM/MPI General User's Mailing List Archives

From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2003-04-22 21:41:22


On Tue, 22 Apr 2003, Pak, Anne O wrote:

> I am calling MPI_Comm_spawn in a main subroutine on the master node
> (node00) and wish to spawn processes on Node01, 02, 03 and 04. But it
> seems that invoking MPI_Comm_spawn also spawns a process on Node00 as
> well. Is that normal???

Yes. See the man page for MPI_Comm_spawn(3). You can use the "file" info
key to provide a LAM application schema (see mpirun(1) and app_schema(5)
for details) to control where your processes get launched. By default,
they start on n0 and go in a round-robin fashion.

Note that this is changed in the forthcoming LAM 7.0 release so that you
have more direct control without having to write out an app schema file.
Here's a snippet from MPI_Comm_spawn(3) in LAM 7.0:

       Resource Allocation

       LAM/MPI offers some MPI_Info keys for the placement of spawned
       applications. Keys are looked for in the order listed below.
       The first key that is found is used; any remaining keys are
       ignored.

       lam_spawn_file

       The value of this key can be the filename of an appschema(1).
       This allows the programmer to specify an arbitrary set of LAM
       CPUs or nodes to spawn MPI processes on. In this case, only
       the appschema is used to spawn the application; command, argv,
       and maxprocs are all ignored (even at the root). Note that
       even though maxprocs is ignored, errcodes must still be an
       array long enough to hold an integer error code for every
       process that tried to launch, or be the MPI constant
       MPI_ERRCODES_IGNORE. Also note that MPI_Comm_spawn_multiple
       does not accept the "lam_spawn_file" info key. As such, the
       "lam_spawn_file" info key to MPI_Comm_spawn is mainly intended
       to spawn MPMD applications and/or specify an arbitrary number
       of nodes to run on.

       Also note that this "lam_spawn_file" key is not portable to
       other MPI implementations; it is a LAM/MPI-specific info key.
       If specifying exact LAM nodes or CPUs is not necessary, users
       should probably use MPI_Comm_spawn_multiple to make their
       program more portable.

       file

       This key is a synonym for "lam_spawn_file". Since "file" is
       not a LAM-specific name, yet this key carries a LAM-specific
       meaning, its use is deprecated in favor of "lam_spawn_file".

       lam_spawn_sched_round_robin

       The value of this key is a string representing a LAM CPU or
       node (using standard LAM nomenclature -- see mpirun(1)) to
       begin spawning on. The use of this key allows the programmer
       to indicate which node/CPU for LAM to start spawning on without
       having to write out a temporary app schema file.

       The CPU number is relative to the boot schema given to
       lamboot(1). Only a single LAM node/CPU may be specified, such
       as "n3" or "c1". If a node is specified, LAM will spawn one
       MPI process per node. If a CPU is specified, LAM will schedule
       one MPI process per CPU. An error is returned if "N" or "C" is
       used.
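
To illustrate the rule above (a single "nX" or "cX" value is accepted, while "N" and "C" are rejected), here is a minimal sketch of a validator in C. The function name spawn_start_index is hypothetical and is not part of LAM/MPI; it only mirrors the rule as described in the man page text and says nothing about how LAM parses the value internally.

```c
#include <assert.h>
#include <ctype.h>
#include <limits.h>
#include <stdlib.h>

/* Hypothetical helper, not a LAM/MPI function.
 * kind must be 'n' (node) or 'c' (CPU).
 * Returns X if value is exactly kind followed by decimal digits
 * (for example "n3" or "c14"); returns -1 for anything else,
 * including "N", "C", ranges such as "n0-3", and the other kind. */
int spawn_start_index(const char *value, char kind)
{
    char *end;
    long x;

    assert(kind == 'n' || kind == 'c');
    if (value == NULL || value[0] != kind)
        return -1;
    if (!isdigit((unsigned char)value[1]))
        return -1;              /* no digits after the prefix */
    x = strtol(value + 1, &end, 10);
    if (*end != '\0' || x > INT_MAX)
        return -1;              /* trailing characters or too large */
    return (int)x;
}
```

With this sketch, spawn_start_index("n3", 'n') yields 3, while "N", "C", and "n0-3" are all rejected.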

       Note that LAM is not involved with run-time scheduling of the
       MPI process -- LAM only spawns processes on indicated nodes.
       The operating system schedules these processes for execution
       just like any other process. No attempt is made by LAM to bind
       processes to CPUs. Hence, the "cX" nomenclature is just a
       convenience mechanism to indicate how many MPI processes
       should be spawned on a given node; it is not indicative of
       operating system scheduling.

       For "nX" values, the first MPI process will be spawned on the
       indicated node. The remaining (maxprocs - 1) MPI processes
       will be spawned on successive nodes. Specifically, if X is the
       starting node number, process i will be launched on "nK", where
       K = ((X + i) % total_nodes). LAM takes the node number modulo
       the total number of nodes in the current LAM universe to
       prevent errors, thereby creating a "wraparound" effect. Hence,
       this mechanism can be used for round-robin scheduling,
       regardless of how many nodes are in the LAM universe.
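
The "nX" placement formula above can be written out as a small C function. This is only a sketch of the arithmetic K = ((X + i) % total_nodes); the name rr_node is hypothetical and does not exist in LAM/MPI.

```c
#include <assert.h>

/* Hypothetical helper, not a LAM/MPI function.
 * start_node  : X from the "nX" value
 * i           : index of the spawned process (0 .. maxprocs - 1)
 * total_nodes : number of nodes in the current LAM universe
 * Returns K, the node that process i lands on. */
int rr_node(int start_node, int i, int total_nodes)
{
    assert(total_nodes > 0 && start_node >= 0 && i >= 0);
    return (start_node + i) % total_nodes;
}
```

For instance, with 8 nodes and a start value of "n6", processes 0 through 3 land on n6, n7, n0, and n1.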

       For "cX" values, the algorithm is essentially the same, except
       that LAM will resolve "cX" to a specific node before spawning,
       and successive processes are spawned on the node where "cK"
       resides, where K = ((X + i) % total_cpus).

       For example, if there are 8 nodes and 16 CPUs in the current
       LAM universe (2 CPUs per node), a "lam_spawn_sched_round_robin"
       key is given with the value of "c14", and maxprocs is 4, LAM
       will spawn MPI processes on

       CPU   Node   MPI_COMM_WORLD rank
       ---   ----   -------------------
       c14   n7     0
       c15   n7     1
       c0    n0     2
       c1    n0     3
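
The table above can be reproduced with a short C sketch. Both function names are hypothetical, and cpu_to_node assumes that every node has the same number of CPUs and that CPUs are numbered consecutively node by node (c0 and c1 on n0, c2 and c3 on n1, and so on), which matches this example but need not hold for every boot schema.

```c
#include <assert.h>

/* Hypothetical helpers, not LAM/MPI functions. */

/* CPU index K for spawned process i: K = ((X + i) % total_cpus). */
int rr_cpu(int start_cpu, int i, int total_cpus)
{
    assert(total_cpus > 0 && start_cpu >= 0 && i >= 0);
    return (start_cpu + i) % total_cpus;
}

/* Node holding a CPU, ASSUMING a uniform CPU count per node and
 * consecutive CPU numbering across nodes. */
int cpu_to_node(int cpu, int cpus_per_node)
{
    assert(cpus_per_node > 0 && cpu >= 0);
    return cpu / cpus_per_node;
}
```

Calling rr_cpu(14, i, 16) for i = 0..3 gives CPUs 14, 15, 0, and 1, and cpu_to_node with 2 CPUs per node maps them to n7, n7, n0, and n0.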

       No keys given

       If none of the info keys listed above are used, the value of
       MPI_INFO_NULL should be given for info (all other keys are
       ignored, anyway - there is no harm in providing other keys).
       In this case, LAM schedules the given number of processes onto
       LAM nodes by starting with CPU 0 (or the lowest numbered CPU),
       and continuing through higher CPU numbers, placing one process
       on each CPU. If the process count is greater than the CPU
       count, the procedure repeats.
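
As a rough illustration of the default placement described above, the following C sketch counts how many processes end up on a given CPU. It assumes CPUs are numbered 0 through total_cpus - 1 and that "the procedure repeats" means wrapping back to the lowest-numbered CPU; the function name is hypothetical.

```c
#include <assert.h>

/* Hypothetical helper, not a LAM/MPI function.
 * Number of processes placed on CPU `cpu` when nprocs processes are
 * handed out one per CPU starting at CPU 0, wrapping around as needed. */
int default_procs_on_cpu(int cpu, int nprocs, int total_cpus)
{
    int full_passes, leftover;

    assert(total_cpus > 0 && nprocs >= 0);
    assert(cpu >= 0 && cpu < total_cpus);
    full_passes = nprocs / total_cpus;   /* complete sweeps over all CPUs */
    leftover = nprocs % total_cpus;      /* processes in the final partial sweep */
    return full_passes + (cpu < leftover ? 1 : 0);
}
```

With 4 CPUs and 5 processes, CPU 0 receives two processes and CPUs 1 through 3 receive one each.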

       Predefined Attributes

       The pre-defined attribute on MPI_COMM_WORLD, MPI_UNIVERSE_SIZE,
       can be useful in determining how many CPUs are currently
       unused. For example, the value in MPI_UNIVERSE_SIZE is the
       number of CPUs that LAM was booted with (see MPI_Init(1)).
       Subtracting the size of MPI_COMM_WORLD from this value returns
       the number of CPUs in the current LAM universe that the current
       application is not using (and are therefore likely not being
       used).

That's probably more information than you were looking for, but I hope
it's helpful. :-)

> I am looking to call some collective communication functions such as
> MPI_Bcast and MPI_Scatter, and have used the technique of merging
> the intercommunicator, created by MPI_Comm_spawn, into an
> INTRAcommunicator that I can use with MPI_Bcast, MPI_Scatter;
> however, since Node00 is in BOTH the local and remote groups of the
> intercommunicator, the intracommunicator created via
> MPI_Intercomm_merge lists Node00 twice. Hence, when I call say
> MPI_Bcast, Node00 gets two copies of the input. Is there a way of
> suppressing Node00 from the spawnED list (i.e. the remote group)?

Keep in mind the difference between nodes and processes. In the
example that you're citing, you originally have a process running on
Node00 (call it process A). You then spawn 4 processes, which end up
on 00, 01, 02, and 03 -- call these processes B, C, D, and E,
respectively. But remember that even though there are two processes
on Node00, they are distinct from each other -- A and B are on the
same node, but they are different processes. Hence, you can do
sends/receives/collectives with them and you will get correct answers.

It's not like process A is now in the communicator twice.

Make sense?

-- 
{+} Jeff Squyres
{+} jsquyres_at_[hidden]
{+} http://www.lam-mpi.org/