On Tue, 22 Apr 2003, Pak, Anne O wrote:
> I am calling MPI_Comm_spawn in a main subroutine on the master node
> (node00) and wish to spawn processes on Node01, 02, 03 and 04. But it
> seems that invoking MPI_Comm_spawn also spawns a process on Node00 as
> well. Is that normal???
Yes. See the man page for MPI_Comm_spawn(3). You can use the "file" info
key to provide a LAM application schema (see mpirun(1) and app_schema(5)
for details) to control where your processes get launched. By default,
they start on n0 and go in a round-robin fashion.
Note that this is changed in the forthcoming LAM 7.0 release so that you
have more direct control without having to write out an app schema file.
Here's a snippet from MPI_Comm_spawn(3) in LAM 7.0:
Resource Allocation
LAM/MPI offers some MPI_Info keys for the placement of spawned
applications. Keys are looked for in the order listed below.
The first key that is found is used; any remaining keys are
ignored.
lam_spawn_file
The value of this key can be the filename of an appschema(1).
This allows the programmer to specify an arbitrary set of LAM
CPUs or nodes to spawn MPI processes on. In this case, only
the appschema is used to spawn the application; command, argv,
and maxprocs are all ignored (even at the root). Note that
even though maxprocs is ignored, errcodes must still be an
array long enough to hold an integer error code for every
process that tried to launch, or be the MPI constant
MPI_ERRCODES_IGNORE. Also note that MPI_Comm_spawn_multiple
does not accept the "lam_spawn_file" info key. As such, the
"lam_spawn_file" info key to MPI_Comm_spawn is mainly intended
to spawn MPMD applications and/or specify an arbitrary number
of nodes to run on.
Also note that this "lam_spawn_file" key is not portable to
other MPI implementations; it is a LAM/MPI-specific info key.
If specifying exact LAM nodes or CPUs is not necessary, users
should probably use MPI_Comm_spawn_multiple to make their
program more portable.
file
This key is a synonym for "lam_spawn_file". Since "file" is
not a LAM-specific name, yet this key carries a LAM-specific
meaning, its use is deprecated in favor of "lam_spawn_file".
lam_spawn_sched_round_robin
The value of this key is a string representing a LAM CPU or
node (using standard LAM nomenclature -- see mpirun(1)) to
begin spawning on. The use of this key allows the programmer
to indicate which node/CPU for LAM to start spawning on without
having to write out a temporary app schema file.
The CPU number is relative to the boot schema given to
lamboot(1). Only a single LAM node/CPU may be specified, such
as "n3" or "c1". If a node is specified, LAM will spawn one
MPI process per node. If a CPU is specified, LAM will schedule
one MPI process per CPU. An error is returned if "N" or "C" is
used.
Note that LAM is not involved with run-time scheduling of the
MPI process -- LAM only spawns processes on indicated nodes.
The operating system schedules these processes for execution
just like any other process. No attempt is made by LAM to bind
processes to CPUs. Hence, the "cX" nomenclature is just a
convenience mechanism to indicate how many MPI processes
should be spawned on a given node; it is not indicative of
operating system scheduling.
For "nX" values, the first MPI process will be spawned on the
indicated node. The remaining (maxprocs - 1) MPI processes
will be spawned on successive nodes. Specifically, if X is the
starting node number, process i will be launched on "nK", where
K = ((X + i) % total_nodes). LAM takes the node number modulo
the total number of nodes in the current LAM universe to
prevent errors, thereby creating a "wraparound" effect. Hence,
this mechanism can be used for round-robin scheduling,
regardless of how many nodes are in the LAM universe.
For "cX" values, the algorithm is essentially the same, except
that LAM will resolve "cX" to a specific node before spawning,
and successive processes are spawned on the node where "cK"
resides, where K = ((X + i) % total_cpus).
For example, if there are 8 nodes and 16 CPUs in the current
LAM universe (2 CPUs per node), a "lam_spawn_sched_round_robin"
key is given with the value of "c14", and maxprocs is 4, LAM
will spawn MPI processes on
CPU   Node   MPI_COMM_WORLD rank
---   ----   -------------------
c14   n7     0
c15   n7     1
c0    n0     2
c1    n0     3
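To make the wraparound arithmetic concrete, here is a small C sketch of the two formulas quoted above. It is only a model of the documented behavior, not LAM's actual code. The function names are made up for illustration, and node_of_cpu() assumes that every node has the same number of CPUs and that CPUs are numbered node by node, as in the 8-node/16-CPU example.

```c
#include <assert.h>
#include <stdio.h>

/* Node for spawned process i when the key value is "nX" (X = start_node):
   K = (X + i) % total_nodes. */
int node_for_process(int start_node, int i, int total_nodes)
{
    assert(total_nodes > 0 && start_node >= 0 && i >= 0);
    return (start_node + i) % total_nodes;
}

/* CPU for spawned process i when the key value is "cX" (X = start_cpu):
   K = (X + i) % total_cpus. */
int cpu_for_process(int start_cpu, int i, int total_cpus)
{
    assert(total_cpus > 0 && start_cpu >= 0 && i >= 0);
    return (start_cpu + i) % total_cpus;
}

/* ASSUMPTION (not from the man page): uniform CPU count per node and
   CPUs numbered consecutively node by node, so c0,c1 -> n0, c2,c3 -> n1, ... */
int node_of_cpu(int cpu, int cpus_per_node)
{
    assert(cpus_per_node > 0 && cpu >= 0);
    return cpu / cpus_per_node;
}

/* Print one "CPU  node  rank" line per spawned process for a "cX" start. */
void print_cpu_placement(int start_cpu, int maxprocs,
                         int total_cpus, int cpus_per_node)
{
    for (int i = 0; i < maxprocs; ++i) {
        int cpu = cpu_for_process(start_cpu, i, total_cpus);
        printf("c%-4d n%-4d %d\n", cpu, node_of_cpu(cpu, cpus_per_node), i);
    }
}
```

Calling print_cpu_placement(14, 4, 16, 2) walks c14, c15, c0, c1 in that order, which matches the table above.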
No keys given
If none of the info keys listed above are used, the value of
MPI_INFO_NULL should be given for info (all other keys are
ignored, anyway - there is no harm in providing other keys).
In this case, LAM schedules the given number of processes onto
LAM nodes by starting with CPU 0 (or the lowest numbered CPU),
and continuing through higher CPU numbers, placing one process
on each CPU. If the process count is greater than the CPU
count, the procedure repeats.
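The default case can be modeled the same way. The sketch below counts how many spawned processes land on a given CPU when LAM fills CPUs from the lowest number upward and repeats. It is an illustration of the description above, not LAM's implementation. The function name is hypothetical, and it assumes CPUs are numbered 0 through total_cpus - 1.

```c
#include <assert.h>

/* How many of nprocs spawned processes land on `cpu` under the default
   placement: one process per CPU starting at CPU 0, repeating when the
   process count exceeds the CPU count.
   ASSUMPTION: CPUs are numbered 0 .. total_cpus - 1. */
int procs_on_cpu(int cpu, int nprocs, int total_cpus)
{
    assert(total_cpus > 0 && nprocs >= 0);
    assert(cpu >= 0 && cpu < total_cpus);

    int full_passes = nprocs / total_cpus;  /* complete sweeps over all CPUs */
    int leftover    = nprocs % total_cpus;  /* processes in the final partial sweep */

    return full_passes + (cpu < leftover ? 1 : 0);
}
```

With five single-CPU nodes (an assumption about the setup in question) and four spawned processes, procs_on_cpu(0, 4, 5) is 1 and procs_on_cpu(4, 4, 5) is 0: the first spawned process lands on n0 and n4 gets nothing, which is the behavior described in the original question.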
Predefined Attributes
The pre-defined attribute on MPI_COMM_WORLD, MPI_UNIVERSE_SIZE,
can be useful in determining how many CPUs are currently
unused. For example, the value in MPI_UNIVERSE_SIZE is the
number of CPUs that LAM was booted with (see MPI_Init(1)).
Subtracting the size of MPI_COMM_WORLD from this value returns
the number of CPUs in the current LAM universe that the current
application is not using (and are therefore likely not being
used).
That's probably more information than you were looking for, but I hope
it's helpful. :-)
> I am looking to call some collective communication functions such as
> MPI_Bcast and MPI_Scatter, and have used the technique of merging
> the intercommunicator, created by MPI_Comm_spawn, into an
> INTRAcommunicator that I can use with MPI_Bcast, MPI_Scatter;
> however, since Node00 is in BOTH the local and remote groups of the
> intercommunicator, the intracommunicator created via
> MPI_Intercomm_merge lists Node00 twice. Hence, when I call say
> MPI_Bcast, Node00 gets two copies of the input. Is there a way of
> suppressing Node00 from the spawnED list (i.e. the remote group)?
Keep in mind the difference between nodes and processes. In the
example that you're citing, you originally have a process running on
Node00 (call it process A). You then spawn 4 processes, which end up
on 00, 01, 02, and 03 -- call these processes B, C, D, and E,
respectively. But remember that even though there are two processes
on Node00, they are distinct from each other -- A and B are on the
same node, but they are different processes. Hence, you can do
sends/receives/collectives with them and you will get correct answers.
It's not like process A is now in the communicator twice.
Make sense?
--
{+} Jeff Squyres
{+} jsquyres_at_[hidden]
{+} http://www.lam-mpi.org/