LAM/MPI General User's Mailing List Archives

From: petrovic_at_[hidden]
Date: 2005-04-01 09:21:24


Hello all.

I'm struggling with something that seems to be a familiar topic on this mailing
list. Any help would be appreciated.

I'm trying to have a 'master' program start a number of 'slave' programs through
a series of spawn calls. (I know I can spawn multiple programs with one call to
spawn or spawn_multiple, but for other reasons I must do it this way.)

The general problem is getting a single intracommunicator that includes all of
the processes. I understand that I can use spawn and intercomm_merge, and that
both calls are collective. This works fine except when I run on certain nodes of
the cluster I am working on; from some logging, it seems that two processes end
up believing they have the same rank in a given intracomm.

Here are the steps:

**master**
use MPI_COMM_SELF as the starting intracomm
loop begin
    (notify existing processes to collectively spawn/merge)
    spawn a process using intracomm
    merge the returned intercomm (from the spawn) into intracomm
loop end

**slave**
merge the parent intercomm into intracomm
loop begin
    if notified, spawn (using intracomm)
    merge (using the intercomm returned from the spawn) into intracomm
loop end

Also, the master changes the "lam_spawn_sched_round_robin" key before each
spawn, in case that matters.

Any ideas?
Thanks in advance!
--dp
