Hello all.
I'm struggling with something that seems to be a familiar topic on this mailing
list. Any help would be appreciated.
I'm trying to have a 'master' program start up a number of 'slave' programs
through a series of spawn calls. (I know I can spawn multiple programs with one
call to spawn or spawn_multiple, but for other reasons I must do it this way.)
The general problem is getting a single intracommunicator that includes the
whole bunch. I understand that I can use spawn and intercomm_merge, and that
both calls are collective. This seems to work fine except when I run on certain
nodes of the cluster I am working on; from some logging, it looks as if two
processes end up believing they have the same rank in a given intracomm.
Here are the steps:
**master**
use MPI_COMM_SELF as starting intracomm
loop begin
    (notify existing processes to collectively spawn/merge)
    spawn a process using intracomm
    merge the returned intercomm (from the spawn) into intracomm
loop end
**slave**
merge parent intercomm into intracomm
loop begin
    if notified, spawn (using intracomm)
    merge (using intercomm returned from spawn) into intracomm
loop end
Also, the master changes the "lam_spawn_sched_round_robin" key before each
spawn, in case that is relevant.
Any ideas?
Thanks in advance!
--dp