>>>>> "Jeff" == Jeff Squyres <jsquyres_at_[hidden]> writes:
[...]
Jeff> Sounds reasonable.
Very well :)
>> While trying to port this application to lam, I used the
>> following approach:
>>
>> 0. the master starts
>> 1. then spawns a number of processes
>> 2. the master merges the just created intercommunicator with
>>    its MPI::COMM_WORLD
>> 3. the master sends initialization data to the clients, using
>>    the new intracommunicator
>> 4. the client(s) start calculating, while the master blocks in
>>    a MPI_Recv()
Jeff> You might want to change this step a little -- from your
Jeff> description, it sounds like this will force a serialization
Jeff> of the process. You might want to use MPI_Irecv and get an
Jeff> MPI_Request out of it. Then you can go do other things
Jeff> while the client is computing (e.g., go spawn more clients),
Jeff> and periodically do an MPI_Test or MPI_Wait to see if the
Jeff> request has finished (presumably, you'll build up an array of
Jeff> MPI_Requests, one for each outstanding client, or perhaps
Jeff> just a single receive using MPI_ANY_SOURCE).
Jeff> Just a suggestion.
Actually, I issue an MPI_Recv using MPI_ANY_SOURCE. If I have a cluster
with N processors, I spawn N slaves, so the master blocks doing
nothing while the N slaves use all the computing power.
[...]
>> At this point, I am clueless. Does lam support this behavior
>> from the master/slave programs, i.e. slaves exiting before
>> master and then master spawning new slaves? I hope some lam
>> guru (or maybe just someone who isn't a newbie like I am) will
>> help.
Jeff> Yes, LAM supports this.
Jeff> However, it sounds like you are using MPI slightly
Jeff> incorrectly. What communicator are you using in the spawn
Jeff> call? The specific error message that LAM is giving you is
Jeff> saying that you are spawning over a communicator that now
Jeff> has a dead process in it. I suspect that this means that
Jeff> you're using a communicator that used to contain a child,
Jeff> but that child has now exited.
Jeff> The overriding principle here is that once a child exits,
Jeff> you can never use that communicator again. Even better --
Jeff> you should probably free the intercommunicator that you got
Jeff> back from MPI_Comm_spawn (and any other communicators that
Jeff> you made from it) when the child exits. That way, the
Jeff> parent effectively has no memory of that child.
Ah! This sounds very interesting. What I did was just to use the "old"
communicator to spawn the new process...
Yesterday I tried using the intercommunicators with the slaves
directly, instead of merging them into the master's MPI::COMM_SELF
intracomm. I then create an MPI::Request (using Irecv) for each slave
and test for request completion, and this approach seems to work.
Thanks a lot for your help,
enrico
--
Enrico Sirola <enrico.sirola_at_[hidden]>
gpg public key available from www.keyserver.net, Key ID 0x377FE07F
Key fingerprint = B446 7332 ED55 BC68 5FE8 DE0F 98DF EC86 377F E07F