
LAM/MPI General User's Mailing List Archives


From: enrico.sirola_at_[hidden]
Date: 2003-05-13 03:02:00


>>>>> "Jeff" == Jeff Squyres <jsquyres_at_[hidden]> writes:

[...]

    Jeff> Sounds reasonable.

very well :)

>> While trying to port this application to lam, I used the
>> following approach:
>>
>> 0. the master starts
>> 1. then it spawns a number of processes
>> 2. the master merges the just created intercommunicator with
>>    its MPI::COMM_WORLD
>> 3. the master sends initialization data to the clients, using
>>    the new intracommunicator
>> 4. the client(s) start calculating, while the master blocks in
>>    a MPI_Recv()

    Jeff> You might want to change this step a little -- from your
    Jeff> description, it sounds like this will force a serialization
    Jeff> of the process. You might want to use MPI_Irecv and get an
    Jeff> MPI_Request out of it. Then you can go do other things
    Jeff> while the client is computing (e.g., go spawn more clients),
    Jeff> and periodically do an MPI_Test or MPI_Wait to see if the
    Jeff> request has finished (assumedly, you'll build up an array of
    Jeff> MPI_Requests, one for each outstanding client, or perhaps
    Jeff> just a single receive using MPI_ANY_SOURCE).

    Jeff> Just a suggestion.

Actually, I issue an MPI_Recv using MPI_ANY_SOURCE. If I have a cluster
with N processors, I spawn N slaves, so the master blocks doing
nothing while the N slaves use all the computing power.

[...]

>> At this point, I am clueless. Does lam support this behavior
>> from the master/slave programs, i.e. slaves exiting before the
>> master and then the master spawning new slaves? I hope some lam
>> guru (or maybe just someone who isn't a newbie like I am) will
>> help.

    Jeff> Yes, LAM supports this.

    Jeff> However, it sounds like you are using MPI slightly
    Jeff> incorrectly. What communicator are you using in the spawn
    Jeff> call? The specific error message that LAM is giving you is
    Jeff> saying that you are spawning over a communicator that now
    Jeff> has a dead process in it. I suspect that this means that
    Jeff> you're using a communicator that used to contain a child,
    Jeff> but that child has now exited.

    Jeff> The overriding principle here is that once a child exits,
    Jeff> you can never use that communicator again. Even better --
    Jeff> you should probably free the intercommunicator that you got
    Jeff> back from MPI_Comm_spawn (and any other communicators that
    Jeff> you made from it) when the child exits. That way, the
    Jeff> parent effectively has no memory of that child.

Ah! This sounds very interesting. What I did was just reuse the "old"
communicator to spawn the new process...
Yesterday I tried using the intercommunicators with the slaves
directly, instead of merging them into the master's MPI::COMM_SELF
intracomm. I then create an MPI::Request (using Irecv) for each slave
and test for request completion, and this approach seems to work.
Thanks a lot for your help,
enrico

-- 
Enrico Sirola <enrico.sirola_at_[hidden]>
gpg public key available from www.keyserver.net, Key ID 0x377FE07F
Key fingerprint = B446 7332 ED55 BC68 5FE8  DE0F 98DF EC86 377F E07F