On Fri, 9 May 2003 enrico.sirola_at_[hidden] wrote:
> I'm trying to port a PVM application to LAM. It is a distributed Monte
> Carlo application which works as follows, using a master/slave paradigm
> and point-to-point communication with the slaves: the master is a
> Python script, which reads data from different data sources and then
> instantiates the "true" PVM master application, a C++ class wrapped for
> Python. The C++ extension class then starts the PVM machinery and
> begins spawning slaves, telling each slave how many paths to compute.
> When a slave finishes its job, it reports to the master and exits, the
> master spawns a new slave, and so on until the calculation is finished.
Sounds reasonable.
> While trying to port this application to lam, I used the following
> approach:
>
> 0. the master starts
> 1. then spawns a number of processes
> 2. the master merges the just created intercommunicator with its
> MPI::COMM_WORLD
> 3. the master sends initialization data to the clients, using
> the new intracommunicator
> 4. the client(s) start calculating, while the master blocks in an
> MPI_Recv()
You might want to change this step a little -- from your description, it
sounds like this will force a serialization of the process. You might
want to use MPI_Irecv and get an MPI_Request out of it. Then you can go
do other things while the client is computing (e.g., go spawn more
clients), and periodically do an MPI_Test or MPI_Wait to see if the
request has finished (presumably you'll build up an array of MPI_Requests,
one for each outstanding client, or perhaps just a single receive using
MPI_ANY_SOURCE).
Just a suggestion.
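In case a concrete sketch helps, something along these lines -- the
function names, the 64-byte result buffer, and the use of MPI_Testany
are just my illustration, not anything from your code:

    /* Sketch only: one outstanding MPI_Irecv per spawned slave, polled
       with MPI_Testany so the master never blocks. */
    #include <mpi.h>

    #define MAX_SLAVES 16

    static MPI_Comm    slave_comm[MAX_SLAVES];     /* comm used to talk to slave i */
    static MPI_Request slave_req[MAX_SLAVES];      /* its outstanding receive      */
    static char        slave_buf[MAX_SLAVES][64];  /* its result buffer            */

    /* After spawning (and merging, if you like) slave i, post a
       non-blocking receive for its result instead of calling MPI_Recv. */
    static void post_result_recv(int i)
    {
        MPI_Irecv(slave_buf[i], 64, MPI_BYTE, MPI_ANY_SOURCE, MPI_ANY_TAG,
                  slave_comm[i], &slave_req[i]);
    }

    /* In the master's main loop: see if any slave has reported back.
       Returns that slave's index, or -1 if nothing has finished yet. */
    static int poll_slaves(int nslaves)
    {
        int index, flag;
        MPI_Status status;

        MPI_Testany(nslaves, slave_req, &index, &flag, &status);
        if (flag && index != MPI_UNDEFINED) {
            /* slave_buf[index] now holds that slave's result; spawn a
               replacement slave here. */
            return index;
        }
        return -1;
    }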
> 5. one of the clients finishes its job, and issues a MPI_Send
> 6. the master receives the calculated data
> 7. the client returns from the MPI_Send(), then exits
> 8. return to 1, spawning only 1 new process.
>
> The problem is that the master crashes just after 8. Here is what I
> get if I print some log messages to the console:
>
> sending exit code <--- slave is going to exit
> sending to 0 <--- sending exit code to master
> sent to 0 <--- sent
> received message from 1 of size 65 <--- master is receiving exit code
> reveived from 1 <--- done
> Process 0 finished on sirola01.ge.risk <--- master collecting
> results from slave
> 2 paths added
> 0 failures so far
> sent exit code <--- slave before MPI_Finalize()
> CHILD EXITING <--- slave after MPI_Finalize()
> MPI_Recv: process in local group is dead (rank 0, comm 3) <--- lam complains about a process exiting?
> spawning 1 child(s) with tag 1... from tag 0Rank (0, MPI_COMM_WORLD): <--- master tries to spawn a new process, and crashes (lam traceback follows)
> Call stack within LAM:
> Rank (0, MPI_COMM_WORLD): - MPI_Recv()
> Rank (0, MPI_COMM_WORLD): - MPI_Reduce()
> Rank (0, MPI_COMM_WORLD): - MPI_Comm_spawn()
> Rank (0, MPI_COMM_WORLD): - main()
>
> At this point, I am clueless. Does LAM support this behavior in
> master/slave programs, i.e. slaves exiting before the master and the
> master then spawning new slaves? I hope some LAM guru (or maybe just
> someone who isn't a newbie like I am) will help.
Yes, LAM supports this.
However, it sounds like you are using MPI slightly incorrectly. What
communicator are you using in the spawn call? The specific error
message that LAM is giving you is saying that you are spawning over a
communicator that now has a dead process in it. I suspect that this
means that you're using a communicator that used to contain a child,
but that child has now exited.
The overriding principle here is that once a child exits, you can
never use a communicator that contained it again. Better yet, you
should free the intercommunicator that you got back from
MPI_Comm_spawn (and any other communicators that you made from it)
when the child exits. That way, the parent effectively has no memory
of that child.
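To make that concrete, here is a rough sketch of one master iteration;
the "slave" program name, the 64-byte result buffer, and spawning over
MPI_COMM_SELF are just my assumptions for illustration:

    #include <mpi.h>

    /* One spawn/compute/collect cycle.  Every communicator that
       contains the child is freed before the next spawn, so no later
       call ever sees a dead process in its communicator. */
    static void run_one_slave(void)
    {
        MPI_Comm   inter, intra;
        char       result[64];
        MPI_Status status;

        /* Spawn over MPI_COMM_SELF (or MPI_COMM_WORLD) -- never over a
           communicator that once contained a now-dead child. */
        MPI_Comm_spawn("slave", MPI_ARGV_NULL, 1, MPI_INFO_NULL, 0,
                       MPI_COMM_SELF, &inter, MPI_ERRCODES_IGNORE);

        /* Merge into an intracomm for convenience (the child must call
           MPI_Intercomm_merge on its side as well). */
        MPI_Intercomm_merge(inter, 0, &intra);

        /* ... send initialization data, receive the result ... */
        MPI_Recv(result, 64, MPI_BYTE, MPI_ANY_SOURCE, MPI_ANY_TAG,
                 intra, &status);

        /* The child is about to exit: drop every communicator that
           contains it before spawning the next child. */
        MPI_Comm_free(&intra);
        MPI_Comm_free(&inter);
    }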
Hope that helps.
--
{+} Jeff Squyres
{+} jsquyres_at_[hidden]
{+} http://www.lam-mpi.org/