LAM/MPI General User's Mailing List Archives

From: enrico.sirola_at_[hidden]
Date: 2003-05-09 02:34:04


Hello,
I'm trying to port a PVM application to LAM. It is a distributed
Monte Carlo application that uses a master/slave paradigm with
point-to-point communication between the master and the slaves. It
works as follows: the master is a Python script, which reads data from
different data sources and then instantiates the "true" PVM master
application, which is a C++ class wrapped for Python. The C++ extension
class then starts the PVM machinery and begins spawning slaves, telling
each slave how many paths to compute. When a slave finishes its job, it
reports to the master and exits, so the master spawns a new slave, and
so on until the calculation has ended.
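
To make the scheduling concrete, here is a minimal sketch of the
master's bookkeeping only, with no PVM or MPI calls. The names
run_master, Schedule, max_slaves and batch_size are made up for this
illustration and are not taken from the real application, and slaves
are assumed to finish in the order they were spawned:

```cpp
#include <algorithm>
#include <cassert>
#include <queue>
#include <vector>

// Hypothetical bookkeeping for the master: records how many paths each
// spawned slave is told to compute. Spawning is only simulated here.
struct Schedule {
    std::vector<long> batches;  // paths assigned to each slave, in spawn order
    long total = 0;             // paths reported back by all slaves
};

Schedule run_master(long total_paths, int max_slaves, long batch_size) {
    assert(batch_size > 0);     // a zero batch would never make progress

    Schedule s;
    std::queue<long> running;   // paths held by each currently running slave
    long remaining = total_paths;

    // "Spawn" one slave and tell it how many paths to compute.
    auto spawn = [&]() {
        long n = std::min(batch_size, remaining);
        remaining -= n;
        running.push(n);
        s.batches.push_back(n);
    };

    // Initial wave: fill every slave slot while work remains.
    while (remaining > 0 && static_cast<int>(running.size()) < max_slaves) {
        spawn();
    }

    // Each time a slave reports and exits, spawn a replacement if
    // there are still paths left to compute.
    while (!running.empty()) {
        s.total += running.front();  // slave reports its result
        running.pop();               // slave exits
        if (remaining > 0) {
            spawn();
        }
    }
    return s;
}
```

For instance, run_master(10, 2, 3) would hand out batches of 3, 3, 3
and 1 paths to four successive slaves, with at most two running at any
one time.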

While trying to port this application to LAM, I used the following
approach:

0. the master starts
1. then spawns a number of processes
2. the master merges the just created intercommunicator with its
   MPI::COMM_WORLD
3. the master sends initialization data to the clients, using
   the new intracommunicator
4. the client(s) start calculating, while the master blocks in an
   MPI_Recv()
5. one of the clients finishes its job, and issues an MPI_Send()
6. the master receives the calculated data
7. the client returns from MPI_Send(), then exits
8. return to 1, spawning only 1 new process.

The problem is that the master crashes just after step 8. Here is what
I get if I print some log messages to the console:

sending exit code <--- slave is going to exit
sending to 0 <--- sending exit code to master
sent to 0 <--- sent
received message from 1 of size 65 <--- master is receiving exit code
reveived from 1 <--- done
Process 0 finished on sirola01.ge.risk <--- master collecting
                                              results from slave
        2 paths added
        0 failures so far
sent exit code <--- slave before MPI_Finalize()
CHILD EXITING <--- slave after MPI_Finalize()
MPI_Recv: process in local group is dead (rank 0, comm 3) <--- lam complains about a process exiting?
spawning 1 child(s) with tag 1... from tag 0Rank (0, MPI_COMM_WORLD): <--- master tries to spawn a new process, and crashes (lam traceback follows)
Call stack within LAM:
Rank (0, MPI_COMM_WORLD): - MPI_Recv()
Rank (0, MPI_COMM_WORLD): - MPI_Reduce()
Rank (0, MPI_COMM_WORLD): - MPI_Comm_spawn()
Rank (0, MPI_COMM_WORLD): - main()

At this point, I am clueless. Does LAM support this behavior from
master/slave programs, i.e. slaves exiting before the master and the
master then spawning new slaves? I hope some LAM guru (or maybe just
someone who isn't a newbie like I am) will help.
Thanks in advance,
Enrico

-- 
Enrico Sirola <sirola_at_[hidden]>
gpg public key available from www.keyserver.net, Key ID 0x377FE07F
Key fingerprint = B446 7332 ED55 BC68 5FE8  DE0F 98DF EC86 377F E07F