Hello,
I'm trying to port a PVM application to LAM. The application is a
distributed Monte Carlo simulation which works as follows, using a
master/slave paradigm and point-to-point communication with the slaves:
the master is a Python script which reads data from different data
sources and then instantiates the "true" PVM master application, a C++
class wrapped for Python. The C++ extension class starts the PVM
machinery and begins spawning slaves, telling each slave how many paths
to compute. When a slave finishes its job, it reports to the master
and exits; the master then spawns a new slave, and so on until the
calculation is complete.
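For reference, the original PVM master loop looks roughly like the
following. This is only a simplified sketch, not the real code: the slave
executable name ("mc_slave"), the message tags and the workload numbers
are invented for illustration, and the real logic lives in the C++
extension class wrapped to Python.

// Simplified sketch of the original PVM master loop; names, tags and
// numbers are invented, the real code is in the C++ extension class.
#include <pvm3.h>
#include <cstdio>

int main()
{
    const int n_slaves   = 8;       // hypothetical number of concurrent slaves
    const int batch_size = 1000;    // hypothetical paths per slave
    int remaining = 100000;         // hypothetical total paths still to assign
    int active = 0;

    pvm_mytid();                    // enroll the master in the virtual machine

    while (remaining > 0 || active > 0) {
        // keep n_slaves running while there is still work to hand out
        while (remaining > 0 && active < n_slaves) {
            int tid;
            if (pvm_spawn((char *)"mc_slave", 0, PvmTaskDefault,
                          (char *)"", 1, &tid) != 1)
                break;
            int n = batch_size < remaining ? batch_size : remaining;
            pvm_initsend(PvmDataDefault);
            pvm_pkint(&n, 1, 1);
            pvm_send(tid, 1);       // tell the slave how many paths to compute
            remaining -= n;
            ++active;
        }
        // block until any slave reports its result; the slave then exits
        pvm_recv(-1, 2);
        int done;
        pvm_upkint(&done, 1, 1);
        --active;
        std::printf("%d paths added\n", done);
    }
    pvm_exit();
    return 0;
}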
While trying to port this application to LAM, I used the following
approach (a rough code sketch follows the list):
0. the master starts
1. it spawns a number of slave processes
2. the master merges the just created intercommunicator with its
MPI::COMM_WORLD
3. the master sends initialization data to the clients, using
the new intracommunicator
4. the client(s) start calculating, while the master blocks in an
MPI_Recv()
5. one of the clients finishes its job and issues an MPI_Send()
6. the master receives the calculated data
7. the client returns from the MPI_Send(), then exits
8. return to step 1, spawning only one new process.
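In code, the loop I am trying looks roughly like this. It is a
stripped-down sketch of the approach above, spawning one slave at a time
for simplicity (the real code spawns a batch first); the slave executable
name, tag and buffer size are invented, and the real code lives in the
C++ extension class.

// Stripped-down sketch of the loop described above, spawning one slave
// at a time; the slave name, tag and buffer size are invented.
#include <mpi.h>
#include <cstdio>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int runs_left = 10;                 // hypothetical number of slave runs
    while (runs_left-- > 0) {
        // 1. spawn one new slave process
        MPI_Comm inter;
        MPI_Comm_spawn((char *)"mc_slave", MPI_ARGV_NULL, 1, MPI_INFO_NULL,
                       0, MPI_COMM_WORLD, &inter, MPI_ERRCODES_IGNORE);

        // 2. merge the new intercommunicator into an intracommunicator;
        //    the master keeps the low rank (0), the slave becomes rank 1
        MPI_Comm work;
        MPI_Intercomm_merge(inter, 0, &work);

        // 3. send initialization data (here just the number of paths)
        int paths = 1000;               // hypothetical per-slave workload
        MPI_Send(&paths, 1, MPI_INT, 1, 0, work);

        // 4.-6. block until the slave reports its result
        char result[65];
        MPI_Status status;
        MPI_Recv(result, sizeof(result), MPI_BYTE, 1, MPI_ANY_TAG,
                 work, &status);
        std::printf("received result, %d runs left\n", runs_left);

        // 7.-8. the slave now calls MPI_Finalize() and exits; free the
        // communicators and go back to spawn a replacement
        MPI_Comm_free(&work);
        MPI_Comm_free(&inter);
    }

    MPI_Finalize();
    return 0;
}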
The problem is that the master crashes just after step 8. Here is what I
get if I print some log messages to the console:
sending exit code <--- slave is going to exit
sending to 0 <--- sending exit code to master
sent to 0 <--- sent
received message from 1 of size 65 <--- master is receiving exit code
reveived from 1 <--- done
Process 0 finished on sirola01.ge.risk <--- master collecting results from slave
2 paths added
0 failures so far
sent exit code <--- slave before MPI_Finalize()
CHILD EXITING <--- slave after MPI_Finalize()
MPI_Recv: process in local group is dead (rank 0, comm 3) <--- lam complains about a process exiting?
spawning 1 child(s) with tag 1... from tag 0Rank (0, MPI_COMM_WORLD): <--- master tries to spawn a new process, and crashes (lam traceback follows)
Call stack within LAM:
Rank (0, MPI_COMM_WORLD): - MPI_Recv()
Rank (0, MPI_COMM_WORLD): - MPI_Reduce()
Rank (0, MPI_COMM_WORLD): - MPI_Comm_spawn()
Rank (0, MPI_COMM_WORLD): - main()
At this point, I am clueless. Does LAM support this kind of behavior in
master/slave programs, i.e. slaves exiting before the master, and the
master then spawning new slaves? I hope some LAM guru (or maybe just
someone who isn't a newbie like I am) can help.
Thanks in advance,
Enrico
--
Enrico Sirola <sirola_at_[hidden]>
gpg public key available from www.keyserver.net, Key ID 0x377FE07F
Key fingerprint = B446 7332 ED55 BC68 5FE8 DE0F 98DF EC86 377F E07F