LAM/MPI General User's Mailing List Archives

From: Pak, Anne O (anne.o.pak_at_[hidden])
Date: 2003-07-24 18:23:30


Hello:

I have a MATLAB simulation where MATLAB calls a MEX function. The MEX function spawns off a master node, and the master node spawns off multiple slave nodes.

In the MATLAB program, I have a loop, and this MEX function is called on each iteration of the loop.
On the first iteration through this loop, the MEX function spawns off a master node, the master node publishes its name, the MEX program and the master node free the intercommunicator created during the spawn, and then both immediately proceed to do a connect/accept. On all subsequent iterations of the MATLAB loop, the MEX program merely needs to connect (not spawn) to the master node.
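
In case it helps, the first-iteration handshake on the master's side looks roughly like this (a simplified sketch, not my actual code; the service name is a placeholder and all error checking is omitted):

/* master.c: sketch of the master's startup, not the real code */
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Comm parent, mex_comm;
    char port[MPI_MAX_PORT_NAME];

    MPI_Init(&argc, &argv);

    /* intercommunicator created by the MPI_Comm_spawn in the MEX file */
    MPI_Comm_get_parent(&parent);

    /* publish a name so the MEX program can find the master later */
    MPI_Open_port(MPI_INFO_NULL, port);
    MPI_Publish_name("master_service", MPI_INFO_NULL, port);

    /* free the spawn intercommunicator ... */
    MPI_Comm_disconnect(&parent);

    /* ... and immediately wait for the MEX program to connect */
    MPI_Comm_accept(port, MPI_INFO_NULL, 0, MPI_COMM_SELF, &mex_comm);

    /* ... spawn the slaves and do the rest of the master's work ... */

    MPI_Finalize();
    return 0;
}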

Also on the first iteration through the MATLAB loop, the master node spawns off a bunch of slave nodes. The program on the slave nodes immediately enters an infinite while loop, and the connection between the master and the slaves is maintained until the loop in the MATLAB program ends. At that point, a flag is sent to the MEX function, which sends a message to the master before disconnecting from it. The master then broadcasts this *die* message to all the slaves so that they disconnect from the master.
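
The slave program is essentially this (again just a sketch; the tag and the value of the *die* flag are placeholders, and I'm showing a plain point-to-point receive for the flag rather than however the master actually delivers it):

/* slave.c: sketch of the slave's infinite loop */
#include <mpi.h>

#define TAG_CMD 1   /* placeholder tag */
#define CMD_DIE 0   /* placeholder value of the *die* flag */

int main(int argc, char **argv)
{
    MPI_Comm master;   /* intercommunicator back to the master */
    MPI_Status status;
    int cmd = -1;

    MPI_Init(&argc, &argv);
    MPI_Comm_get_parent(&master);

    /* stay connected to the master until it sends the *die* flag */
    while (1) {
        MPI_Recv(&cmd, 1, MPI_INT, 0, TAG_CMD, master, &status);
        if (cmd == CMD_DIE)
            break;
        /* ... otherwise do this iteration's work ... */
    }

    MPI_Comm_disconnect(&master);
    MPI_Finalize();
    return 0;
}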

So the only time the slaves should be disconnecting from the master is upon completion of the MATLAB loop. The MATLAB/MEX process, on the other hand, connects to and disconnects from the master on each iteration of the loop.
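
And the part of the MEX file that runs on every iteration after the first boils down to something like this (simplified; the published service name is the same placeholder as above):

/* sketch of the per-iteration part of the MEX file */
#include <mpi.h>
#include "mex.h"

void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[])
{
    char port[MPI_MAX_PORT_NAME];
    MPI_Comm master;

    /* find the port the master published on the first iteration */
    MPI_Lookup_name("master_service", MPI_INFO_NULL, port);

    /* connect, exchange this iteration's data, then drop the connection */
    MPI_Comm_connect(port, MPI_INFO_NULL, 0, MPI_COMM_WORLD, &master);
    /* ... MPI_Send/MPI_Recv of the iteration's data ... */
    MPI_Comm_disconnect(&master);
}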

When I run this massive MATLAB/MEX/MPI program on one cluster, it works fine. Right up until it comes time to disconnect the slaves from the master at the end of the MATLAB loop, mpitask shows:

TASK (G/L) FUNCTION PEER|ROOT TAG COMM COUNT DATATYPE
0/0 master Comm_accept 0/0 SELF*
0 slave <running>
0 slave <running>
0 slave <running>
0 slave <running>
0 slave <running>
0 slave <running>
0 slave <running>
0 slave <running>
0 slave <running>
0 slave <running>
0 slave <running>
0 slave <running>
0 slave <running>
0 slave <running>
0 slave <running>
0 slave <running>
0 slave <running>
0 slave <running>
0 slave <running>
0 <unknown> <running>

However, when I port the exact same code to another cluster, I see

TASK (G/L) FUNCTION PEER|ROOT TAG COMM COUNT DATATYPE
0/0 master Comm_accept 0/0 SELF*
0 slave <running>
0 slave <running>
0 slave <running>
0 slave <running>
0 slave <running>
0 slave <running>
0 slave <running>
0 slave <running>
0 slave <running>
0 slave <running>
0 slave <running>
0 slave <running>
0 slave <running>
0 slave <running>
0 slave <running>
0 slave <running>
0 slave <running>
0 slave <running>
0 slave <running>
0 <unknown> <running>

for a few iterations of the loop (but by no means close to completing the intended number of iterations in the MATLAB loop), and then suddenly I see

TASK (G/L) FUNCTION PEER|ROOT TAG COMM COUNT DATATYPE
0/0 <unknown> Comm_connect 0/0 WORLD*
(i.e., the master and all the slaves have disappeared)

The task labeled <unknown> contains my MATLAB/MEX code,
the one labeled 'master' is the master node spawned from the MEX on <unknown>,
and the ones labeled 'slave' are spawned by 'master'.

What differences between the two clusters could be causing this problem?
The version of LAM? The Linux version? The compiler? Something hardware-related, perhaps?

By the way, what flags can I use with mpitask (or the like) to get more information than what's shown above? Maybe something that would help me track down WHY the slaves are dying... For some reason, mpimsg doesn't work on my cluster...

Any clues?

Anne

___________________________________________________
Anne Pak, L1-50
Building 153 2G8
1111 Lockheed Martin Way
Sunnyvale, CA 94089
(408) 742-4369 (W)
(408) 742-4697 (F)