LAM/MPI General User's Mailing List Archives

From: Brian W. Barrett (brbarret_at_[hidden])
Date: 2003-07-25 11:37:09


On Thursday, July 24, 2003, at 04:23 PM, Pak, Anne O wrote:

> I have a MATLAB simulation where MATLAB calls a MEX function. The MEX
> function spawns off a master node, and the master node spawns off
> multiple slave nodes.
>
> In the MATLAB program I have a loop, and this MEX function is called
> on each iteration. On the first iteration through the loop, the MEX
> function spawns off a master node, the master node publishes its name,
> the MEX program and master node free the intercommunicator created
> during the spawn, and then immediately proceed to do a connect/accept.
> On all subsequent iterations of the MATLAB loop, the MEX program
> merely needs to connect (not spawn) to the master node.

<snip>

> for a few iterations in the loop (but by no means close to completing
> the intended number of iterations in the MATLAB loop), and then
> suddenly I see
>
> TASK (G/L)  FUNCTION      PEER|ROOT  TAG  COMM    COUNT  DATATYPE
> 0/0         <unknown>     Comm_connect  0/0       WORLD*
>
> (i.e., master and slaves all disappear)
>
> The one labeled <unknown> contains my MATLAB/MEX code, the one
> labeled 'master' is the master node that is spawned from MEX on
> <unknown>, and the ones labeled 'slave' are spawned by 'master'.
>
> What differences between the two clusters could be causing this
> problem? Version of LAM? Linux version? Compiler? Something
> hardware-related, perhaps?
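
If I'm following your description, the two sides of that handshake
should look roughly like this in MPI terms (the service name and
function names below are my guesses, not from your actual code):

   #include <mpi.h>

   /* master side, first iteration only: open a port, publish it under
      a well-known service name, and wait for the MEX process */
   void master_accept(MPI_Comm *client)
   {
       char port[MPI_MAX_PORT_NAME];

       MPI_Open_port(MPI_INFO_NULL, port);
       MPI_Publish_name("master-service", MPI_INFO_NULL, port);
       MPI_Comm_accept(port, MPI_INFO_NULL, 0, MPI_COMM_SELF, client);
   }

   /* MEX side, every iteration: look up the published name, connect */
   void mex_connect(MPI_Comm *server)
   {
       char port[MPI_MAX_PORT_NAME];

       MPI_Lookup_name("master-service", MPI_INFO_NULL, port);
       MPI_Comm_connect(port, MPI_INFO_NULL, 0, MPI_COMM_SELF, server);
   }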

It's hard to know for sure what is going on, but it could be any of
the things you list. One of the most common causes of the problem you
are seeing is memory badness somewhere in your code: overwrite random
memory on one platform and everything still works; do the same on
another platform and suddenly it dies. Unfortunately, your setup is
not really conducive to the normal ways of testing this theory
(running your application under a memory-checking debugger or a
regular debugger), since all of the "interesting" processes are
spawned rather than started directly.
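
One partial workaround, if a memory checker such as valgrind is
installed on both clusters, is to spawn the checker itself and let it
run your master binary. A rough sketch (the binary name is made up;
substitute your real master executable):

   #include <mpi.h>

   /* instead of spawning "./master" directly, spawn valgrind and have
      it exec the master, so the spawned process runs under the
      checker */
   void spawn_master_under_valgrind(MPI_Comm *children)
   {
       char *args[] = { "./master", (char *)0 };

       MPI_Comm_spawn("valgrind", args, 1, MPI_INFO_NULL, 0,
                      MPI_COMM_SELF, children, MPI_ERRCODES_IGNORE);
   }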

As long as you are compiling against the correct copy of LAM (i.e.,
recompiling your application on each cluster so that you pick up the
right header files and libraries), there isn't much on LAM's side that
should cause problems. You can run 'lamboot -V' to see which version
of LAM each cluster is using. As long as both are in the same minor
version series (6.5.x vs. 7.0.x), not much changed in the dynamic
process code across the release cycles. But it is still worth trying
the same version of LAM on both clusters.

Depending on what version of LAM you are using, you may be able to get
some useful information from the debugging output of the LAM daemons.
You would need to start LAM with the -d option at lamboot time. In
LAM 6.5.x, the daemons dump a bunch of information into syslog; in
LAM 7.0, they write it to a file in your LAM session directory under
/tmp/. I believe the daemons record the reason a process died, so the
logs might show exactly what caused the failures you are seeing:
whether the processes exited cleanly, caught a signal, etc.

Other than that, printf() debugging is probably your only option.
Since you don't have access to stdout and stderr from spawned
processes (we'll make that work eventually, we promise :) ), you will
probably have to dump all of the printf() output to a file. That's a
pain, but it should work without too many problems.
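
Something along these lines should do it (the file name is just an
example):

   #include <stdio.h>
   #include <mpi.h>

   /* open a per-process log file; call this after MPI_Init() */
   FILE *open_debug_log(void)
   {
       char name[64];
       int rank;

       MPI_Comm_rank(MPI_COMM_WORLD, &rank);
       snprintf(name, sizeof(name), "/tmp/debug-%d.log", rank);
       return fopen(name, "a");
   }

   /* fflush() after every fprintf() so that buffered output isn't
      lost if the process dies on a signal */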

> Btw, what flags can I use with mpitask or whatnot to get more
> information than what's shown above? Maybe something that would help
> me track down WHY the slaves are dying... For some reason mpimsg
> doesn't work on my cluster...

mpimsg only works with the lamd RPI. I believe we clarified that in
the docs for the 7.0 release, but it wasn't really clear in the
previous docs. mpitask is not capable of showing more information than
what you see above. There are a number of flags (see mpitask -h) that
can filter the information, but there is not much else it can display.

Hope this helps some,

Brian

--
   Brian Barrett
   LAM/MPI developer and all around nice guy
   Have a LAM/MPI day: http://www.lam-mpi.org/