On Fri, 8 Aug 2003, Pak, Anne O wrote:
> i have a piece of code which works perfectly fine on one cluster but
> doesn't on another cluster, even though both clusters have been loaded
> up with the same versions of linux and LAM-MPI and are running the same
> exact code.
This sounds like memory badness. I say this not for any particular
technical reason from your description, but rather out of experience with
such nasty Heisenbugs (i.e., you can't know the bug and its location at
the same time). Memory badness somewhere in your code can lead to
indeterminate behavior like this.
Have you tried to run the pure MPI parts of your code through a memory
checking debugger such as Valgrind? I strongly recommend this -- it may
be quite enlightening. (That is, don't worry too much about the Matlab part
-- those just joining the conversation can see prior posts from Ms. Pak on
this list -- just do the memory checking on the pure MPI codes.) So
instead of MPI_Comm_spawn'ing your MPI program directly, MPI_Comm_spawn a
script that runs valgrind, which in turn runs your MPI program (and dumps
the valgrind output to a file somewhere). See the LAM FAQ under
"Debugging" for more details on this technique.
Before doing this, be sure to configure and build LAM with the
--with-purify option to avoid a lot of false positives in the valgrind
reports.
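To make the wrapper-script idea concrete, here is a rough sketch of what
such a script might look like, written in Python purely for illustration
(a two-line shell script would do just as well). The script name, the
VALGRIND_LOG_DIR environment variable, and the log file naming are all
made up for this example; they are not anything LAM provides. It assumes a
Valgrind release that accepts --log-file and expands %p to the process ID
(older releases may spell the option differently), so check
"valgrind --help" on your system.

```python
#!/usr/bin/env python3
"""Hypothetical wrapper: run an MPI program under valgrind, one log per process."""
import os
import sys


def build_valgrind_command(program, program_args, log_dir):
    """Return the argv list that runs `program` under valgrind.

    valgrind expands %p to the process ID, so every spawned process
    writes its report to a separate file in log_dir.
    """
    log_file = os.path.join(log_dir, "valgrind.%p.log")
    return ["valgrind", "--log-file=" + log_file, program] + list(program_args)


def main(argv):
    # VALGRIND_LOG_DIR is an assumed, illustrative variable name.
    log_dir = os.environ.get("VALGRIND_LOG_DIR", "/tmp")
    cmd = build_valgrind_command(argv[1], argv[2:], log_dir)
    # Replace this process image with valgrind rather than forking a child.
    os.execvp(cmd[0], cmd)


# Only launch valgrind when a program to run was actually given.
if __name__ == "__main__" and len(sys.argv) > 1:
    main(sys.argv)
```

You would then hand the path of this wrapper to MPI_Comm_spawn as the
command, with your real MPI program and its arguments passed as the
wrapper's arguments, and read the per-process logs afterwards.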
> i've pinpointed the line of code that's causing the code not to work on
> [snipped]
> fatal mpi_scatter call).
>
> What i've noticed is that the contents of the variable is being printed
> out two times in a row, even though i only have code to print it out
> once. what does this mean? in the same position as this MPI_scatter in
> the code, i've tried mpi-scattering other variables and the simulation
> runs fine, so it seems like there's something specifically wrong with
> this variable. but i don't know how to go about pinpointing what the
> problem is...it doesn't seem to be the size of the variable nor the
> length of the variable name..what else could it be? and why would it be
> killing off the slaves?
This all sounds like the kind of randomness that typically comes from
memory badness in software. Although this may or may not be the actual
problem, I strongly recommend using valgrind (or whatever memory-checking
debugger you have access to -- valgrind will work on x86's with gcc) as
your next step. At the very least, using a memory-checking debugger will
point you in the right direction as to where to look next.
--
{+} Jeff Squyres
{+} jsquyres_at_[hidden]
{+} http://www.lam-mpi.org/