LAM/MPI General User's Mailing List Archives

From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2006-01-13 18:57:46


On Jan 13, 2006, at 1:54 PM, p p wrote:

> I am writing this email to tell you about the results
> of attaching a debugger to the processes of an MPMD
> application. We followed your advice (referring to
> Jeff), ran valgrind's memcheck on the MPMD application,
> and found a serious problem in memory handling that was
> not detectable when running on a single machine. Thanks
> again for your invaluable advice.

Excellent! When we started using memory-checking debuggers, we
wondered how we had ever written software without them. :-)

> The funny part is that after correcting the memory
> problem, the processes blocked when trying to
> construct intercommunicators. After (many, many) hours
> of debugging, nothing seemed to be wrong: the
> intercommunicators should have been constructed, but
> they were not. Then we put the -ssi coll lam_basic
> option on the mpirun command line and, just like
> that, everything ran perfectly!

Yoinks; this should absolutely not be the case.

One common problem that people run into is the faulty assumption
that point-to-point message passing is guaranteed to make progress
during collective operations. For example, LAM's shared memory
collectives do not use the point-to-point framework and are instead
wholly independent of the rest of LAM. Hence, while a process is
inside a shared memory collective, nothing else will progress.

Is it possible that this is happening in your application?

--
{+} Jeff Squyres
{+} The Open MPI Project
{+} http://www.open-mpi.org/