LAM/MPI General User's Mailing List Archives


From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2005-09-01 21:41:46


On Sep 1, 2005, at 5:43 AM, Pierre Valiron wrote:

> Well, I finally found the problem was related to the behaviour of
> MPI_INIT.
> The code snippet below is buggy when started over many nodes and procs:
>
> call MPI_Init(err)
> call MPI_Comm_rank(MPI_COMM_WORLD,me,err)
> call MPI_Comm_size(MPI_COMM_WORLD,nprocs,err)
> (some work)
> call MPI_Finalize(err)
> end
>
> If I include
> call MPI_Barrier(MPI_COMM_WORLD,err)
> right after MPI_Init, all problems disappear.

That's quite surprising.
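
For reference, a minimal self-contained version of that pattern with
the reported workaround would look something like the following (a
sketch only; I'm assuming the Fortran 77 bindings via mpif.h, and
"(some work)" stays as a placeholder):

      program init_test
      implicit none
      include 'mpif.h'
      integer me, nprocs, err

      call MPI_Init(err)
c     reported workaround: barrier immediately after MPI_Init
      call MPI_Barrier(MPI_COMM_WORLD, err)
      call MPI_Comm_rank(MPI_COMM_WORLD, me, err)
      call MPI_Comm_size(MPI_COMM_WORLD, nprocs, err)
c     (some work)
      call MPI_Finalize(err)
      end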

> I could not determine exactly what was cured by the MPI_Barrier
> call: a wrong MPI_Comm_rank or MPI_Comm_size, or a not fully
> functional MPI environment. It is hard to say, as one process dies
> before writing anything...
> Using mpirun -s reduces the occurrence of the bug, but does not
> provide a cure. For some unknown reason, adding a sleep after lamboot
> also helps.

One thing that I would be wary of is that prior to the [unreleased]
version 7.1.2, lamhalt can return up to 1-2 seconds *before* the
universe has actually shut down. So if you have a fast-repeating
sequence of:

repeat:
        lamboot ...
        mpirun ...
        lamhalt

You could actually have problems with the lamboot or mpirun getting
killed by the end-effects of the prior lamhalt.

Can you try putting a "sleep 2" after the lamhalt and see if that
helps? I ask because this seems to be a timing problem -- adding
delays at various stages in the pipeline seems to make the problem
occur less frequently.
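
That is, something like this (just a sketch; the "..." arguments are
whatever you are already passing):

repeat:
        lamboot ...
        mpirun ...
        lamhalt
        sleep 2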

LAM 7.1.2 changes lamhalt such that it won't quit until the universe is
fully dead.

-- 
{+} Jeff Squyres
{+} jsquyres_at_[hidden]
{+} http://www.lam-mpi.org/