LAM/MPI General User's Mailing List Archives

From: Pierre Valiron (Pierre.Valiron_at_[hidden])
Date: 2005-09-02 09:10:49


Jeff Squyres wrote:

>On Sep 1, 2005, at 5:43 AM, Pierre Valiron wrote:
>
>
>
>>Well, I finally found the problem was related to the behaviour of
>>MPI_INIT.
>>The code snippet below is buggy when started over many nodes and procs:
>>
>> call MPI_Init(err)
>> call MPI_Comm_rank(MPI_COMM_WORLD,me,err)
>> call MPI_Comm_size(MPI_COMM_WORLD,nprocs,err)
>> (some work)
>> call MPI_Finalize(err)
>> end
>>
>>If I include
>> call MPI_Barrier(MPI_COMM_WORLD,err)
>>right after MPI_Init, all problems disappear.
>>
>>
>
>That's quite surprising.
>
>
>
>>I could not tell exactly what has been cured by the MPI_Barrier call: a
>>wrong MPI_Comm_rank or MPI_Comm_size, or a not fully functional MPI
>>environment. It is hard to say, as one process dies before writing
>>anything...
>>Using mpirun -s reduces the occurrence of the bug, but does not provide
>>a cure. For some unknown reason, adding a sleep after lamboot also helps.
>>
>>
>
>One thing that I would be wary of is that prior to [unreleased] version
>7.1.2, lamhalt will complete up to 1-2 seconds *before* the universe
>has shut down. So if you have a fast-repeating system of:
>
>repeat:
> lamboot ...
> mpirun ...
> lamhalt
>
>You could actually have problems with the lamboot or mpirun getting
>killed by the end-effects of the prior lamhalt.
>
>Can you try putting a "sleep 2" after the lamhalt and see if that
>helps? I ask because this seems to be a timing problem -- adding
>delays at various stages in the pipeline seems to make the frequency of
>the problem decrease.
>
>LAM 7.1.2 changes lamhalt such that it won't quit until the universe is
>fully dead.
>
>
>
I already tested the 7.1.2b25 release (which includes the new lamhalt, I
presume); it showed the same faulty behaviour as 7.1.1.
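
For reference, the repeated boot/run/halt cycle with the suggested pause after lamhalt could be sketched roughly as below. This is only an illustration: the variables LAMBOOT, MPIRUN, LAMHALT and HALT_DELAY, the function name run_cycles, and the arguments "hostfile" and "./a.out" are placeholders assumed for the example, not names taken from the actual setup.

```shell
#!/bin/sh
# Sketch only: boot the universe, run the job, halt, then pause so that
# the end-effects of lamhalt cannot hit the next lamboot or mpirun.
# The variables below are hypothetical knobs introduced for this example.
: "${LAMBOOT:=lamboot}"
: "${MPIRUN:=mpirun}"
: "${LAMHALT:=lamhalt}"
: "${HALT_DELAY:=2}"      # seconds to wait after lamhalt

run_cycles() {
    # $1 = number of boot/run/halt cycles to perform
    i=0
    while [ "$i" -lt "$1" ]; do
        # "hostfile" is a placeholder boot schema name
        $LAMBOOT hostfile || return 1
        # "./a.out" is a placeholder for the MPI program
        $MPIRUN C ./a.out
        $LAMHALT
        # Give the previous universe time to shut down completely
        sleep "$HALT_DELAY"
        i=$((i + 1))
    done
}
```

Calling "run_cycles 10" would then repeat the cycle ten times. With a lamhalt that waits until the universe is fully dead, the pause should in principle no longer be needed.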

However, computers are strange machines.

One of our nodes had a defective gigabit interface on the motherboard
(one out of two, so the node remained functional) and was replaced today.
Since this replacement I can't reproduce the bug, so it seems to have been
related to this weird motherboard problem.

So all is fine again for the moment!

Many thanks to Bogdan and Jeff for your advice.
Pierre.

-- 
Support the SAUVONS LA RECHERCHE movement:
http://recherche-en-danger.apinc.org/
       _/_/_/_/    _/       _/       Dr. Pierre VALIRON
      _/     _/   _/      _/   Laboratoire d'Astrophysique
     _/     _/   _/     _/    Observatoire de Grenoble / UJF
    _/_/_/_/    _/    _/    BP 53  F-38041 Grenoble Cedex 9 (France)
   _/          _/   _/    http://www-laog.obs.ujf-grenoble.fr/~valiron/
  _/          _/  _/     Mail: Pierre.Valiron_at_[hidden]
 _/          _/ _/      Phone: +33 4 7651 4787  Fax: +33 4 7644 8821
_/          _/_/