Sorry, I somehow missed your reply, but the problem is still there ...
I've answered your questions below.
Thanks for your time.
>On Tue, 6 Aug 2002, Bauer Holger wrote:
>> We experience an aging problem with lam-6.5.6 on a Sun workstation
>> cluster. Since we run the same code using MPICH and a native MPI version
>> on Suns, HPs and Linux I doubt that this is a problem related to our
>> code. Has anybody experienced such behaviour?
>>
>> time steps | CPU time/iteration
>> 40 | 107
>> 45 | 123
>> 50 | 138
>> 63 | 175
>>
>> and so on. When the iterations move from one multigrid level to the next
>> (we have a grid-based flow solver), this behaviour starts again from
>> scratch, but there are no special MPI calls during this transition.
>A few questions:
>
>- Have you run this code through a memory-checking debugger?
Yes and no. The code itself has gone through Purify several times, but
without the LAM-MPI calls. We have one code for both serial and parallel
usage, and when running it serially we just emulate some of the MPI calls.
Most of the emulated calls are just copy routines or simple returns (global
summations, ...).
Running the code in a debugger while using LAM (even using
checkbounds-options of our compiler) did not reveal any memory problems.
>- Are you making any MPI communication calls during the iteration, or is
> there any chance that they're getting serialized?
There are many MPI calls during the iteration. Most of them are MPI_BCAST or
MPI_ALLREDUCE calls on MPI_COMM_WORLD, so the code is highly synchronized.
> You mentioned that
> there are no "special" MPI calls during the iteration -- what does that
> mean?
By "special" I mean most of the MPI-2 functionality, which we do not need,
and the MPI I/O routines. We also make no use of the topology routines, and
we have only two or three communicator groups; most of the calls go through
MPI_COMM_WORLD.
>- Your text seems to imply that this kind of slowdown behavior does not
> happen on the other MPI implementations; is that a correct assumption?
>
Yes, no problem with LAM with the same code on HP-UX and Linux at all.
Also, no problem with MPICH on SunOS, HP-UX, Compaq Tru64 Unix, Linux.
No problem with native MPI on HP-UX, SunOS, NEC-SX-4/5.
So it should be somehow related to the combination of SunOS Fortran90
compiler + SunOS + LAM (so far 6.5.6 and 6.5.5 tested).
How could I debug such a thing? The results are correct! It's just this
annoying aging problem that is very disturbing.
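As a first step towards quantifying the slowdown, the four timings quoted
above can be fitted to a straight line. This is only a minimal sketch in
Python: the helper name fit_line is hypothetical, the four data points are
assumed to be representative, and the unit of the CPU time is left open.

```python
# Timings copied from the table quoted earlier in this message.
steps = [40, 45, 50, 63]
cpu_time = [107, 123, 138, 175]


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept.

    fit_line is a hypothetical helper written for this sketch only.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations in x and the x/y cross term.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = fit_line(steps, cpu_time)
print(f"growth in CPU time per time step: {slope:.2f}")
print(f"extrapolated CPU time at step 0:  {intercept:.2f}")

# Largest deviation of a measured point from the fitted line.
worst = max(abs(y - (slope * x + intercept)) for x, y in zip(steps, cpu_time))
print(f"largest residual: {worst:.2f}")
```

If the residuals stay small, the cost per iteration grows by a roughly fixed
amount with every time step. Repeating the fit on timings taken around
individual MPI calls might show whether the growth sits in communication or
in computation; that is a suggestion for narrowing things down, not a
diagnosis.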
>{+} Jeff Squyres
>{+} jsquyres_at_[hidden]
>{+} http://www.lam-mpi.org/
Holger Bauer
Siemens AG
Power Generation
Muelheim
_______________________________________________
This list is archived at http://www.lam-mpi.org/MailArchives/lam/