Yikes; well, this definitely shouldn't be happening. :-(
Is there any chance that you can run this code through a memory-checking
debugger such as Valgrind or Purify, just to check that there
aren't other issues going on? Check out the LAM FAQ for Purify issues
-- be sure to use a LAM built with --with-purify so that you don't get
a bunch of false positives for reads from uninitialized memory.
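
One quick check that could go alongside the memory debugger, as a minimal sketch only: test for NaN right before and right after the reduction, to see whether energy is already bad on the way in. The names energy, t1, dtype2 and comm come from your snippet; the program name, the stand-in value, and the assumption that dtype2 is MPI_DOUBLE_PRECISION are hypothetical. It relies on ieee_is_nan from the Fortran 2003 ieee_arithmetic module, which an older compiler may not provide.

```fortran
! Hypothetical diagnostic sketch, not the original code.
program check_allreduce
  use, intrinsic :: ieee_arithmetic, only: ieee_is_nan
  implicit none
  include 'mpif.h'

  double precision :: energy, t1
  integer :: dtype2, comm, ierr

  call MPI_INIT(ierr)
  comm   = MPI_COMM_WORLD
  dtype2 = MPI_DOUBLE_PRECISION   ! assumption: matches the declared type of energy and t1

  energy = 1.0d0                  ! stand-in for the value the real code computes

  ! If the input is already NaN, the reduction itself is not the culprit.
  if (ieee_is_nan(energy)) print *, 'energy is NaN before MPI_ALLREDUCE'

  call MPI_ALLREDUCE(energy, t1, 1, dtype2, MPI_SUM, comm, ierr)

  ! With one process, t1 should simply equal energy at this point.
  if (ieee_is_nan(t1)) print *, 't1 is NaN after MPI_ALLREDUCE'

  call MPI_FINALIZE(ierr)
end program check_allreduce
```

In the real code, only the two if-tests around the existing call would be needed. If the first message prints, the NaN was produced earlier in the function; if only the second prints, that points at the call itself or at memory being overwritten, which is what the memory debugger should help confirm.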
On Apr 5, 2005, at 2:38 PM, Shi Jin wrote:
> Jeff, thanks a lot for replying.
> My LAM version is 7.0.2.
> When I said "blowing up", I meant that t1 is actually set to
> "NaN" by the function calling MPI_Allreduce. And this
> quickly makes the code abort.
>
> Thanks.
>
> Shi
>
>
> --- Jeff Squyres <jsquyres_at_[hidden]> wrote:
>> On Apr 3, 2005, at 12:03 PM, Shi Jin wrote:
>>
>>> I recently had a very weird problem. I inherited somebody's MPI code
>>> but I only want to run it with a single process since the problem size
>>> is too small to get any speedup from parallelization. But I still
>>> compile the code using mpif90 and run "lamboot localhost"
>>> first. I ran it directly as ./Codename since that is equivalent to
>>> "mpirun -np 1 ./Codename".
>>>
>>> But my code blew up at some point, and my main suspicion is this: in the
>>> code, I have two lines at the end of one function:
>>> call MPI_ALLREDUCE(energy,t1,1,dtype2,MPI_SUM,comm,ierr)
>>> call MPI_ALLREDUCE(localbanden,t2,1,dtype2,MPI_SUM,comm,ierr)
>>> I suspect that the function returns before MPI_ALLREDUCE actually sets
>>> the correct number in t1 and t2. So I did a simple remedy by adding an
>>> MPI_BARRIER after each MPI_ALLREDUCE and the code runs fine forever.
>>
>> This is, indeed, quite odd -- what version of LAM are you using?
>>
>> When the communicator only contains one process, the ALLREDUCE should
>> effectively be a local memory copy (and nothing else). And it cannot
>> return until energy==t1 and localbanden==t2 -- you're right that MPI
>> does not guarantee synchronization through collectives, but it does
>> guarantee that you get the right answers.
>>
>> When you say that your code "blew up", what, exactly, do you mean? Do
>> you get wrong answers? Or does LAM abort your process with some error?
>> Or something else?
>>
>> --
>> {+} Jeff Squyres
>> {+} jsquyres_at_[hidden]
>> {+} http://www.lam-mpi.org/
>>
>> _______________________________________________
>> This list is archived at
>> http://www.lam-mpi.org/MailArchives/lam/
>>
>
--
{+} Jeff Squyres
{+} jsquyres_at_[hidden]
{+} http://www.lam-mpi.org/