
LAM/MPI General User's Mailing List Archives


From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2005-04-05 07:54:43


On Apr 3, 2005, at 12:03 PM, Shi Jin wrote:

> I recently had a very weird problem. I inherited somebody's MPI code,
> but I only want to run it with a single process, since the problem size
> is too small to see any speedup from parallelization. I still compile
> the code with mpif90 and run "lamboot localhost" first. I then run it
> directly as ./Codename, since that is equivalent to
> "mpirun -np 1 ./Codename".
>
> But my code blew up at some point, and my main suspicion is in the
> code: I have two lines at the end of one function:
> call MPI_ALLREDUCE(energy,t1,1,dtype2,MPI_SUM,comm,ierr)
> call MPI_ALLREDUCE(localbanden,t2,1,dtype2,MPI_SUM,comm,ierr)
> I suspect that the function returns before MPI_ALLREDUCE actually sets
> the correct values in t1 and t2. So I tried a simple remedy: I added an
> MPI_BARRIER after each MPI_ALLREDUCE, and the code has run fine ever
> since.

This is, indeed, quite odd -- what version of LAM are you using?

When the communicator only contains one process, the ALLREDUCE should
effectively be a local memory copy (and nothing else). And it cannot
return until energy==t1 and localbanden==t2 -- you're right that MPI
does not guarantee synchronization through collectives, but it does
guarantee that you get the right answers.
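To illustrate, here is a minimal, self-contained sketch of the situation
described above (the variable names are borrowed from the quoted code, and
dtype2 is assumed to be MPI_DOUBLE_PRECISION; the real program may differ).
Run with "mpirun -np 1", each ALLREDUCE over a one-process communicator must
return with the reduced value already in the receive buffer, so no extra
BARRIER should be needed:

```fortran
! Sketch only: assumes dtype2 == MPI_DOUBLE_PRECISION and comm ==
! MPI_COMM_WORLD.  With one process, the "sum" over the communicator
! is just the local value, so t1 and t2 must hold the right numbers
! as soon as the calls return.
program allreduce_np1
  implicit none
  include 'mpif.h'
  integer :: ierr, nprocs
  double precision :: energy, localbanden, t1, t2

  call MPI_INIT(ierr)
  call MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierr)

  energy      = 1.5d0
  localbanden = 2.5d0

  call MPI_ALLREDUCE(energy, t1, 1, MPI_DOUBLE_PRECISION, &
                     MPI_SUM, MPI_COMM_WORLD, ierr)
  call MPI_ALLREDUCE(localbanden, t2, 1, MPI_DOUBLE_PRECISION, &
                     MPI_SUM, MPI_COMM_WORLD, ierr)

  ! Expected with nprocs == 1: t1 == energy and t2 == localbanden
  print *, 'nprocs =', nprocs, ' t1 =', t1, ' t2 =', t2

  call MPI_FINALIZE(ierr)
end program allreduce_np1
```

If t1 or t2 comes back with anything other than the input values in a
single-process run, that points at a problem outside these two calls (e.g.
memory corruption elsewhere in the program), not at the collectives
themselves.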

When you say that your code "blew up", what, exactly, do you mean? Do
you get wrong answers? Or does LAM abort your process with some error?
Or something else?

-- 
{+} Jeff Squyres
{+} jsquyres_at_[hidden]
{+} http://www.lam-mpi.org/