
LAM/MPI General User's Mailing List Archives


From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2005-03-29 16:04:28


On Mar 29, 2005, at 1:12 PM, Kumar, Ravi Ranjan wrote:

> I wrote a code in C++ using MPI. I divided a bigger block into smaller
> blocks and assigned each block to a different node. I wish to run my
> code simultaneously on different nodes in the LAM. For this I wrote
> this code:
>
> for (time = 1; time <= Nt; time++)
> {
>     do {
>         // some data exchange between neighbouring blocks (nodes)
>
>         // some computation in each block (node/process)
>
>         MPI_Allreduce(...to find convergence condition...);
>     } while (/* convergence not yet reached */);
>
>     MPI_Barrier(...);
> }
>
> I want to have results from different processes at each time step and
> then move to the next time step. The next time step requires results
> from the old time step.
>
> For this, I want 'time' to increment simultaneously on all the
> nodes/processes; that is why I am using MPI_Barrier, but this logic
> doesn't seem to work.
>
> When I run 10 processes on a single node, all the processes end up
> simultaneously, but when I run 10 processes on 5 nodes (2 processes
> per node), synchronization fails. Ranks 1 & 6 lag behind whereas the
> rest of the ranks finish their work quite early. But when I use a
> smaller number of processes/nodes (say 3 to 5), synchronization works
> well even without using MPI_Barrier.

How can you tell that MCW ranks 1 and 6 are lagging?

Be aware that the output sent from remote nodes does not necessarily
appear in any particular order on the mpirun stdout. The barrier that
you have in your loop should force all MPI processes to be more-or-less
exactly in step (meaning that MPI guarantees that no process leaves the
barrier until all processes have entered the barrier).
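One way to check whether ranks are really lagging, rather than their output merely arriving late, is to tag every line of output with the rank and time step and flush it immediately. A hypothetical diagnostic along those lines:

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (int time = 1; time <= 3; time++) {
        /* Tag each line with rank and step so the interleaved mpirun
           stdout can be sorted and compared afterwards.              */
        printf("[rank %d] finished time step %d\n", rank, time);
        fflush(stdout);               /* avoid stdio buffering delays  */
        MPI_Barrier(MPI_COMM_WORLD);  /* all ranks sync before next step */
    }
    MPI_Finalize();
    return 0;
}
```

Even with the barrier, the printed lines may still appear out of order on mpirun's stdout; only the tags tell you which rank finished which step.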

-- 
{+} Jeff Squyres
{+} jsquyres_at_[hidden]
{+} http://www.lam-mpi.org/