LAM/MPI General User's Mailing List Archives

From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2004-12-03 19:22:29


Without any information on your application, it sounds like a classic
MPI problem: assuming buffering in sends.

For example, consider the following running on two MPI processes
(assume that peer==0 on MPI_COMM_WORLD rank 1 and peer==1 on
MPI_COMM_WORLD rank 0):

MPI_Send(sbuffer, ..., peer, tag, MPI_COMM_WORLD);
MPI_Recv(rbuffer, ..., peer, tag, MPI_COMM_WORLD, &status);

This is unsafe: MPI allows this code to block. Specifically, MPI_SEND
is allowed (but not required) to block until a matching receive is
posted. If *both* processes block in their send waiting for the
matching receive, neither ever reaches its MPI_Recv and the code
deadlocks.
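
To make the deadlock concrete, here is a minimal sketch of a toy model
in plain C (it is not MPI code and makes no MPI calls). It assumes the
worst case that MPI permits: a send never completes until the peer has
reached its matching receive. The function name
completes_without_buffering and the 'S'/'R' encoding of each process's
call sequence are made up for this illustration.

```c
#include <assert.h>
#include <string.h>

/*
 * Toy model of two processes exchanging messages with NO buffering.
 * Each string lists one process's blocking calls in order:
 * 'S' = send to the peer, 'R' = receive from the peer.
 * A call completes only when the peer is at the opposite call.
 * Returns 1 if both processes run to completion, 0 if they get stuck.
 */
int completes_without_buffering(const char *ops0, const char *ops1)
{
    assert(ops0 != NULL && ops1 != NULL);

    size_t n0 = strlen(ops0), n1 = strlen(ops1);
    size_t i = 0, j = 0;

    while (i < n0 && j < n1) {
        int match = (ops0[i] == 'S' && ops1[j] == 'R') ||
                    (ops0[i] == 'R' && ops1[j] == 'S');
        if (!match)
            return 0;   /* both in send (or both in receive): stuck */
        i++;            /* the send/receive pair completes together */
        j++;
    }
    /* A process with calls left over waits forever for a peer that has
     * already finished. */
    return i == n0 && j == n1;
}
```

In this model the send-then-receive sequence on both sides ("SR" and
"SR") gets stuck at the first call, while letting one side receive
first ("SR" and "RS") runs to completion.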

Is it possible that your application is exhibiting this kind of
behavior?

If so, the canonical solution is to do something like:

if (myrank == 0) {
   MPI_Send(sbuffer, ...);
   MPI_Recv(rbuffer, ...);
} else if (myrank == 1) {
   MPI_Recv(rbuffer, ...);
   MPI_Send(sbuffer, ...);
}
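
A slightly fuller sketch of the same ordering idea, under some
assumptions that are not in the original problem report: the buffers
hold `count` ints, and the helper names sends_first and
exchange_with_peer, as well as the USE_MPI macro, are hypothetical.
The MPI part is guarded by USE_MPI only so that the ordering rule can
be compiled and checked without an MPI installation. For two ranks the
rule reduces to the pattern above: rank 0 sends first, rank 1 receives
first.

```c
#include <assert.h>

/*
 * Hypothetical helper (not part of MPI): decide which side of a
 * pairwise exchange sends first. The lower rank sends first, so
 * exactly one process of each pair starts in its receive.
 */
int sends_first(int myrank, int peer)
{
    assert(myrank != peer);
    return myrank < peer;
}

#ifdef USE_MPI
#include <mpi.h>

/* Exchange `count` ints with `peer` without relying on buffering. */
void exchange_with_peer(int *sbuffer, int *rbuffer, int count,
                        int myrank, int peer, int tag)
{
    MPI_Status status;

    if (sends_first(myrank, peer)) {
        MPI_Send(sbuffer, count, MPI_INT, peer, tag, MPI_COMM_WORLD);
        MPI_Recv(rbuffer, count, MPI_INT, peer, tag, MPI_COMM_WORLD,
                 &status);
    } else {
        MPI_Recv(rbuffer, count, MPI_INT, peer, tag, MPI_COMM_WORLD,
                 &status);
        MPI_Send(sbuffer, count, MPI_INT, peer, tag, MPI_COMM_WORLD);
    }
}
#endif /* USE_MPI */
```

To build the MPI part, compile with the MPI wrapper compiler and define
the macro, for example with -DUSE_MPI on the mpicc command line.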

On Dec 3, 2004, at 5:23 AM, Atle Svandal wrote:

> Machine:
>
> 2x Athlon MP2400 machine running Red Hat 9.0, connected in a cluster
> with 4 similar machines.
>
> Problem:
>
> Starting up lamboot on a single machine and running mpirun is ok on
> one processor, but stalls on 2.
>
>     mpirun -np 1 <program>    runs fine
>     mpirun -np 2 <program>    stalls at the first or second MPI_Send
>
> The strange thing is what happens when booting two machines with a
> hostfile like:
>
> aqnode03
> aqnode04
>
> Now running on 2 CPUs works fine (one on each machine). Running on
> 4 or 1 CPUs is also ok, but now the program stalls if I try to run it
> on 3 CPUs.
>
> The hostfile should normally be specified as:
>
> aqnode03 cpu=2
> aqnode04 cpu=2
>
> since each node has two CPUs. Booting LAM with this option results in
> a lot of stalls; the only way to run the program is on 1 CPU. The
> hostfile without the cpu specification works well: running
> mpirun -np 4 runs the program efficiently on all 4 CPUs.
>
> The problem is hardly program specific, since we are running the same
> program on two other machines (Opteron running Fedora Core 2). On
> these machines the cpu option in the hostfile also works well.
>
> Hopefully there is someone out there who can answer my most confusing
> questions.
>
> regards
>
> Atle Svandal
>
> Institutt for Fysikk og Teknologi
> Universitetet i Bergen
> Allegaten 55 - 5007 Bergen
> tlf: 55 58 32 58
> _______________________________________________
> This list is archived at http://www.lam-mpi.org/MailArchives/lam/

-- 
{+} Jeff Squyres
{+} jsquyres_at_[hidden]
{+} http://www.lam-mpi.org/