On Sun, Nov 20, 2005 at 03:35:21PM -0500, Brian Barrett wrote:
> On Nov 20, 2005, at 3:30 PM, Geoffrey Irving wrote:
>
> > I'm getting a weird deadlock when trying to create a new
> > communicator. I'm running 6 processes on two quad-processor
> > machines (4 on one and 2 on the other), and trying to create a
> > communicator for the first two processes. I successfully create
> > a group containing the first two processes (ranks 0 and
> > 1), and then every process calls MPI_Comm_create (actually the
> > C++ binding). Processes 1 and 2 successfully complete the call
> > and proceed to other communication. Processes 0, 3, 4, and 5 never
> > return from the call to MPI_Comm_create. The deadlock is
> > deterministic, including which processes return and which don't.
> >
> > As far as I can tell I'm passing correct arguments to the functions
> > involved. Unfortunately the set of processes that completes the
> > call doesn't seem to correlate with anything: the new communicator
> > should contain {0,1}, and processes {0,1,2,3} are on the same
> > machine, but {1,2} succeed.
> >
> > The program has executed a bunch of communication before it reaches
> > this point, including allocating other communicators. I'm running
> > lam 7.1.1.
>
> We certainly haven't seen anything like this before. It would be
> useful if you could include a test case or something similar,
> since it's awfully hard to try to duplicate the problem with the
> information you included.
Here you go:
// bug.cpp
#include <mpi.h>
#include <iostream>
using namespace std;

int main(int argc, char* argv[])
{
    MPI_Init(&argc, &argv);
    int rank = MPI::COMM_WORLD.Get_rank();

    static int ranks_01[2] = {0, 1};
    static int ranks_1234[4] = {1, 2, 3, 4};
    MPI::Group group = MPI::COMM_WORLD.Get_group();
    MPI::Group group_01 = group.Incl(2, ranks_01);
    MPI::Group group_1234 = group.Incl(4, ranks_1234);

    // Each Create is collective over COMM_WORLD, so all 6 processes
    // call it, including those not in the group being passed.
    MPI::COMM_WORLD.Create(group_1234);
    cout << rank << ": " << "before" << endl;
    MPI::COMM_WORLD.Create(group.Incl(2, ranks_01));
    cout << rank << ": " << "middle" << endl;
    MPI::COMM_WORLD.Create(group.Incl(4, ranks_1234));
    cout << rank << ": " << "after" << endl;

    MPI_Finalize();
    return 0;
}
// end bug.cpp
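In case it helps with debugging, here is the minimal pattern I
believe is correct for Comm::Create, with the COMM_NULL check and
the Free() calls spelled out. This is a sketch based on my reading
of the C++ bindings, not something I've run on the cluster:
// sketch.cpp -- untested sketch, not the reproduction above
#include <mpi.h>
#include <iostream>
using namespace std;

int main(int argc, char* argv[])
{
    MPI::Init(argc, argv);
    int rank = MPI::COMM_WORLD.Get_rank();

    static int ranks_01[2] = {0, 1};
    MPI::Group world_group = MPI::COMM_WORLD.Get_group();
    MPI::Group group_01 = world_group.Incl(2, ranks_01);

    // Create is collective over COMM_WORLD: every process must call
    // it, and processes outside the group get MPI::COMM_NULL back.
    MPI::Intracomm comm_01 = MPI::COMM_WORLD.Create(group_01);
    if (comm_01 != MPI::COMM_NULL)
    {
        cout << rank << ": in the new communicator" << endl;
        comm_01.Free();
    }
    else
        cout << rank << ": not in the new communicator" << endl;

    group_01.Free();
    world_group.Free();
    MPI::Finalize();
    return 0;
}
// end sketch.cpp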
Compiling this code on my local machine with gcc-4.0.1 via
/usr/local/compilers/gcc-4.0.1-x86_64-x86_64/bin/g++ -pthread -o bug bug.cpp -llammpio -llammpi++ -lmpi -llam -lutil -ldl
and then running it on the aforementioned two quad-processor
cluster machines produces the following output:
solverc1:bug% mpirun -np 6 bug
0: before
5: before
1: before
2: before
3: before
4: before
2: middle
1: middle
Unfortunately, I can't use mpic++ to compile it, because mpic++ adds
-L/usr/lib64, which then produces dynamic linking errors when
I try to run the result on the cluster machines (no libstdc++.so.5).
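(If I remember right, LAM's wrapper compilers take a -showme flag
that prints the underlying command line without running it, e.g.
mpic++ -showme -o bug bug.cpp
which should show exactly where the -L/usr/lib64 comes from.)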
The deadlock does not appear when I run the binary on my local
machine, or when I compile and run it directly on the solverc1
machine. Here are the machine details:
local machine:
Intel(R) Xeon(TM) CPU 3.60GHz
SUSE 9.3 (x86-64)
default compiler: gcc 3.3.5
LAM/MPI: 7.1.1
cluster machines:
AMD Opteron(tm) Processor 852
SUSE something
default compiler: gcc 3.4.4
LAM/MPI: 7.0.6
Ah, apparently I'm not always running 7.1.1 (I just noticed that).
I imagine compiling and linking against one version of LAM and then
running under a different version is not supported. Still, it
would be nice if there were a slightly better error message in that
case.
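(Assuming laminfo is installed on both machines, running
laminfo
on each one and comparing the LAM version it reports near the top
of its output would have caught the mismatch up front.)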
Time to go yell at the person who installed lam on the cluster
machines...
Thanks,
Geoffrey