LAM/MPI General User's Mailing List Archives

From: Geoffrey Irving (irving_at_[hidden])
Date: 2005-11-20 18:18:18


On Sun, Nov 20, 2005 at 03:35:21PM -0500, Brian Barrett wrote:
> On Nov 20, 2005, at 3:30 PM, Geoffrey Irving wrote:
>
> > I'm getting a weird deadlock when trying to create a new
> > communicator. I'm running 6 processes on two quad processor
> > machines (4 on 1 and 2 on the other), and trying to create a
> > communicator for the first two processes. I successfully create
> > a group containing the first two processes (ranks 0 and
> > 1), and then every process calls MPI_Comm_Create (actually the
> > C++ binding). Processes 1 and 2 successfully complete the call
> > and proceed to other communication. Processes 0,3,4,5 never
> > return from the call to MPI_Comm_Create. The deadlock is
> > deterministic, including which processes return and which don't.
> >
> > As far as I can tell I'm passing correct arguments to the functions
> > involved. Unfortunately the set of processes that completes the
> > call doesn't seem to correlate with anything: the new communicator
> > should contain {0,1}, and processes {0,1,2,3} are on the same
> > machine, but {1,2} succeed.
> >
> > The program has executed a bunch of communication before it reaches
> > this point, including allocating other communicators. I'm running
> > lam 7.1.1.
>
> We certainly haven't seen anything like this before. It would be
> useful if you could include a test case or something similar to that
> - it's awfully hard to duplicate the problem with the
> information you included.

Here you go:

// bug.cpp
#include <mpi.h>
#include <iostream>
using namespace std;

int main(int argc,char* argv[])
{
    MPI_Init(&argc,&argv);

    int rank=MPI::COMM_WORLD.Get_rank();
    static int ranks_01[2]={0,1};
    static int ranks_1234[4]={1,2,3,4};

    // form subgroups of COMM_WORLD containing ranks {0,1} and {1,2,3,4}
    MPI::Group group=MPI::COMM_WORLD.Get_group();
    MPI::Group group_01=group.Incl(2,ranks_01);
    MPI::Group group_1234=group.Incl(4,ranks_1234);

    // every process calls Create; this first call returns on all ranks
    MPI::COMM_WORLD.Create(group_1234);
    cout<<rank<<": "<<"before"<<endl;
    // the deadlock is here: only ranks 1 and 2 return (see output below)
    MPI::COMM_WORLD.Create(group.Incl(2,ranks_01));
    cout<<rank<<": "<<"middle"<<endl;
    MPI::COMM_WORLD.Create(group.Incl(4,ranks_1234));
    cout<<rank<<": "<<"after"<<endl;

    MPI_Finalize();
    return 0;
}
// end bug.cpp

Compiling this code on my local machine with gcc-4.0.1 via

    /usr/local/compilers/gcc-4.0.1-x86_64-x86_64/bin/g++ -pthread -o bug bug.cpp -llammpio -llammpi++ -lmpi -llam -lutil -ldl

and then running it on the aforementioned two quad processor
cluster machines produces the following output:

solverc1:bug% mpirun -np 6 bug
0: before
5: before
1: before
2: before
3: before
4: before
2: middle
1: middle

Unfortunately, I can't use mpic++ to compile it because that adds
-L/usr/lib64, which then produces dynamic linking errors when
I try to run it on the cluster machines (no libstdc++.so.5).

The error does not appear when I run it on my local machine, or
when I compile and run it on the solverc1 machine. Here are the
machine details:

    local machine:
        Intel(R) Xeon(TM) CPU 3.60GHz
        SUSE 9.3 (x86-64)
        default compiler: gcc 3.3.5
        LAM/MPI: 7.1.1
    cluster machines:
        AMD Opteron(tm) Processor 852
        SUSE something
        default compiler: gcc 3.4.4
        LAM/MPI: 7.0.6

Ah, apparently I'm not always running 7.1.1 (I just noticed that).
I imagine compiling and linking with one version of lam and then
running under a different version is not supported. Still, it
would be nice if there were a slightly better error message in that
case.
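
For what it's worth, a per-rank startup banner would at least make
this sort of mismatch easier to spot in the output. Here is a rough
sketch (untested, and the file name is just made up); it uses only
standard MPI calls, and note that MPI_Get_version reports the MPI
standard version of the linked library rather than the LAM release,
so the real check is still comparing what is installed on each
machine (e.g. with laminfo):

// version_banner.cpp (sketch)
#include <mpi.h>
#include <iostream>
using namespace std;

int main(int argc,char* argv[])
{
    MPI_Init(&argc,&argv);

    int rank,version,subversion,len;
    char host[MPI_MAX_PROCESSOR_NAME];
    MPI_Comm_rank(MPI_COMM_WORLD,&rank);
    MPI_Get_version(&version,&subversion);   // MPI standard version, not the LAM release
    MPI_Get_processor_name(host,&len);       // host this rank actually landed on

    cout<<rank<<" on "<<host<<": MPI "<<version<<"."<<subversion<<endl;

    MPI_Finalize();
    return 0;
}
// end version_banner.cpp

Compiled and run the same way as bug.cpp above, each rank prints one
line, which would at least have shown which host each rank ended up
on.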

Time to go yell at the person who installed lam on the cluster
machines...

Thanks,
Geoffrey