On Nov 20, 2005, at 3:30 PM, Geoffrey Irving wrote:
> I'm getting a weird deadlock when trying to create a new
> communicator. I'm running
> 6 processes on two quad processor machines (4 on 1 and 2 on the
> other), and trying to
> create a communicator for the first two processes. I successfully
> create a
> group containing the first two processes (ranks 0 and 1), and then
> every process calls
> MPI_Comm_create (actually the C++ binding). Processes 1 and 2
> successfully complete
> the call and proceed to other communication. Processes 0,3,4,5
> never return from the
> call to MPI_Comm_create. The deadlock is deterministic, including
> which processes
> return and which don't.
>
> As far as I can tell I'm passing correct arguments to the functions
> involved.
> Unfortunately the set of processes that completes the call doesn't
> seem to correlate
> with anything: the new communicator should contain {0,1}, and
> processes {0,1,2,3} are
> on the same machine, but {1,2} succeed.
>
> The program has executed a bunch of communication before it reaches
> this point,
> including allocating other communicators. I'm running LAM 7.1.1.
We certainly haven't seen anything like this before. Could you send
a small test case that reproduces the hang? It's awfully hard to
duplicate the problem from the information you included.
Thanks,
brian
--
Brian Barrett
LAM/MPI developer and all around nice guy
Have a LAM/MPI day: http://www.lam-mpi.org/