Hello,
I'm getting a weird deadlock when trying to create a new communicator. I'm running
6 processes on two quad-processor machines (4 on one and 2 on the other), and trying to
create a communicator for the first two processes. I successfully create a
group containing the first two processes (ranks 0 and 1), and then every process calls
MPI_Comm_create (actually the C++ binding). Processes 1 and 2 successfully complete
the call and proceed to other communication. Processes 0, 3, 4, and 5 never return from the
call to MPI_Comm_create. The deadlock is deterministic, including which processes
return and which don't.
As far as I can tell I'm passing correct arguments to the functions involved.
Unfortunately the set of processes that completes the call doesn't seem to correlate
with anything: the new communicator should contain {0,1}, and processes {0,1,2,3} are
on the same machine, but {1,2} succeed.
The program has executed a bunch of communication before it reaches this point,
including allocating other communicators. I'm running LAM/MPI 7.1.1.
Thanks,
Geoffrey