Hello everyone.
I am currently getting segmentation faults when programming with MPI in
C++ and I'm hoping someone might be able to help me out. It only happens when
I use more than two processors, and only when (a) the individual messages are
relatively small and (b) they must be sent many times.
Here are the two functions that use communication. I have removed/renamed
some variables and some "new" statements to make the core of the problem
easier to see.
void TransmitTree(_Split* Tree)
{
    // Send the element count first, then the three payload arrays,
    // all to rank 0, each with its own tag.
    MPI_Send(&Num, 1, MPI_INT, 0, 1, MPI_COMM_WORLD);
    MPI_Send(A, Num, MPI_FLOAT, 0, 2, MPI_COMM_WORLD);
    MPI_Send(B, Num, MPI_SHORT, 0, 3, MPI_COMM_WORLD);
    MPI_Send(C, Num, MPI_INT, 0, 4, MPI_COMM_WORLD);
}
_Split* ReceiveTree()
{
    MPI_Status Status;
    // Receive the count from whichever rank sends first, then match the
    // three payload receives to that same source.
    MPI_Recv(&Num, 1, MPI_INT, MPI_ANY_SOURCE, 1, MPI_COMM_WORLD, &Status);
    MPI_Recv(A, Num, MPI_FLOAT, Status.MPI_SOURCE, 2, MPI_COMM_WORLD,
             MPI_STATUS_IGNORE);
    MPI_Recv(B, Num, MPI_SHORT, Status.MPI_SOURCE, 3, MPI_COMM_WORLD,
             MPI_STATUS_IGNORE);
    MPI_Recv(C, Num, MPI_INT, Status.MPI_SOURCE, 4, MPI_COMM_WORLD,
             MPI_STATUS_IGNORE);
    // (construction and return of the _Split removed here along with
    // the "new" statements)
}
And here is how the functions are used:
if (myrank == 0)
    for (x = myNumTrees; x < Options->NumberOfTrees; x++)
        Trees[x] = ReceiveTree();
else
    for (x = 0; x < myNumTrees; x++)
        TransmitTree(Trees[x]);
Each processor stores its own number of "trees" in myNumTrees, and
Options->NumberOfTrees is the total number of trees across all CPUs. Every CPU
other than rank 0 must transfer all of its trees to rank 0.
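In case it helps to see the pattern in one piece, here is a minimal,
self-contained version of the same exchange. The fixed-size arrays and the
MAXNUM bound are hypothetical stand-ins for the removed "new" statements, and
each rank sends just one "tree" here instead of myNumTrees of them:

#include <mpi.h>

const int MAXNUM = 1024;   // assumed upper bound on Num
static int   Num;
static float A[MAXNUM];
static short B[MAXNUM];
static int   C[MAXNUM];

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (rank == 0) {
        // Rank 0 collects one tree from every other rank.
        for (int i = 1; i < size; i++) {
            MPI_Status St;
            MPI_Recv(&Num, 1, MPI_INT, MPI_ANY_SOURCE, 1, MPI_COMM_WORLD, &St);
            MPI_Recv(A, Num, MPI_FLOAT, St.MPI_SOURCE, 2, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Recv(B, Num, MPI_SHORT, St.MPI_SOURCE, 3, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Recv(C, Num, MPI_INT, St.MPI_SOURCE, 4, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        }
    } else {
        Num = 8;   // small message, as in the failing case
        MPI_Send(&Num, 1, MPI_INT, 0, 1, MPI_COMM_WORLD);
        MPI_Send(A, Num, MPI_FLOAT, 0, 2, MPI_COMM_WORLD);
        MPI_Send(B, Num, MPI_SHORT, 0, 3, MPI_COMM_WORLD);
        MPI_Send(C, Num, MPI_INT, 0, 4, MPI_COMM_WORLD);
    }
    MPI_Finalize();
    return 0;
}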
Because of the preconditions for failure, I thought there might be a buffer
overflow issue on rank 0, so I tried substituting MPI_Ssend for MPI_Send. I
have also tried having all of the calls use MPI_Irecv or MPI_Isend followed by
an MPI_Wait. I just can't seem to get it right.
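For reference, the nonblocking variant I tried looked roughly like this (a
sketch with the same hypothetical buffers as above):

// Receiving side: post the receive of the count, wait for it to complete,
// then use Status.MPI_SOURCE for the payload receives exactly as before.
MPI_Request Req;
MPI_Status  Status;
MPI_Irecv(&Num, 1, MPI_INT, MPI_ANY_SOURCE, 1, MPI_COMM_WORLD, &Req);
MPI_Wait(&Req, &Status);

// Sending side: the same pattern for each MPI_Isend.
MPI_Isend(&Num, 1, MPI_INT, 0, 1, MPI_COMM_WORLD, &Req);
MPI_Wait(&Req, MPI_STATUS_IGNORE);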
The errors look like this, but the exact location varies from run to run, so
I can't track it down much further using "cout". It appears to me that
rank 0 died, but I cannot figure out why:
mpirun -np 6 ./DT
MPI_Send: process in local group is dead (rank 3, MPI_COMM_WORLD)
Rank (3, MPI_COMM_WORLD): Call stack within LAM:
Rank (3, MPI_COMM_WORLD): - MPI_Send()
Rank (3, MPI_COMM_WORLD): - main()
MPI_Send: process in local group is dead (rank 5, MPI_COMM_WORLD)
Rank (5, MPI_COMM_WORLD): Call stack within LAM:
Rank (5, MPI_COMM_WORLD): - MPI_Send()
Rank (5, MPI_COMM_WORLD): - main()
MPI_Send: process in local group is dead (rank 4, MPI_COMM_WORLD)
Rank (4, MPI_COMM_WORLD): Call stack within LAM:
Rank (4, MPI_COMM_WORLD): - MPI_Send()
Rank (4, MPI_COMM_WORLD): - main()
MPI_Send: process in local group is dead (rank 1, MPI_COMM_WORLD)
MPI_Send: process in local group is dead (rank 2, MPI_COMM_WORLD)
Rank (1, MPI_COMM_WORLD): Call stack within LAM:
Rank (1, MPI_COMM_WORLD): - MPI_Send()
Rank (1, MPI_COMM_WORLD): - main()
Rank (2, MPI_COMM_WORLD): Call stack within LAM:
Rank (2, MPI_COMM_WORLD): - MPI_Send()
Rank (2, MPI_COMM_WORLD): - main()
Could someone please explain what I may be doing wrong?
Thank you very much for your time,
Robert