
LAM/MPI General User's Mailing List Archives


From: USFResearch_at_[hidden]
Date: 2003-10-07 20:31:06


Hello everyone.

I am currently running into segmentation faults when programming with MPI in
C++, and I'm hoping someone might be able to help me out. The problem only
appears when I use more than two processors, and only when (a) each
communication is relatively small and (b) it must be repeated many times.

Here are the two functions that use communication. I have removed/renamed
some variables and some "new" statements to make the meat of the problem
more visible.

void TransmitTree(_Split* Tree)
{
  MPI_Send(&Num, 1, MPI_INT, 0, 1, MPI_COMM_WORLD);

  MPI_Send(A, Num, MPI_FLOAT, 0, 2, MPI_COMM_WORLD);
  MPI_Send(B, Num, MPI_SHORT, 0, 3, MPI_COMM_WORLD);
  MPI_Send(C, Num, MPI_INT, 0, 4, MPI_COMM_WORLD);
}

_Split* ReceiveTree()
{
  MPI_Status Status;

  MPI_Recv(&Num, 1, MPI_INT, MPI_ANY_SOURCE, 1, MPI_COMM_WORLD, &Status);

  MPI_Recv(A, Num, MPI_FLOAT, Status.MPI_SOURCE, 2, MPI_COMM_WORLD,
           MPI_STATUS_IGNORE);
  MPI_Recv(B, Num, MPI_SHORT, Status.MPI_SOURCE, 3, MPI_COMM_WORLD,
           MPI_STATUS_IGNORE);
  MPI_Recv(C, Num, MPI_INT, Status.MPI_SOURCE, 4, MPI_COMM_WORLD,
           MPI_STATUS_IGNORE);

  /* construction and return of the _Split are among the "new"
     statements removed above */
}

And here is how the functions are used:

      if (myrank==0)
        for(x=myNumTrees;x<Options->NumberOfTrees;x++)
          Trees[x]=ReceiveTree();
      else
        for(x=0;x<myNumTrees;x++)
          TransmitTree(Trees[x]);

Each processor stores its own number of "trees" in myNumTrees, and
Options->NumberOfTrees is the total number of trees across all CPUs. Every CPU
other than rank 0 must transfer all of its trees to rank 0.

Because of the preconditions for failure, I thought there might be some
buffer overflow issues on rank 0, so I tried substituting MPI_Ssend for
MPI_Send. I have also tried having all of them do MPI_Irecv or MPI_Isend
followed by an MPI_Wait. I just can't seem to get it right.

The errors look like this, though the exact place of the failure varies from
run to run, so I can't track it down much further using "cout." It appears to
me that rank 0 died, but I cannot figure out why:

mpirun -np 6 ./DT

MPI_Send: process in local group is dead (rank 3, MPI_COMM_WORLD)
Rank (3, MPI_COMM_WORLD): Call stack within LAM:
Rank (3, MPI_COMM_WORLD): - MPI_Send()
Rank (3, MPI_COMM_WORLD): - main()
MPI_Send: process in local group is dead (rank 5, MPI_COMM_WORLD)
Rank (5, MPI_COMM_WORLD): Call stack within LAM:
Rank (5, MPI_COMM_WORLD): - MPI_Send()
Rank (5, MPI_COMM_WORLD): - main()
MPI_Send: process in local group is dead (rank 4, MPI_COMM_WORLD)
Rank (4, MPI_COMM_WORLD): Call stack within LAM:
Rank (4, MPI_COMM_WORLD): - MPI_Send()
Rank (4, MPI_COMM_WORLD): - main()
MPI_Send: process in local group is dead (rank 1, MPI_COMM_WORLD)
MPI_Send: process in local group is dead (rank 2, MPI_COMM_WORLD)
Rank (1, MPI_COMM_WORLD): Call stack within LAM:
Rank (1, MPI_COMM_WORLD): - MPI_Send()
Rank (1, MPI_COMM_WORLD): - main()
Rank (2, MPI_COMM_WORLD): Call stack within LAM:
Rank (2, MPI_COMM_WORLD): - MPI_Send()
Rank (2, MPI_COMM_WORLD): - main()

Could someone please explain what I may be doing wrong?

Thank you very much for your time,
Robert