LAM/MPI General User's Mailing List Archives

From: Jordan Dawe (jdawe_at_[hidden])
Date: 2004-11-12 16:40:38


Hi all, newbie question here. I'm in the process of setting up a
dual-Opteron, 64-bit, Gentoo-based diskless computational cluster. I'm
having a weird problem and am wondering what the best approach to
debugging it would be, or whether people have seen something similar
before.

So here's the situation. recon reports no errors and says everything
looks fine, and lamboot runs without problems. Running our code on 2
processors on a single node works fine. Running it across 2 nodes,
however, crashes almost instantly with a "process returned Signal 11"
error: it displays the first printf of the model initialization and
then dies. This happens whether we run with 2 or with 4 processors
across the nodes. The problem occurs with both gcc and the Portland
Group's pgcc, except that with pgcc the crash takes nearly 2 seconds to
occur.

Furthermore, I compiled a simple MPI test program that passes a counter
from CPU to CPU, decrementing it at each hop, and it ran fine on 4 CPUs
across 2 nodes. Thus, I'm guessing this is not necessarily an MPI
problem, but may be something strange our code is doing.

Any suggestions? I have no idea how to debug an MPI program, so even
the most basic help or pointers would be welcome.

Jordan Dawe