
LAM/MPI General User's Mailing List Archives


From: Damien Hocking (damien_at_[hidden])
Date: 2004-11-12 16:58:52


This probably isn't directly a LAM or MPI problem. Check out your
diskless setup. How is swap space arranged for each node? The classic
problem is to have each client node write to the same swap
file/partition over NFS. Instant death.

Also, swapping over NFS needs some free RAM to get going, so your swap
threshold must be set low enough that there's still spare RAM for NFS to
work in when swapping starts.

Can you run the LAM test suite on your cluster?

Damien Hocking


Jordan Dawe wrote:

> Hi all, newbie question here. I'm in the process of setting up a
> dual-opteron 64-bit gentoo-based diskless computational cluster. I'm
> having a weird problem and I am wondering what the best approach would
> be to debugging it, or if people have seen something similar before.
>
> So here's the situation. recon shows no errors and says everything
> looks fine. lamboot runs without problem. Running our code on 2
> processors, one node works fine. Trying to run the code across 2
> nodes, however, causes a near-instant crash with a "process returned
> Signal 11" error--it displays the first printf of the model
> initialization and then dies. This is the case if we try to run with 2
> or with 4 processors across the nodes. This problem occurs using both
> gcc and the Portland Group's pgcc, except that with pgcc the crash
> takes nearly 2 seconds to occur.
>
> Furthermore, I compiled a simple MPI test program that simply passes a
> counter around each CPU and decrements it each time it passes it, and
> it ran fine on 4 CPUs across 2 nodes. Thus, I'm guessing this is not
> necessarily an MPI problem, but may be something strange our code is
> doing.
>
> Any suggestions? I have no idea how to debug an MPI program, so even
> the most basic help or pointers would be welcome.
>
> Jordan Dawe
> _______________________________________________
> This list is archived at http://www.lam-mpi.org/MailArchives/lam/
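
For reference, a ring test along the lines Jordan describes (a counter handed
from rank to rank and decremented on each hop) can be written in a few lines
of C. The sketch below is only an illustration of that kind of test, not the
actual program from this thread; the lap count, message tag, and output
format are arbitrary choices.

/*
 * ring_test.c -- a minimal sketch of the kind of counter-passing test
 * described above (illustration only, not the program from this thread).
 * Rank 0 starts a counter at laps * size; each rank receives it,
 * decrements it, and forwards it, so every rank handles it exactly
 * 'laps' times before the ring winds down at rank 0.
 *
 * Build and run with LAM's wrappers, e.g.:
 *   mpicc ring_test.c -o ring_test
 *   mpirun -np 4 ring_test
 */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size, next, prev, i;
    int counter = 0;
    const int laps = 3;              /* arbitrary number of full laps */
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (size < 2) {
        printf("Run this with at least 2 processes.\n");
        MPI_Finalize();
        return 0;
    }

    next = (rank + 1) % size;        /* neighbour we send to      */
    prev = (rank + size - 1) % size; /* neighbour we receive from */

    if (rank == 0) {
        counter = laps * size;       /* total number of hops */
        MPI_Send(&counter, 1, MPI_INT, next, 0, MPI_COMM_WORLD);
    }

    for (i = 0; i < laps; i++) {
        MPI_Recv(&counter, 1, MPI_INT, prev, 0, MPI_COMM_WORLD, &status);
        counter--;
        printf("rank %d of %d: counter is now %d\n", rank, size, counter);

        /* Rank 0's last receive ends the ring; everyone else always
         * forwards the counter to the next rank. */
        if (!(rank == 0 && i == laps - 1))
            MPI_Send(&counter, 1, MPI_INT, next, 0, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}

If a test like this runs cleanly on 4 CPUs across both nodes while the model
still dies with Signal 11, the MPI transport between the nodes is probably
fine, which is consistent with Jordan's guess that the bug lies in the
application code rather than in MPI itself.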