LAM/MPI General User's Mailing List Archives

From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2004-11-13 22:23:39


You should probably check out the LAM FAQ under the section "MPI
Programs under LAM/MPI".

Hope that helps.

On Nov 12, 2004, at 4:40 PM, Jordan Dawe wrote:

> Hi all, newbie question here. I'm in the process of setting up a
> dual-opteron 64-bit gentoo-based diskless computational cluster. I'm
> having a weird problem and I am wondering what the best approach would
> be to debugging it, or if people have seen something similar before.
>
> So here's the situation. recon reports no errors and says everything
> looks fine, and lamboot runs without problems. Running our code on 2
> processors on a single node works fine. Running it across 2 nodes,
> however, causes a near-instant crash with a "process returned
> Signal 11" error: it displays the first printf of the model
> initialization and then dies. This happens whether we run with 2 or
> with 4 processors across the nodes. The problem occurs with both gcc
> and the Portland Group's pgcc, except that with pgcc the crash takes
> nearly 2 seconds to occur.
>
> Furthermore, I compiled a simple MPI test program that passes a
> counter from CPU to CPU, decrementing it at each hop, and it ran fine
> on 4 CPUs across 2 nodes. Thus, I'm guessing this is not necessarily
> an MPI problem, but may be something strange our code is doing.
>
> Any suggestions? I have no idea how to debug an MPI program, so even
> the most basic help or pointers would be welcome.
>
> Jordan Dawe
> _______________________________________________
> This list is archived at http://www.lam-mpi.org/MailArchives/lam/
>
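
[Editor's sketch, not part of the original reply and not taken from the LAM FAQ.] The quoted message says the model prints its first printf and then dies with Signal 11. Because stdout is usually buffered when it is not attached to a terminal, output written shortly before a crash can be lost, so the last line seen does not necessarily mark the crash site. One basic way to narrow it down is to emit rank-tagged trace lines to stderr and flush each one at once. The names format_trace, trace, and TRACE below are hypothetical helpers invented for illustration; the rank argument is assumed to come from the program's own MPI_Comm_rank call.

```c
#include <stdio.h>

/* Hypothetical helper: format "[rank R] file:line: message" into buf.
 * Returns what snprintf returns: the number of characters the full
 * line needs, not counting the terminating NUL. */
int format_trace(char *buf, size_t len, int rank,
                 const char *file, int line, const char *msg)
{
    return snprintf(buf, len, "[rank %d] %s:%d: %s", rank, file, line, msg);
}

/* Write one trace line to stderr and flush immediately, so the line is
 * not left in a stdio buffer if the process dies on the next statement. */
void trace(int rank, const char *file, int line, const char *msg)
{
    char buf[256];
    format_trace(buf, sizeof buf, rank, file, line, msg);
    fprintf(stderr, "%s\n", buf);
    fflush(stderr);
}

/* Convenience macro that records the call site automatically. */
#define TRACE(rank, msg) trace((rank), __FILE__, __LINE__, (msg))
```

Placing calls such as TRACE(rank, "before halo exchange") around the initialization steps should show which rank stops reporting first and roughly where, which is usually enough to decide where to point a debugger next.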

-- 
{+} Jeff Squyres
{+} jsquyres_at_[hidden]
{+} http://www.lam-mpi.org/