
LAM/MPI General User's Mailing List Archives


From: Jordan Dawe (jdawe_at_[hidden])
Date: 2004-11-13 17:38:27


The nodes are configured without swap. Is this a problem? I've run the
code on the slave node's 2 CPUs alone without swap and had no
problems. I've got 2 GB of RAM on the slave, and the code only uses
~200 MB per CPU. I've also set swappiness (/proc/sys/vm/swappiness) to 0.

The LAM test suite passes all tests on the cluster. I'm guessing this
means it's a problem with the particular code I'm running.

How does one start debugging a LAM-based parallel program?

Jordan
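
A common first step, independent of LAM: have each rank print its
hostname and PID right after MPI_Init and then spin until a debugger
attaches and clears a flag. A minimal sketch in C (the
wait_for_debugger variable name is purely illustrative):

    #include <mpi.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        int rank;
        char host[256];
        volatile int wait_for_debugger = 1;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Announce where each rank lives so you know which node to log
           into and which PID to attach to. */
        gethostname(host, sizeof(host));
        printf("rank %d: host %s, pid %d\n", rank, host, (int) getpid());
        fflush(stdout);

        /* On the node in question:  gdb /path/to/prog <pid>
           then in gdb:  set var wait_for_debugger = 0
                         continue
           and wait for the SIGSEGV to get a backtrace. */
        while (wait_for_debugger)
            sleep(1);

        /* ... rest of the program ... */

        MPI_Finalize();
        return 0;
    }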

Damien Hocking wrote:

> This probably isn't directly a LAM or MPI problem. Check out your
> diskless setup. How is swap space arranged for each node? The classic
> problem is to have each client node write to the same swap
> file/partition over NFS. Instant death.
>
> Also, swapping over NFS needs free RAM to initiate, so your swap
> threshold must be set low enough that there's spare RAM left for NFS
> to work in when swapping starts.
>
> Can you run the LAM test suite on your cluster?
>
> Damien Hocking
>
> Rome wasn’t built in a meeting.
>
>
>
> Jordan Dawe wrote:
>
>> Hi all, newbie question here. I'm in the process of setting up a
>> dual-Opteron 64-bit Gentoo-based diskless computational cluster. I'm
>> having a weird problem and am wondering what the best approach to
>> debugging it would be, or whether people have seen something similar
>> before.
>>
>> So here's the situation: recon shows no errors and says everything
>> looks fine, and lamboot runs without problems. Running our code on 2
>> processors on one node works fine. Trying to run the code across 2
>> nodes, however, causes a near-instant crash with a "process returned
>> Signal 11" error--it displays the first printf of the model
>> initialization and then dies. This happens whether we run with 2 or
>> with 4 processors across the nodes, and it occurs with both gcc and
>> the Portland Group's pgcc, except that with pgcc the crash takes
>> nearly 2 seconds to occur.
>>
>> Furthermore, I compiled a simple MPI test program that passes a
>> counter around the CPUs in a ring, decrementing it at each hop (see
>> the sketch at the end of this page), and it ran fine on 4 CPUs
>> across 2 nodes. Thus, I'm guessing this is not necessarily an MPI
>> problem, but may be something strange our code is doing.
>>
>> Any suggestions? I have no idea how to debug an MPI program, so even
>> the most basic help or pointers would be welcome.
>>
>> Jordan Dawe
>> _______________________________________________
>> This list is archived at http://www.lam-mpi.org/MailArchives/lam/
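
For reference, a ring test like the one Jordan describes fits in a few
lines of MPI C. A minimal sketch (the starting value of 100 and the
message tag of 0 are arbitrary, and at least 2 ranks are assumed):

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, size, counter = 100;
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        if (rank == 0) {
            /* Rank 0 starts the ring and waits for the counter to
               come back around from the last rank. */
            MPI_Send(&counter, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(&counter, 1, MPI_INT, size - 1, 0, MPI_COMM_WORLD,
                     &status);
            printf("counter came back as %d after %d hops\n",
                   counter, size - 1);
        } else {
            /* Every other rank receives from its left neighbor,
               decrements, and passes the counter to its right. */
            MPI_Recv(&counter, 1, MPI_INT, rank - 1, 0, MPI_COMM_WORLD,
                     &status);
            counter--;
            MPI_Send(&counter, 1, MPI_INT, (rank + 1) % size, 0,
                     MPI_COMM_WORLD);
        }

        MPI_Finalize();
        return 0;
    }

If a test like this runs cleanly on 4 CPUs across both nodes (e.g.
mpirun -np 4 ./ring) while the real model still dies, that points at
the model code or its memory use rather than at the LAM transport.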