LAM/MPI General User's Mailing List Archives

From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2005-05-31 20:45:41


Are you able to do a "tping -c 3" when this occurs? I'm wondering if
your LAM daemons are somehow locking up. More specifically, if you can
log in to the first node where processes from your second job have
failed to start, can you run the command "lamnodes"? This command
connects to the local lamd and queries some information from it -- it
will verify if the lamd on that node is still functioning properly.

If it responds, then that lamd is working fine. If it does not
respond, then the lamd has gone catatonic for some reason (verify that
it's still actually running).
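
If it helps to script that check, here is a minimal sketch in Python. The
helper name probe() and the 10-second timeout are illustrative assumptions,
not part of LAM; it also assumes that "lamnodes" exits with status 0 when the
lamd answers. The "tping -c 3" invocation is copied as written above; check
tping(1) for whether your version also needs a node argument.

```python
import subprocess


def probe(cmd, timeout_s=10.0):
    """Run cmd and classify the result.

    Returns "ok" (exit status 0), "failed" (non-zero exit status),
    "timeout" (no exit within timeout_s seconds, e.g. a hung lamd),
    or "missing" (the command is not on PATH).
    """
    try:
        result = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return "timeout"
    except FileNotFoundError:
        return "missing"
    return "ok" if result.returncode == 0 else "failed"


if __name__ == "__main__":
    # Run this on the first node where the second job failed to start.
    for cmd in (["lamnodes"], ["tping", "-c", "3"]):
        print(" ".join(cmd), "->", probe(cmd))
```

A "timeout" from lamnodes would point to a lamd that is still running but no
longer answering; "failed" may simply mean that no lamd is reachable at all.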

Let me know what you find.

On May 27, 2005, at 6:19 PM, Riju John wrote:

> Hi Jeff,
>
> What I noticed was that mpirun starts and the processes start on the
> first few nodes, but not on the remaining nodes. Those nodes have only
> lamd running, and no processes from the job.
>
> Thanks,
> Riju
>
> On 5/27/05, Jeff Squyres <jsquyres_at_[hidden]> wrote:
>> On May 27, 2005, at 5:32 PM, Riju John wrote:
>>
>>> I am running lam-7.0 on an Opteron cluster running SuSE 8.1.
>>>
>>> I noticed that mpirun sometimes hangs when running multiple MPI jobs.
>>> These jobs run on 64 slave nodes and keep the system resources fairly
>>> busy. The first job is doing a fair amount of disk I/O when the second
>>> job starts. The second job sometimes hangs. This happens before even
>>> getting to MPI_Init. Has anyone seen this kind of problem before? Is
>>> there any option in mpirun that can help with this problem?
>>
>> No, I have not seen this before. Do you know if the processes start at
>> all? I.e., do they reach main()? Or are they stuck somewhere between
>> the beginning of main() and the beginning of MPI_INIT()?
>>
>> --
>> {+} Jeff Squyres
>> {+} jsquyres_at_[hidden]
>> {+} http://www.lam-mpi.org/
>>
>> _______________________________________________
>> This list is archived at http://www.lam-mpi.org/MailArchives/lam/
>>

-- 
{+} Jeff Squyres
{+} jsquyres_at_[hidden]
{+} http://www.lam-mpi.org/