LAM/MPI General User's Mailing List Archives

From: Lily Li (Lily.Li_at_[hidden])
Date: 2006-10-31 12:52:17


We did use 'strace' to follow both the lamd and lamnodes commands. We found
that lamnodes did communicate with lamd about the request: they
exchanged a few messages, and then lamnodes hung waiting for lamd's
response. Meanwhile, lamd received the request from lamnodes, but went back
to its normal select loop and ignored the request.
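
One way to catch this condition early is to probe the daemon with a time limit instead of waiting on it indefinitely. The following is only a minimal sketch in Python, not part of LAM: the helper name `probe` and the 10-second limit are assumptions chosen for illustration, and it relies on nothing more than `lamnodes` being on the PATH.

```python
import subprocess
import sys


def probe(cmd, timeout_s=10.0):
    """Run cmd and classify the outcome.

    Returns "ok" (exit status 0), "failed" (non-zero exit status),
    "missing" (executable not found) or "hung" (no exit within timeout_s).
    """
    try:
        result = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,  # the child is killed when this expires
        )
    except subprocess.TimeoutExpired:
        return "hung"
    except FileNotFoundError:
        return "missing"
    return "ok" if result.returncode == 0 else "failed"


if __name__ == "__main__":
    # Hypothetical usage: treat a lamnodes call that does not return
    # within 10 seconds as a sign that lamd has stopped answering.
    print(probe(["lamnodes"], timeout_s=10.0))
```

Run periodically (from cron, for instance), a "hung" result would mark roughly when lamd stopped answering, which helps when correlating the hang with job activity on the node.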

We don't use lamd for communication within the job; we use it only to load
the jobs. The hang may be caused by multiple requests from many mpirun and
lamnodes processes running concurrently on the node.

Lily

-----Original Message-----
From: Bogdan Costescu [mailto:Bogdan.Costescu_at_[hidden]]
Sent: Thursday, October 26, 2006 9:47 AM
To: General LAM/MPI mailing list
Subject: Re: LAM: Do we need to recompile LAM and applications after
we upgrade the linux kernel ?

On Thu, 26 Oct 2006, Lily Li wrote:

> we started seeing a higher rate of lamd hangs on the
> headnode. lamd stops responding to the command "lamnodes" after
> LAM has been booted and used for a couple of days.

This is a somewhat vague description of the problem. Have you done anything
to diagnose why lamd stops responding? For example, have
you tried attaching to the "hung" lamd with gdb or using 'strace -p'
to see what the process is actually doing?
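
As a rough illustration of the two approaches, the sketch below (Python, purely illustrative) only assembles the command lines; it does not run them. The PID 12345, the log file name and the helper names are made up for the example. The options used are standard ones: strace's -p (attach), -f (follow children), -tt (timestamps) and -o (log file), and gdb's -p, -batch and -ex.

```python
import shlex


def strace_attach_cmd(pid, log_path="lamd.strace"):
    """Command line that attaches strace to a running process.

    -f follows child processes, -tt adds timestamps, -o writes to a file.
    """
    return ["strace", "-f", "-tt", "-p", str(pid), "-o", log_path]


def gdb_backtrace_cmd(pid):
    """Command line that attaches gdb, prints all backtraces, then exits."""
    return ["gdb", "-p", str(pid), "-batch", "-ex", "thread apply all bt"]


if __name__ == "__main__":
    lamd_pid = 12345  # hypothetical PID; look up the real one with ps
    print(shlex.join(strace_attach_cmd(lamd_pid)))
    print(shlex.join(gdb_backtrace_cmd(lamd_pid)))
```

Attaching normally requires running as the same user as lamd (or as root). The gdb backtrace is more informative if lamd was built with debugging symbols.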

> do we need to recompile/relink LAM and the applications after we
> upgrade the linux kernel ?

No, especially not with kernels from an enterprise-class Linux
distribution, which should not change much between updates.

-- 
Bogdan Costescu
IWR - Interdisziplinaeres Zentrum fuer Wissenschaftliches Rechnen
Universitaet Heidelberg, INF 368, D-69120 Heidelberg, GERMANY
Telephone: +49 6221 54 8869, Telefax: +49 6221 54 8868
E-mail: Bogdan.Costescu_at_[hidden]