On May 5, 2005, at 12:59 PM, Lily Li wrote:
> We are having a problem with LAM 7.0 on a Red Hat Linux 9 cluster
> using Ethernet.
>
> The cluster has 128 nodes. The lamboot was successful, but after
> running for about a day or so, the lamnodes command starts to hang on
> the first node. All the others seem to be working just fine, but they
> report the first node as invalid.
>
> The lamd on the first node is still running; it just does not respond
> to any LAM commands, such as lamnodes, mpitask, etc.
Yikes. That clearly shouldn't happen. :-(
> All nodes have 2 NICs, but the first one is not configured with an IP
> address.
This shouldn't be an issue.
> After LAM has been booted for about a day or so, we sometimes also see
> a message like:
>
> rcmd: socket: all port in use.
Hum. That's an odd message. I'm not sure that it's from us -- rcmd is
a system-level service, if I recall correctly, and not one that LAM
uses.
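
For what it's worth, as far as I know that text is what the C library's
rcmd() prints when it cannot bind a reserved source port (rsh-style
clients take one from the 512-1023 range). If you want to see whether
that range really is exhausted on the first node, here is a rough
sketch that counts TCP sockets bound to those ports from the contents
of /proc/net/tcp. The function name and the default range are my own
choices for illustration; nothing here is part of LAM.

```python
from collections import Counter

# TCP state codes as they appear (in hex) in the "st" column of /proc/net/tcp.
TCP_STATES = {
    "01": "ESTABLISHED", "02": "SYN_SENT", "03": "SYN_RECV",
    "04": "FIN_WAIT1", "05": "FIN_WAIT2", "06": "TIME_WAIT",
    "07": "CLOSE", "08": "CLOSE_WAIT", "09": "LAST_ACK",
    "0A": "LISTEN", "0B": "CLOSING",
}


def reserved_port_usage(proc_net_tcp_text, low=512, high=1023):
    """Return (set of local ports in [low, high], Counter of socket states).

    proc_net_tcp_text is the raw text of /proc/net/tcp. The 512-1023
    default is an assumption about the range rcmd() draws from.
    """
    ports = set()
    states = Counter()
    for line in proc_net_tcp_text.splitlines():
        fields = line.split()
        # Skip the header and any malformed line: field 1 must be "ADDR:PORT".
        if len(fields) < 4 or ":" not in fields[1]:
            continue
        port = int(fields[1].rsplit(":", 1)[1], 16)  # port is hex-encoded
        if low <= port <= high:
            ports.add(port)
            states[TCP_STATES.get(fields[3].upper(), fields[3])] += 1
    return ports, states
```

On the affected node you could call it as
`reserved_port_usage(open("/proc/net/tcp").read())`. If nearly all of
the range is occupied, and mostly by sockets in TIME_WAIT or
CLOSE_WAIT, that would point at something on the system churning
through reserved ports rather than at the lamd itself.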
> Does this problem sound like a system/firewall configuration error or
> an error in lamnodes/lamd?
It *sounds* like a lamd error, but it is not behavior that we have seen
before. It could also be an OS issue where inbound network connections
are somehow not actually getting to the lamd.
Unfortunately (or fortunately?), the backtrace simply shows that the
lamd is in its main processing loop -- nothing too strange showing up
there.
The one obvious question I have to ask -- is there any way that you can
upgrade to LAM 7.1.1 and see if you see the same behavior? There have
been a small number of lamd fixes since 7.0.
--
{+} Jeff Squyres
{+} jsquyres_at_[hidden]
{+} http://www.lam-mpi.org/