LAM/MPI General User's Mailing List Archives

From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2004-07-15 22:26:37


On Wed, 14 Jul 2004, Gkikas Magiorkinis wrote:

> I am trying to install LAM on a 4-PC Linux cluster, but it seems to
> hang and not work at all.
>
> More specifically, the cluster is composed of:
>
> 3* PC Intel P4 2.88 Mhz, 256 MB RAM, 20 GB HD, 1Gbit LAN
> 1* PC Intel P4 2.88 Mhz, 512 MB RAM, 120 GB HD, 1Gbit LAN
> Red Hat Linux 9 (Shrike)
>
> I downloaded and installed the LAM 7.0.6 RPM on each node. I use rsh
> (not ssh) for communication among the nodes. I made a bhost file and
> ran lamboot in verbose mode, which showed no problem booting any of
> the nodes. I tried to run the test suite, but it seemed to hang after
> a little while. <Ctrl+C> didn't work, so I shut down the session by
> force. I logged in again, but lamhalt failed, so I used wipe <bhost>.
>
> Thinking that there was a communication problem, I tried to tping all
> the nodes, and it seemed to hang after the third ping. Plain ping
> doesn't have any problem at all, so there is no obvious problem in
> the networking of the nodes. Tpinging each node from itself (using
> the h option) also works without any problem, but tpinging one node
> from another hangs after the third ping. <Ctrl+C> does not respond
> and I have to close the shell session. When I log in again, I see
> that lamhalt does not stop the LAM daemons on the pinging nodes. The
> switch lights blink, which suggests that these two nodes are still
> exchanging traffic. The only way to stop the traffic and the LAM
> daemons is to use the wipe command.

It doesn't sound like this is the case, but it's worth checking: ensure
that all firewalling software in RH9 is disabled on every node.

It sounds quite fishy that it works for a little while and then dies or
freezes. When tping freezes, can you check whether the lamds are still
running on all nodes?
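
One way to run that check from a single node is sketched below. It is only
a rough sketch: it assumes passwordless rsh to every host listed in the
boot schema (bhost file) and a procps-style ps that accepts -C. The helper
names (parse_bhost, lamd_check_command, lamd_running, check_lamds) are made
up for illustration and are not part of LAM.

```python
import subprocess
import sys


def parse_bhost(text):
    """Return host names from boot schema text, skipping comments and blanks."""
    hosts = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        if line:
            hosts.append(line.split()[0])  # first token is the host; options such as cpu=2 are ignored
    return hosts


def lamd_check_command(host, remote_shell="rsh"):
    """Build the remote command that lists lamd processes on one host."""
    return [remote_shell, host, "ps", "-C", "lamd"]


def lamd_running(ps_output):
    """True if any line of ps output ends with the command name lamd."""
    for line in ps_output.splitlines():
        tokens = line.split()
        if tokens and tokens[-1] == "lamd":
            return True
    return False


def check_lamds(hosts, timeout=10):
    """Ask each host whether a lamd is running; returns {host: status string}."""
    status = {}
    for host in hosts:
        try:
            result = subprocess.run(
                lamd_check_command(host),
                capture_output=True, text=True, timeout=timeout,
            )
            # rsh does not reliably pass back the remote exit code,
            # so the decision is based on the ps output itself.
            status[host] = "running" if lamd_running(result.stdout) else "not running"
        except (subprocess.TimeoutExpired, OSError) as exc:
            status[host] = "no answer (%s)" % exc
    return status


if __name__ == "__main__":
    # Usage: python check_lamds.py <bhost>
    with open(sys.argv[1]) as f:
        for host, state in check_lamds(parse_bhost(f.read())).items():
            print("%s: lamd %s" % (host, state))
```

Run it from a second shell while tping is hung. If a node answers "not
running" or does not answer at all, that narrows down which side of the
connection went away.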

-- 
{+} Jeff Squyres
{+} jsquyres_at_[hidden]
{+} http://www.lam-mpi.org/