On Wed, 14 Jul 2004, Gkikas Magiorkinis wrote:
> I am trying to install LAM on a cluster of four Linux PCs, but it
> hangs and does not work at all.
>
> More specifically, the cluster is composed of:
>
> 3 x PC, Intel P4 2.88 GHz, 256 MB RAM, 20 GB HD, 1 Gbit LAN
> 1 x PC, Intel P4 2.88 GHz, 512 MB RAM, 120 GB HD, 1 Gbit LAN
> Red Hat Linux 9 (Shrike)
>
> I downloaded and installed the LAM 7.0.6 RPM on each node. I use rsh
> (not ssh) for communication among the nodes. I made a bhost file and
> ran lamboot in verbose mode, which showed no problems booting any of
> the nodes. I tried to run the test suite, but it hung after a little
> while. <Ctrl+C> didn't work, so I killed the session by force. I
> logged in again, but lamhalt failed, so I used wipe <bhost>.
>
> Suspecting a communication problem, I tried to tping all the nodes,
> and it hung after the third ping. Plain ping has no problems at all,
> so there is no obvious problem with the networking between the nodes.
> Also, tpinging each node from itself (using the h option) works fine,
> but tpinging one node from another hangs after the third ping.
> <Ctrl+C> does not respond and I have to close the shell session. When
> I log in again, lamhalt does not stop the LAM daemons on the nodes
> involved in the ping. The switch's activity lights blink, suggesting
> that these two nodes are still exchanging traffic. The only way to
> stop the traffic and the LAM daemons is to use the wipe command.
It doesn't sound like this is the case, but it's worth checking: make
sure that all firewalling software in RH9 is disabled.
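One rough way to check this on each node is to look at the output of
`iptables -L -n` (run as root). The sketch below assumes iptables is
the packet filter in use and that the listing is in the plain
(non-verbose) format; `firewall_findings` is a hypothetical helper
name, and the sample listing is made up for illustration. It reports
any chain whose default policy is not ACCEPT and any rule it sees.

```python
def firewall_findings(iptables_listing):
    """Return reasons why the listing suggests packets may be filtered.

    Expects the text printed by `iptables -L -n`. An empty result means
    every chain has an ACCEPT policy and no rules were listed.
    """
    findings = []
    chain = None
    for raw in iptables_listing.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("Chain "):
            chain = line.split()[1]
            # Built-in chains show "(policy X)"; user chains show "(N references)".
            if "(policy" in line and "(policy ACCEPT" not in line:
                findings.append(f"{chain}: default policy is not ACCEPT")
        elif line.startswith("target"):
            continue  # column header line
        else:
            findings.append(f"{chain}: rule present: {line}")
    return findings


if __name__ == "__main__":
    # Illustrative listing only, not taken from a real node.
    sample = (
        "Chain INPUT (policy DROP)\n"
        "target     prot opt source               destination\n"
        "ACCEPT     tcp  --  0.0.0.0/0            0.0.0.0/0           tcp dpt:22\n"
    )
    for finding in firewall_findings(sample):
        print(finding)
```

If this reports anything on any node, disable the firewall there and
retry tping before looking further.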
It sounds quite fishy that it works for a little while and then dies /
freezes. When tping freezes, can you see if the lamd's are still running
on all nodes?
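Something like the following could collect that answer from every
node in one go. It is only a sketch: it assumes the boot schema lists
one host per line (with `#` comments), that rsh works without a
password to every host, and that the procps `ps -C lamd -o pid=`
form is available on the nodes. The helper names and the `bhost` file
name are made up. It parses the ps output rather than the exit
status, since rsh does not pass the remote exit status back.

```python
import subprocess


def parse_bhost(text):
    """Return the host names from a boot schema, skipping comments and blanks."""
    hosts = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        if line:
            hosts.append(line.split()[0])  # ignore trailing keys such as cpu=2
    return hosts


def lamd_pids(ps_output):
    """Extract the PIDs printed by `ps -C lamd -o pid=`."""
    return [int(tok) for tok in ps_output.split() if tok.isdigit()]


def rsh_ps(host):
    """Run ps on a remote host via rsh and return its standard output."""
    result = subprocess.run(
        ["rsh", host, "ps", "-C", "lamd", "-o", "pid="],
        capture_output=True, text=True, timeout=15,
    )
    return result.stdout


def check_lamd(hosts, run=rsh_ps):
    """Map each host to the list of lamd PIDs found there (empty if none)."""
    return {host: lamd_pids(run(host) or "") for host in hosts}


if __name__ == "__main__":
    with open("bhost") as f:  # hypothetical boot schema file name
        for host, pids in check_lamd(parse_bhost(f.read())).items():
            print(host, "lamd running" if pids else "no lamd found", pids)
```

Run it from a second shell while tping is frozen. A node that shows no
lamd, or that makes the rsh call itself time out, is the place to
look next.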
--
{+} Jeff Squyres
{+} jsquyres_at_[hidden]
{+} http://www.lam-mpi.org/