Hi!
I am new to parallel computing platforms.
I am trying to install LAM on a 4-PC Linux cluster, but it seems to hang
and does not work at all.
More specifically, the cluster is composed of:
3* PC Intel P4 2.88 GHz, 256 MB RAM, 20 GB HD, 1 Gbit LAN
1* PC Intel P4 2.88 GHz, 512 MB RAM, 120 GB HD, 1 Gbit LAN
Red Hat Linux 9 (Shrike)
I downloaded and installed the LAM 7.0.6 RPM on each node.
I use rsh (not ssh) for communication among the nodes.
I made a bhost file and ran lamboot in verbose mode, which showed no
problems booting any of the nodes. I then tried to run the test suite,
but it hung after a little while. <Ctrl+C> didn't work, so I forcibly
closed the session. I logged in again, but lamhalt failed, so I used
wipe <bhost>.
Suspecting a communication problem, I tried to tping all the nodes, and
it hung after the third ping. A plain ping has no problems at all, so
there is no obvious problem with the networking between the nodes.
Tpinging each node from itself (using the h option) also works without
any problem. However, tpinging one node from another hangs after the
third ping. <Ctrl+C> does not respond and I have to close the shell
session. When I log in again, I find that lamhalt does not stop the LAM
daemons on the two nodes involved. The switch's blinking indicates that
these two nodes are still exchanging something over the network. The
only way to stop the traffic and the LAM daemons is to use the wipe
command.
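
In case it helps anyone reproduce this without losing their shell, here is
a minimal sketch (Python 3.7+) that runs tping against each node with a
hard timeout, so a hang is reported instead of blocking the session. The
node identifiers n0 to n3 and the helper names run_with_timeout and
tping_nodes are placeholders of mine, not anything from LAM, and I am
assuming a hung tping can still be killed from outside.

```python
import subprocess
import sys


def run_with_timeout(cmd, timeout_s=10.0):
    """Run cmd and return (status, output).

    status is "ok", "failed", "timeout", or "missing" (command not found).
    """
    try:
        proc = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising, so nothing lingers.
        return "timeout", ""
    except FileNotFoundError:
        return "missing", ""
    status = "ok" if proc.returncode == 0 else "failed"
    return status, proc.stdout + proc.stderr


def tping_nodes(node_ids, count=3, timeout_s=10.0):
    """tping each node a few times and record how the command ended."""
    results = {}
    for node in node_ids:
        # Assumed invocation: tping -c <count> <node>
        status, _ = run_with_timeout(
            ["tping", "-c", str(count), node], timeout_s
        )
        results[node] = status
    return results


if __name__ == "__main__":
    # Placeholder node identifiers; pass the real ones as arguments.
    nodes = sys.argv[1:] or ["n0", "n1", "n2", "n3"]
    for node, status in tping_nodes(nodes).items():
        print(node, status)
```

A node that shows "timeout" here would correspond to the hang described
above; "missing" just means tping is not on the PATH.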
Do you have any idea how to fix this problem?
Thank you in advance.