
LAM/MPI General User's Mailing List Archives


From: Gkikas Magiorkinis (gmagi_at_[hidden])
Date: 2004-07-14 09:06:34


Hi!

 

I am new to parallel computing platforms.

I am trying to install LAM on a cluster of four Linux PCs, but it seems to
hang and does not work at all.

 

More specifically,

 

The cluster is composed of:

3 x PC: Intel P4 2.88 GHz, 256 MB RAM, 20 GB HD, 1 Gbit LAN

1 x PC: Intel P4 2.88 GHz, 512 MB RAM, 120 GB HD, 1 Gbit LAN

Red Hat Linux 9 (Shrike)

 

I downloaded and installed the LAM 7.0.6 RPM on each node.

I use rsh (not ssh) for communication among the nodes.

I created a bhost file and ran lamboot in verbose mode, which reported no
problems booting any of the nodes. I then tried to run the test suite, but
it appeared to hang after a short while. <Ctrl+C> did not work, so I had to
kill the session by force. I logged in again, but lamhalt failed, so I used
wipe <bhost>.
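
The boot and shutdown sequence described above corresponds roughly to the
commands below. This is only a sketch: the host names node1..node4 and the
file name bhost.example are placeholders, not the values used on the actual
cluster, and the LAM commands run only if lamboot is found on the PATH.

```shell
# Write a boot schema (bhost file). node1..node4 are placeholder host names.
cat > bhost.example <<'EOF'
node1
node2
node3
node4
EOF

if command -v lamboot >/dev/null 2>&1; then
    lamboot -v bhost.example   # start the LAM daemons, verbose output
    lamnodes                   # list the nodes in the running LAM universe
    lamhalt                    # normal shutdown
    # If lamhalt fails or hangs, fall back to the forced cleanup:
    # wipe bhost.example
else
    echo "lamboot not found; LAM/MPI does not appear to be installed here"
fi
```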

 

Suspecting a communication problem, I tried to tping all the nodes, and it
hung after the third ping. Ordinary ping works without any problem, so there
is no obvious problem with the network between the nodes. Likewise, tpinging
each node from itself (using the h option) works without any problem. But
tpinging one node from another hangs after the third ping. <Ctrl+C> has no
effect and I have to close the shell session. When I log in again, lamhalt
does not stop the LAM daemons on the nodes involved in the ping. The switch
indicates (by blinking) that these two nodes are still exchanging traffic of
some kind. The only way to stop the traffic and the LAM daemons is to use
the wipe command.
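
One way to reproduce the tping hang without losing the shell session is to
put a time limit on each run, as in the sketch below. It assumes that the
script runs on node n0 of an already booted four-node LAM, and that the GNU
coreutils timeout command is available, which may not be the case on older
distributions. The node list and the 30-second limit are illustrative
choices.

```shell
# h = the local node, n1..n3 = the other nodes (LAM node notation),
# assuming this runs on n0.
NODES="h n1 n2 n3"
LIMIT=30   # seconds to wait before giving up on one tping run

if command -v tping >/dev/null 2>&1 && command -v timeout >/dev/null 2>&1; then
    for node in $NODES; do
        echo "tping -c 5 $node"
        # SIGKILL because the hung tping reportedly ignores Ctrl+C.
        timeout -s KILL "$LIMIT" tping -c 5 "$node" \
            || echo "tping to $node failed or did not finish within ${LIMIT}s"
    done
else
    echo "tping or timeout not found; nothing was run"
fi
```

A run that stops at the time limit for remote nodes but completes for h
would match the behaviour described above.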

 

Do you have any idea how to fix this problem?

 

 

Thank you in advance.