LAM/MPI General User's Mailing List Archives

From: Gkikas Magiorkinis (gmagi_at_[hidden])
Date: 2004-08-04 05:50:38


Hello again!

I posted earlier about a problem with tping.

 

I have built a mini cluster composed of the following:

 

3 x PC: Intel P4 2.88 GHz, 256 MB RAM, 20 GB HD, 1 Gbit LAN

1 x PC: Intel P4 2.88 GHz, 512 MB RAM, 120 GB HD, 1 Gbit LAN

Red Hat Linux 9 (Shrike)

Private network (192.168.1.1-192.168.1.4)

 

As I mentioned before, I had a problem with LAM: although lamboot started
without errors on all nodes and the programs compiled cleanly, they froze at
the very beginning when I tried to execute them. The network switch showed
traffic as soon as I started the programs, and the traffic continued even
after I closed the terminal.

Searching for an answer, I ran tping between the nodes, with these results:

tping from n0 to n0 worked fine

tping from n0 to n1 froze at the third ping

tping from n1 to n1 worked fine

tping from n1 to n0 froze at the third ping

As soon as tping froze, I closed the terminal. I noticed that the switch
started to blink only when I started tping, and it kept blinking until I
logged in again and ran lamhalt or lamboot on all the nodes.

 

The strange thing is that when I ran tping once from n0 to n1 (or from n1 to
n0), it worked fine, but the switch kept blinking afterwards even though tping
had not frozen. I then ran tping a second time: the statistics were fine and
the switch continued to blink (it never stopped blinking during all this
time). When I ran tping a third time, it froze. It looks like some kind of
memory bug?

 

Nevertheless, the normal ping command works fine. There is no firewall on my
machines, and I have also disabled iptables. I have reformatted and set up the
cluster again. I built LAM from source: I compiled version 7.0.6 and had the
problem, then compiled version 6.5.9 and still had the same problem. I am
using an NFS folder so that I do not have to install the packages on every
node separately.

 

I really do not know what to do.

 

Is it a hardware problem I should check? I am using a 3Com 1 Gbit 16-port
switch and 3Com 1 Gbit LAN adapters.

Are there any daemons that might conflict with lamd?

Is there any network check procedure I could perform?
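
One LAM-independent check of this kind would be a plain TCP round-trip test
between two nodes, repeated several times, since the freeze described above
only shows up after a few exchanges. The following is only a minimal sketch in
Python under that assumption. It is not tping and does not use LAM at all. The
function names (recv_exact, serve_echo, round_trips, start_local_server), the
message size, and the timeout are all hypothetical choices for illustration.

```python
import socket
import threading
import time


def recv_exact(conn, size):
    """Read exactly `size` bytes, or fewer if the peer closes first."""
    chunks = []
    remaining = size
    while remaining > 0:
        chunk = conn.recv(remaining)
        if not chunk:
            break
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


def serve_echo(listener, size):
    """Accept one connection and echo fixed-size messages until it closes."""
    conn, _ = listener.accept()
    with conn:
        while True:
            data = recv_exact(conn, size)
            if len(data) < size:
                break
            conn.sendall(data)


def round_trips(host, port, count=10, size=1024, timeout=5.0):
    """Send `count` messages and return each round-trip time in seconds.

    A socket timeout on a later round trip (rather than the first) would
    suggest the problem is below LAM, in the network or the drivers.
    """
    payload = b"x" * size
    times = []
    with socket.create_connection((host, port), timeout=timeout) as conn:
        for _ in range(count):
            start = time.perf_counter()
            conn.sendall(payload)
            reply = recv_exact(conn, size)
            if reply != payload:
                raise RuntimeError("echo mismatch or connection closed")
            times.append(time.perf_counter() - start)
    return times


def start_local_server(size=1024, host="127.0.0.1", port=0):
    """Start the echo server in a background thread; return its port."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    listener.bind((host, port))
    listener.listen(1)
    thread = threading.Thread(
        target=serve_echo, args=(listener, size), daemon=True
    )
    thread.start()
    return listener.getsockname()[1]


if __name__ == "__main__":
    # Loopback demo; on the cluster the server would run on one node
    # (bound to its private address) and round_trips on another.
    port = start_local_server()
    for i, rtt in enumerate(round_trips("127.0.0.1", port, count=5)):
        print(f"round trip {i}: {rtt * 1e6:.0f} us")
```

If such a test also stalled after a few round trips between n0 and n1, that
would point towards the switch, adapters, or drivers rather than LAM; if it
ran cleanly, the LAM daemons would be the more likely place to look.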

 

I have also compiled MPICH 1.2.5.2, which works fine, but some programs are
claimed to run better with LAM.

 

Please help. Any suggestion would be helpful!