Hello again!
I posted earlier about a problem with tping.
I have built a mini cluster composed of the following:
3 x PC Intel P4 2.88 GHz, 256 MB RAM, 20 GB HD, 1 Gbit LAN
1 x PC Intel P4 2.88 GHz, 512 MB RAM, 120 GB HD, 1 Gbit LAN
Red Hat Linux 9 (Shrike)
Private network (192.168.1.1-192.168.1.4)
As I mentioned before, I had a problem with LAM: although lamboot
started without errors on all nodes and the programs compiled cleanly,
they froze at the very beginning when I tried to execute them. The
network switch showed traffic as soon as I tried to start the programs,
and the traffic continued even after I closed the terminal.
Searching for an answer, I tried to tping the nodes as follows:
tping from n0 to n0 worked fine
tping from n0 to n1 froze at the third ping
tping from n1 to n1 worked fine
tping from n1 to n0 froze at the third ping
As soon as tping froze, I closed the terminal. I noticed that the switch
started to blink only when I started tping, and it kept blinking until I
logged in again and ran lamhalt or lamboot on all the nodes.
The strange thing is that when I tpinged n1 from n0 (or n0 from n1)
once, it worked fine, but the switch kept blinking afterwards even
though tping had not frozen. Then I tpinged once more: the statistics
were fine and the switch continued to blink (it never stopped during all
this time). Then I tpinged a third time and it froze. Could this be a
memory-related bug?
Nevertheless, the normal ping command works fine. There is no firewall
on my machine, and I have also disabled iptables. I have reformatted the
machines and set up the cluster again. I built LAM from source: I
compiled version 7.0.6 and still had the same problem, then compiled
version 6.5.9 and still had the same problem. I am using an NFS folder
so that I do not have to install the packages on every node separately.
I really do not know what to do.
Is it a hardware problem I should check? I am using a 3COM 1 Gbit
16-port switch and 3COM 1 Gbit LAN adapters.
Are there any daemons that might conflict with lamd?
Is there any network check procedure I could perform?
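To make that last question concrete: one check that goes a step beyond
ICMP ping would be a repeated TCP round trip between two nodes, to see
whether plain TCP traffic also stalls after a few exchanges. Below is a
minimal sketch of such a check. It is not part of LAM and does not
reproduce what tping does internally; it assumes a Python 3 interpreter
is available on the nodes (not part of the setup described above), and
the function names echo_server and round_trips are made up for
illustration.

```python
import socket
import threading


def echo_server(host="127.0.0.1", port=0):
    """Start a one-connection TCP echo server in a background thread.

    Returns the port actually bound (port=0 lets the OS pick a free one).
    Hypothetical helper for illustration; not a LAM tool.
    """
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind((host, port))
    srv.listen(1)
    bound_port = srv.getsockname()[1]

    def serve():
        conn, _ = srv.accept()
        with conn:
            while True:
                data = conn.recv(65536)
                if not data:          # client closed the connection
                    break
                conn.sendall(data)    # echo the bytes back unchanged
        srv.close()

    threading.Thread(target=serve, daemon=True).start()
    return bound_port


def round_trips(host, port, count=10, size=1024, timeout=5.0):
    """Send `count` messages of `size` bytes; return how many came back intact.

    Stops at the first timeout or mismatch, so a result smaller than
    `count` shows at which round trip the exchange stalled.
    """
    payload = b"x" * size
    ok = 0
    with socket.create_connection((host, port), timeout=timeout) as s:
        for _ in range(count):
            received = b""
            try:
                s.sendall(payload)
                while len(received) < size:
                    chunk = s.recv(size - len(received))
                    if not chunk:
                        break
                    received += chunk
            except OSError:           # includes socket timeouts
                break
            if received != payload:
                break
            ok += 1
    return ok


if __name__ == "__main__":
    # Loopback self-test; between two nodes, the server side would bind
    # to the node's own address (e.g. one in 192.168.1.1-192.168.1.4)
    # and the client would connect to that address from the other node.
    port = echo_server()
    print(round_trips("127.0.0.1", port, count=5))
```

If the count returned between two nodes is lower than the count
requested, the stall is not specific to LAM; if all round trips succeed,
the problem is more likely in the LAM setup than in the switch or
adapters.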
I have also compiled MPICH 1.2.5.2, which works fine, but some programs
are said to run better with LAM.
Please help. Any suggestion would be helpful!