Gkikas,
Have you tried running a TCP/IP test (e.g. TTCP, ping, traceroute) between
the various nodes, to make sure that you don't have an intermittent network
problem? This should indicate whether the problem is really a LAM issue or
a hardware one.
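
If TTCP is not to hand, a repeated TCP round-trip test can be sketched in a few lines of Python. This is only an illustration of the kind of check meant above, not a replacement for TTCP: the function names, the payload size and the timeout are all assumptions chosen for the example. The idea is to run the echo server on one node and drive it from another; a stall or timeout after a few exchanges would point at the network rather than at LAM.

```python
import socket
import threading
import time


def start_echo_server(host="0.0.0.0", port=0):
    """Start a TCP echo server in a background thread and return its port.

    Hypothetical helper for this sketch; port 0 lets the OS pick a free port.
    """
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind((host, port))
    srv.listen(5)

    def serve():
        # Handle one client at a time, echoing back whatever arrives.
        while True:
            conn, _ = srv.accept()
            with conn:
                while True:
                    data = conn.recv(65536)
                    if not data:
                        break
                    conn.sendall(data)

    threading.Thread(target=serve, daemon=True).start()
    return srv.getsockname()[1]


def tcp_round_trips(host, port, count=100, size=4096, timeout=5.0):
    """Send `count` messages of `size` bytes and wait for each echo.

    Returns the round-trip times in seconds. Raises socket.timeout if an
    echo does not arrive within `timeout` seconds, which is the symptom
    of an intermittent link that this test is looking for.
    """
    payload = b"x" * size
    rtts = []
    with socket.create_connection((host, port), timeout=timeout) as sock:
        for _ in range(count):
            start = time.perf_counter()
            sock.sendall(payload)
            received = 0
            while received < size:
                chunk = sock.recv(size - received)
                if not chunk:
                    raise ConnectionError("peer closed the connection early")
                received += len(chunk)
            rtts.append(time.perf_counter() - start)
    return rtts
```

To use it, call start_echo_server() with a fixed port of your choosing on one node, then call tcp_round_trips() from another node with that node's address and port, and repeat for each pair of nodes. If every exchange completes on all pairs, the hardware is less likely to be the culprit.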
Regards
Neil Storer
Gkikas Magiorkinis wrote:
> Hello again!
>
>
>
> I posted earlier about a problem with tping!
>
>
>
> I have built a mini cluster composed of the following:
>
>
>
> 3* PC Intel P4 2.88 Mhz, 256 MB RAM, 20 GB HD, 1Gbit LAN
>
> 1* PC Intel P4 2.88 Mhz, 512 MB RAM, 120 GB HD, 1Gbit LAN
>
> Red Hat Linux 9 (Shrike)
>
> Private network (192.168.1.1-192.168.1.4)
>
>
>
> As I mentioned before, I had a problem with LAM: although lamboot started
> perfectly well on all nodes and the programs compiled without errors, the
> programs froze at the very beginning when I tried to execute them. The
> network switch showed traffic as soon as I tried to start the programs,
> and the traffic continued even after I closed the terminal.
>
> Searching for an answer, I tried to tping the nodes as follows:
>
> tping from n0 to n0 worked perfectly well
>
> tping from n0 to n1 froze at the third ping
>
> tping from n1 to n1 worked perfectly well
>
> tping from n1 to n0 froze at the third ping
>
> As soon as tping froze, I closed the terminal. I noticed that the switch
> started to blink only when I started tping, and continued to blink until
> I logged in again and lamhalted or lambooted all the nodes.
>
>
>
> The strange thing is that when I tpinged n1 from n0 (or n0 from n1) once,
> it worked perfectly well, but the switch kept blinking afterwards even
> though tping did not freeze. Then I tpinged once more: the statistics
> were fine and the switch continued to blink (it never stopped blinking
> during all this time). Then I tpinged a third time and it froze. Could it
> be a memory bug?
>
>
>
> Nevertheless, the normal ping command works perfectly well. There is no
> firewall on my machine, and I have also disabled iptables. I have
> formatted and set up the cluster again. I built LAM from source: I
> compiled version 7.0.6 and still had the same problem, then compiled
> version 6.5.9 and still had the same problem. I am using an NFS folder so
> that I do not have to install the packages on every node separately.
>
>
>
> I really do not know what to do.
>
>
>
> Is it a hardware problem I should check? I am using a 3COM 1Gbit 16-port
> switch and 3COM 1Gbit LAN adapters.
>
> Are there any daemons that might conflict with lamd?
>
> Is there any network check procedure I could perform?
>
>
>
> I have also compiled MPICH 1.2.5.2, which works fine, but some programs
> are claimed to run better with LAM.
>
>
>
> Please help. Any suggestion would be helpful!
>
>
>
>
> ------------------------------------------------------------------------
>
> _______________________________________________
> This list is archived at http://www.lam-mpi.org/MailArchives/lam/
--
+-----------------+---------------------------------+------------------+
| Neil Storer | Head: Systems S/W Section | Operations Dept. |
+-----------------+---------------------------------+------------------+
| ECMWF, | email: neil.storer_at_[hidden] | //=\\ //=\\ |
| Shinfield Park, | Tel: (+44 118) 9499353 | // \\// \\ |
| Reading, | (+44 118) 9499000 x 2353 | ECMWF |
| Berkshire, | Fax: (+44 118) 9869450 | ECMWF |
| RG2 9AX, | | \\ //\\ // |
| UK | URL: http://www.ecmwf.int/ | \\=// \\=// |
+--+--------------+---------------------------------+----------------+-+
| ECMWF is the European Centre for Medium-Range Weather Forecasts |
+-----------------------------------------------------------------+