LAM/MPI General User's Mailing List Archives

From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2007-09-11 08:42:40


On Sep 9, 2007, at 4:54 AM, Mahesh Salunkhe wrote:

> I've installed lam-6.5.9-1. For the time being I'm running it on two
> machines, but it seems the TCP connection from the other end is not
> getting established. recon runs successfully, but lamboot gives the
> following error. What could be the reason?
> (File lamhosts contains two entries:
> master
> 192.168.10.130)
>
> Output of command lamboot is :
> lamboot -v lamhosts
>
> LAM 7.1.1 /MPI 2 C++/ROMIO - Indiana University

Your statements seem to contradict each other. You mention that you
installed LAM 6.5.9 (an *ancient* version!), but your output shows
v7.1.1.

Note the help message below...

> n-1<4664> ssi:boot:base:linear: booting n0 (master)
> n-1<4664> ssi:boot:base:linear: booting n1 (192.168.10.130)
> -----------------------------------------------------------------------------
> The lamboot agent failed to read a message over a socket from the
> newly-booted process. This should not happen (especially since TCP is
> a guaranteed protocol).
>
> *** PLEASE READ THIS ENTIRE MESSAGE, FOLLOW ITS SUGGESTIONS, AND
> *** CONSULT THE "BOOTING LAM" SECTION OF THE LAM/MPI FAQ
> *** (http://www.lam-mpi.org/faq/) BEFORE POSTING TO THE LAM/MPI USER'S
> *** MAILING LIST.
>
> You should probably check the following:
>
> - Network connectivity: Ensure that messages can be passed reliably
> over TCP using random ports.
> - Environment / PATH settings: Ensure that you are running the same
> version of LAM/MPI on all nodes. Sometimes premature disconnects
> (and therefore this error message) may be caused if mismatched
> versions of LAM are used on different nodes.

Did you check that all nodes will find the same version of LAM by
default?
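
One rough way to check this is to ask each node, over a non-interactive
login, which lamboot it resolves. The sketch below is only an
illustration, not part of LAM: it assumes ssh is the remote agent and
that "which" exists on every node, and the function names are made up
for this example. Differing paths point to a PATH problem; identical
paths do not by themselves prove that the installed versions match.

```python
import subprocess


def remote_lamboot_path(host, timeout=30):
    """Return the lamboot path a non-interactive shell on `host` resolves, or None.

    Assumption: ssh is the remote agent. BatchMode makes ssh fail
    instead of waiting for a password prompt.
    """
    try:
        result = subprocess.run(
            ["ssh", "-o", "BatchMode=yes", host, "which lamboot"],
            capture_output=True, text=True, timeout=timeout,
        )
    except (OSError, subprocess.TimeoutExpired):
        return None
    path = result.stdout.strip()
    return path if result.returncode == 0 and path else None


def summarize(paths_by_host):
    """Group hosts by resolved path; more than one group suggests a mismatch."""
    groups = {}
    for host, path in paths_by_host.items():
        groups.setdefault(path, []).append(host)
    return groups


def report(hosts):
    """Print one line per distinct lamboot path found across `hosts`."""
    groups = summarize({h: remote_lamboot_path(h) for h in hosts})
    for path, names in groups.items():
        print(f"{path or 'NOT FOUND'}: {', '.join(names)}")
    if len(groups) > 1:
        print("Nodes resolve different lamboot binaries; check the PATH "
              "set in the shell startup files.")
```

Calling report(["master", "192.168.10.130"]) from the node where you
run lamboot would cover the two entries in your lamhosts file.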

> - Node health: Ensure that the host where the newly-booted process was
> launched is healthy and still available on the network.
> -----------------------------------------------------------------------------
> n-1<4664> ssi:boot:base:linear: aborted!
> n-1<4670> ssi:boot:base:linear: booting n0 (master)
> n-1<4670> ssi:boot:base:linear: booting n1 ( 192.168.10.130)
> tkill: killing LAM...
> n-1<4670> ssi:boot:base:linear: finished
> lamboot did NOT complete successfully
>
> --
> Regards
> Mahesh
> _______________________________________________
> This list is archived at http://www.lam-mpi.org/MailArchives/lam/
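
For the "Network connectivity" item in the help message, a small
round-trip test on an arbitrary port can show whether a firewall is in
the way. The following is a minimal sketch under stated assumptions,
not how lamboot itself works: the helper names are hypothetical, and
binding to port 0 simply lets the OS pick a free port, in the spirit of
the "random ports" the message mentions.

```python
import socket


def open_listener(bind_addr=""):
    """Listen on an OS-chosen free port; return (socket, port)."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.bind((bind_addr, 0))  # port 0: let the OS pick any free port
    srv.listen(1)
    return srv, srv.getsockname()[1]


def echo_once(srv, timeout=30.0):
    """Accept a single connection and echo back the first chunk received."""
    srv.settimeout(timeout)
    conn, _ = srv.accept()
    with conn:
        conn.sendall(conn.recv(1024))


def probe(host, port, payload=b"ping", timeout=5.0):
    """Return True if `payload` makes the round trip to host:port over TCP."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.sendall(payload)
            return sock.recv(1024) == payload
    except OSError:
        # Covers refused connections, unreachable hosts, and timeouts.
        return False
```

To use it, call open_listener() on one node, print the returned port,
and call echo_once() on the socket; then run probe() against that host
and port from the other node. Repeating the test in both directions is
worthwhile, since the error above is about a message that never arrived
back from the newly-booted process.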

-- 
Jeff Squyres
Cisco Systems