LAM/MPI General User's Mailing List Archives

From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2004-10-18 02:22:34


On Oct 17, 2004, at 3:18 PM, Warner Yuen wrote:

> I'm trying to get LAM-MPI to work with Myrinet. I read on the mailing
> list that the only hope is to try one of the versions on the SVN
> server. So I tried it, but I can't seem to get lamboot to work. When I
> go back and use LAM-7.0.6 it works fine. Any ideas on what's up?
>
> For my configuration, I used:
>
> ./configure --with-rsh=/usr/bin/ssh --prefix=/hpc/tools/lam-gcc-7.2b/
> --with-gm=/opt/gm --with-ib=/usr/mellanox
>
> --------------------------- My lamboot error ---------------------------
>
> node25:~/hpltest warner$ lamboot -v lammachines
>
> LAM 7.2b1svn10172004/MPI 2 C++/ROMIO - Indiana University
>
> n-1<19290> ssi:boot:base:linear: booting n0 (node25.cluster.private)
> n-1<19290> ssi:boot:base:linear: booting n1 (node26.cluster.private)
> -----------------------------------------------------------------------------
> The lamboot agent failed to read a message over a socket from the
> newly-booted process. This should not happen (especially since TCP is
> a guaranteed protocol).
>
> Please check your network connectivity and ensure that messages can be
> passed reliably over TCP. Additionally, ensure that the host where
> the newly-booted process was launched is healthy and still available
> on the network.
> -----------------------------------------------------------------------------

This is quite an odd error because, as the error message describes, a
read() from a socket failed. That means we were able to establish the
socket, but reading data across it then failed. This should not
happen. We have not substantially changed the bootup process in quite
a while -- there is nothing majorly different between 7.0 and 7.1 in
this regard.
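
To illustrate the failure mode described above (the connection is
established, but the read comes back empty), here is a minimal Python
sketch. It is not LAM's boot code: the function names, the 4-byte
"handshake" size, and the loopback peer that closes without sending
anything are all assumptions made for the demonstration.

```python
import socket
import threading


def peer_that_closes_early(server_sock):
    """Hypothetical peer: accept one connection, then close it without sending."""
    conn, _ = server_sock.accept()
    conn.close()


def read_handshake(host, port, nbytes=4):
    """Connect, then try to read nbytes; return whatever was actually received."""
    with socket.create_connection((host, port), timeout=5) as sock:
        chunks = []
        remaining = nbytes
        while remaining > 0:
            data = sock.recv(remaining)
            if not data:
                # An empty result means the peer closed the connection (EOF).
                break
            chunks.append(data)
            remaining -= len(data)
        return b"".join(chunks)


def demo():
    """Run the early-closing peer on loopback and read from it."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    server.listen(1)
    port = server.getsockname()[1]
    worker = threading.Thread(target=peer_that_closes_early, args=(server,))
    worker.start()
    received = read_handshake("127.0.0.1", port)
    worker.join()
    server.close()
    return received
```

Calling demo() connects successfully yet returns fewer bytes than
requested, because the peer went away before sending. A short read of
this kind says nothing about why the remote side exited.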

Can you double check a few things?

- Ensure that you're not mixing the SVN checkout with any
previously-installed versions (particularly on remote nodes). Mixing
7.0 with 7.1 and/or SVN versions may not work -- I *believe* that
although we didn't change the overall procedure, we did change some of
the data that flows between lamboot and the lamd across some of these
versions -- and that could cause the error you're seeing (e.g., a
premature EOF).

- Double check that your TCP connectivity is reliable. I can't imagine
what would cause this to be flaky on Macs, but...
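
As a rough first pass at the connectivity check, something like the
following Python sketch could be run from the node where lamboot is
invoked. It only tests whether a TCP connection can be opened to each
host; it does not measure reliability and it is not part of LAM. The
function names are hypothetical, and port 22 is an assumption based on
the ssh remote agent in the configure line.

```python
import socket


def can_connect(host, port=22, timeout=3.0):
    """Return True if a TCP connection to (host, port) can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


def check_hosts(hosts, port=22):
    """Map each host name to the result of can_connect()."""
    return {host: can_connect(host, port) for host in hosts}
```

For example, check_hosts(["node25.cluster.private",
"node26.cluster.private"]) would report, per host from the lammachines
file, whether the ssh port is reachable.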

-- 
{+} Jeff Squyres
{+} jsquyres_at_[hidden]
{+} http://www.lam-mpi.org/