LAM/MPI General User's Mailing List Archives

From: Brian W. Barrett (brbarret_at_[hidden])
Date: 2003-06-14 16:45:33


[snip]

> $ cat grower.x
> #!/bin/bash
> lamboot ~/hostfiles/localhost
> ssh -x sybil echo $SHELL
> lamgrow -v sybil

[snip]

> LAM 6.5.9/MPI 2 C++/ROMIO - Indiana University
>
> The ssh shell on sybil
> /bin/csh
> Executing hboot on n1 (sybil - 0 CPU)...
> lamgrow (lambootagent): Connection timed out
> tkill ...

So, I'm going to assume that ~/hostfiles/localhost contains one line, with
the machine name "localhost". This is the cause of the problem: if the
LAM universe is ever going to expand beyond one node, localhost (or any
name resolving to 127.0.0.1) cannot be listed in any host list (whether
given to lamboot or to lamgrow). lamboot does check for localhost
appearing in a host list with more than one node, but it is very difficult
for us to make that check in the situation you are running into, where the
universe starts with a single node and grows later. I believe this
behavior is documented in 6.5.9, and it definitely is in 7.0.

So, the short answer is that you need to use a hostname for the original
lamboot that resolves to something reachable by sybil.
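
As a rough illustration of that advice, here is a minimal bash sketch that
scans a host file for names resolving to a loopback address before you hand
it to lamboot. The function names (is_loopback_addr, resolves_to_loopback,
check_hostfile) are made up for this example and are not part of LAM. It
assumes a system with getent (glibc-based Linux) and a host file with one
machine name at the start of each line and '#' for comments.

```bash
#!/bin/bash
# Hypothetical helpers, not part of LAM/MPI.

# Succeed (return 0) if the given address is a loopback address.
is_loopback_addr() {
    case "$1" in
        127.*|::1) return 0 ;;
        *)         return 1 ;;
    esac
}

# Succeed if any address the resolver reports for the name is loopback.
# Assumes getent is available; a name that does not resolve is not flagged.
resolves_to_loopback() {
    local addr
    for addr in $(getent hosts "$1" | awk '{print $1}'); do
        if is_loopback_addr "$addr"; then
            return 0
        fi
    done
    return 1
}

# Succeed only if no host in the file resolves to a loopback address.
# Only the first word of each line is treated as the host name.
check_hostfile() {
    local file=$1 host rest bad=0
    while read -r host rest; do
        case "$host" in
            ''|'#'*) continue ;;   # skip blank lines and comments
        esac
        if resolves_to_loopback "$host"; then
            echo "warning: $host resolves to a loopback address" >&2
            bad=1
        fi
    done < "$file"
    return "$bad"
}

# Example use (the path is the one from the original script):
#   check_hostfile ~/hostfiles/localhost && lamboot ~/hostfiles/localhost
```

Note that some distributions map the machine's own hostname to 127.0.1.1 in
/etc/hosts, so a file listing the real hostname can still be flagged; in
that case the name is not one sybil could use to reach the machine.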

Hope this helps,

Brian

-- 
  Brian Barrett
  LAM/MPI developer and all around nice guy
  Have a LAM/MPI day: http://www.lam-mpi.org/