LAM/MPI General User's Mailing List Archives

From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2005-11-11 06:36:21


On Nov 12, 2005, at 12:55 AM, Guangyu Wu wrote:

> I could boot lam universe using rsh by "lamboot -v nodes", but got the
> same error while booting within a PBS job.

If you're in a PBS job and you lamboot with a hostfile, LAM is still
going to use tm and ignore the hostfile unless you specifically disable
the TM boot module. Did you do that?

Other than that, rsh and tm use the same mechanisms when launching
(i.e., communication-wise). So if both are able to actually launch the
lamd's on remote nodes, but one works and the other does not, something
is odd with your setup.

> Thanks for your reply! Now it seems I have compiled lam with TM
> enabled.
> But I got a "The lamboot agent timed out while waiting for the
> newly-booted process" error while booting lam within a PBS job.
> The following message in the .e36 file indicates that lam was trying to
> boot via tm.
> n0<16809> ssi:boot:tm: successfully launched on n2 (linux3)
> Attached please find the job script and error output file.
> I didn’t configure any rsh or ssh between the 3 nodes.
> Please could you have a look inside the file and give me some
> suggestions?

I see from your output:

n0<16809> ssi:boot:base:linear_windowed: finished launching
n0<16809> ssi:boot:base:server: expecting connection from finite list
n0<16809> ssi:boot:base:server: got connection from 192.168.40.81
n0<16809> ssi:boot:base:server: this connection is expected (n0)
n0<16809> ssi:boot:base:server: remote lamd is at 192.168.40.81:32782
n0<16809> ssi:boot:base:server: expecting connection from finite list
n0<16809> ssi:boot:base:server: got connection from 56.145.206.0

So lamboot thinks it launched everything and then it got a callback
from the local lamd and that went fine. But then it got a callback
from 56.145.206.0 -- that seems like a pretty strange IP address.
Since you're using 192.168 kinds of addresses, I'm surprised that a
non-private address is calling back, and I'm also surprised that it's a
.0 address. Are you sure that your network setup is correct?
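
To make that check repeatable, here is a minimal sketch (not part of LAM) that scans lamboot debug output for "got connection from" lines and flags any callback address outside the network you expect. The function name suspicious_callbacks and the default network 192.168.0.0/16 are assumptions for illustration; substitute your cluster's actual subnet.

```python
import ipaddress
import re

# Matches the callback lines as they appear in the lamboot output above.
CALLBACK = re.compile(r"got connection from (\d+\.\d+\.\d+\.\d+)")


def suspicious_callbacks(log_text, expected_net="192.168.0.0/16"):
    """Return callback addresses that fall outside the expected network.

    expected_net is an assumption; set it to your cluster's real subnet.
    """
    net = ipaddress.ip_network(expected_net)
    found = []
    for match in CALLBACK.finditer(log_text):
        addr = ipaddress.ip_address(match.group(1))
        if addr not in net:
            found.append(str(addr))
    return found


# Two lines copied from the output quoted above.
log = """\
n0<16809> ssi:boot:base:server: got connection from 192.168.40.81
n0<16809> ssi:boot:base:server: got connection from 56.145.206.0
"""

print(suspicious_callbacks(log))  # prints ['56.145.206.0']
```

Any address this reports is worth comparing against the interface configuration and name resolution on the node that called back.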

After all this, LAM decides that it hasn't heard from all the other
lamd's in a timely fashion and gives up.

You might want to look in the syslog on the nodes that failed to boot
and see if there are any lamd messages in there (lamboot -d causes the
lamd's to dump messages to the syslog).

-- 
{+} Jeff Squyres
{+} The Open MPI Project
{+} http://www.open-mpi.org/