LAM/MPI General User's Mailing List Archives

From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2003-12-11 09:02:02


On Fri, 5 Dec 2003, Korambath, Prakashan wrote:

> I am getting the following error while running the command lamboot -d
> hostfile through a Sun Grid Engine queue script on a Linux cluster. The
> LAM version is 7.0.2. Below is the error message. At the command line I
> can execute lamboot without any problem. I think it has something to do
> with LAM adding some extra options for Sun Grid Engine. Thanks for your
> help in resolving this issue.

Hmm. It's not entirely clear what failed here. I un-munged your output
(somehow the word wrapping got hosed, making your message very difficult
to read); here's what it says:

-----
[snipped]
n0<32556> ssi:boot: Selected boot module rsh
n0<32556> ssi:boot:base: looking for boot schema in following directories:
n0<32556> ssi:boot:base: $TROLLIUSHOME/etc
n0<32556> ssi:boot:base: $LAMHOME/etc
n0<32556> ssi:boot:base: /u/local/mpi/mpi-lam.7.0.2/etc
n0<32556> ssi:boot:base: looking for boot schema file:
n0<32556> ssi:boot:base: /work/29416.1.p02-30m/nodefile
n0<32556> ssi:boot:base: found boot schema: /work/29416.1.p02-30m/nodefile
n0<32556> ssi:boot:rsh: found the following hosts:
n0<32556> ssi:boot:rsh: n0 i02.bwc.ats.ucla.edu (cpu=2)
n0<32556> ssi:boot:rsh: resolved hosts:
n0<32556> ssi:boot:rsh: n0 i02.bwc.ats.ucla.edu --> 10.10.64.2 (origin)
n0<32556> ssi:boot:rsh: starting RTE procs
n0<32556> ssi:boot:base:linear: starting
n0<32556> ssi:boot:base:server: opening server TCP socket
n0<32556> ssi:boot:base:server: opened port 55218
n0<32556> ssi:boot:base:linear: booting n0 (i02.bwc.ats.ucla.edu)
n0<32556> ssi:boot:rsh: starting lamd on (i02.bwc.ats.ucla.edu)
n0<32556> ssi:boot:rsh: starting on n0 (i02.bwc.ats.ucla.edu): hboot -t -c lam-conf.lamd -d -sessionsuffix sge-29416-0 -I -H 10.10.64.2 -P 55218 -n 0 -o 0
n0<32556> ssi:boot:rsh: launching locally

LAM 7.0.2/MPI 2 C++/ROMIO - Indiana University

n0<32556> ssi:boot:base:linear: Failed to boot n0 (i02.bwc.ats.ucla.edu)
n0<32556> ssi:boot:base:server: closing server socket
n0<32556> ssi:boot:base:linear: aborted!
-----------------------------------------------------------------------------
lamboot encountered some error (see above) during the boot process, and
[snipped]
-----

The interesting thing here is that lamboot failed to launch hboot (a
lamboot helper program) on your local node -- it didn't even rsh/ssh
anywhere. Unfortunately, we don't see any error messages indicating *why*
it failed. A few things to check:

- is "hboot" in your path?
- do you have any firewall/port-blocking software running on the machine?
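
For the first item, keep in mind that a batch job may run with a different
environment than your interactive login, so it is worth checking from
inside the SGE queue script itself. A rough sketch, assuming a POSIX shell
(the function name check_in_path is only illustrative, it is not part of
LAM or SGE):

```shell
#!/bin/sh
# Hypothetical helper: report whether a program is reachable via $PATH.
# Prints the resolved location on success, the PATH searched on failure.
check_in_path() {
    if location=$(command -v "$1" 2>/dev/null); then
        echo "$1 found: $location"
        return 0
    fi
    echo "$1 NOT found in PATH=$PATH" >&2
    return 1
}

# Put this in the queue script just before the lamboot line.
check_in_path hboot || echo "hboot is not visible to this job" >&2
```

If the job's output reports that hboot was not found, compare the PATH it
prints against the one in the shell where lamboot works.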

Random note: the 7.0.2 tarballs that were released were accidentally
incomplete (they did not include the Myrinet MPI module). If you have
Myrinet, you might want to upgrade to 7.0.3.

-- 
{+} Jeff Squyres
{+} jsquyres_at_[hidden]
{+} http://www.lam-mpi.org/