LAM/MPI General User's Mailing List Archives

From: Wouter Brok (wjmb_at_[hidden])
Date: 2003-09-11 08:27:07


Hi,

Here is the output of `lamboot -h lamhosts'. Does anyone have a
suggestion on how to proceed with this?

=======================================================================
hboot: process schema = "/usr/local/etc/lam-conf.lam"
hboot: found /usr/local/bin/lamd
hboot: performing tkill
hboot: tkill
hboot: booting...
hboot: fork /usr/local/bin/lamd
[1] 12846 lamd -H 131.155.111.159 -P 21081 -n 0 -o 0 -d
lamboot: attempting to execute "ssh -x 131.155.113.140 -n echo $SHELL"
lamboot: got remote shell /bin/bash
lamboot: attempting to execute "ssh -x 131.155.113.140 -n hboot -t -c lam-conf.lam -d -s -I "-H 131.155.111.159 -P 21081 -n 1 -o 0 ""
hboot: process schema = "/etc/opt/lam/lam-conf.lam"

LAM 6.5.6/MPI 2 C++/ROMIO - University of Notre Dame

lamboot: boot schema file: lamhosts
lamboot: opening hostfile lamhosts
lamboot: found the following hosts:
lamboot: n0 131.155.111.159
lamboot: n1 131.155.113.140
lamboot: resolved hosts:
lamboot: n0 131.155.111.159 --> 131.155.111.159
lamboot: n1 131.155.113.140 --> 131.155.113.140
lamboot: found 2 host node(s)
lamboot: origin node is 0 (131.155.111.159)
lamboot: attempting to execute "hboot -t -c lam-conf.lam -d -I " -H 131.155.111.159 -P 21081 -n 0 -o 0 ""
hboot: found /opt/lam/bin/lamd
hboot: performing tkill
hboot: tkill
hboot: booting...
hboot: fork /opt/lam/bin/lamd
[1] 8819 lamd -H 131.155.111.159 -P 21081 -n 1 -o 0 -d
-----------------------------------------------------------------------------
lamboot encountered some error (see above) during the boot process,
and will now attempt to kill all nodes that it was previously able to
boot (if any).

Please wait for LAM to finish; if you interrupt this process, you may
have LAM daemons still running on remote nodes.
-----------------------------------------------------------------------------

LAM 6.5.6/MPI 2 C++/ROMIO - University of Notre Dame

Executing tkill on n0 (131.155.111.159)...
Executing tkill on n1 (131.155.113.140)...
wipe ...
lamboot did NOT complete successfully
=======================================================================

As you can see, I have two hosts in my lamhosts file. LAMRSH is set to
"ssh -x". One (n0) is a SuSE 7.3 machine; the other (n1) is a
SuSE 8.1 machine.
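
For anyone wanting to repeat the remote-shell step by hand, here is a minimal sketch of the LAMRSH setting and the probe command that lamboot reports in the output above. The variable names `remote` and `probe` are only illustrative; the command is printed rather than executed so it can be inspected first.

```shell
# Remote shell command used by lamboot, as set in this report.
export LAMRSH="ssh -x"

# Address of n1, taken from the lamboot output above.
remote="131.155.113.140"

# Rebuild the probe that lamboot says it attempts to execute.
# The backslash keeps $SHELL literal so it would expand on the remote side.
probe="$LAMRSH $remote -n echo \$SHELL"

# Print the command; paste it into a terminal to run it manually.
echo "$probe"
```

If this command prompts for a password or prints anything besides the shell path when run by hand, that would be worth noting alongside the lamboot output.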

As a test, I ran lamboot with a hostfile containing only the IP address
of the local machine. On the first machine (n0 above) this worked
fine; doing the same on the second machine, I got:

=======================================================================
LAM 6.5.6/MPI 2 C++/ROMIO - University of Notre Dame

lamboot: boot schema file: lamhosts
lamboot: opening hostfile lamhosts
lamboot: found the following hosts:
lamboot: n0 131.155.113.140
lamboot: resolved hosts:
lamboot: n0 131.155.113.140 --> 131.155.113.140
lamboot: found 1 host node(s)
lamboot: origin node is 0 (131.155.113.140)
lamboot: attempting to execute "hboot -t -c lam-conf.lam -d -I " -H 131.155.113.140 -P 32920 -n 0 -o 0 ""
hboot: process schema = "/etc/opt/lam/lam-conf.lam"
hboot: found /opt/lam/bin/lamd
hboot: performing tkill
hboot: tkill
hboot: booting...
hboot: fork /opt/lam/bin/lamd
[1] 8964 lamd -H 131.155.113.140 -P 32920 -n 0 -o 0 -d
hboot: attempting to execute
-----------------------------------------------------------------------------
lamboot encountered some error (see above) during the boot process,
and will now attempt to kill all nodes that it was previously able to
boot (if any).

Please wait for LAM to finish; if you interrupt this process, you may
have LAM daemons still running on remote nodes.
-----------------------------------------------------------------------------
lamwipe ...
-----------------------------------------------------------------------------
*** Oops -- cannot find the help that you're supposed to get.
*** Using the following help file:
***
*** /etc/opt/lam/lam-helpfile
***
*** You were supposed to get help on the program "boot"
*** about the topic "wipe-fail"
*** But it doesn't seem to be in that file.
***
*** Sorry!
-----------------------------------------------------------------------------
=======================================================================
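
The single-host test above can be reproduced roughly as follows. This is a sketch, not the exact commands used: the file name `lamhosts.local` is an assumption, and `-d` is lamboot's debugging-output option.

```shell
# Hypothetical file name for the single-host boot schema.
hostfile="lamhosts.local"

# The hostfile lists only the local machine's address
# (131.155.113.140 is the second machine, n1 in the two-host run).
printf '%s\n' "131.155.113.140" > "$hostfile"

# Boot with debugging output, but only if lamboot is installed here.
if command -v lamboot >/dev/null 2>&1; then
    lamboot -d "$hostfile"
else
    echo "lamboot not found in PATH" >&2
fi
```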

Subsequently trying
  
  hboot -t -c lam-conf.lam -d -I " -H 131.155.113.140 -P 32920 -n 0 -o 0 "

gave me:

=======================================================================
hboot: process schema = "/etc/opt/lam/lam-conf.lam"
hboot: found /opt/lam/bin/lamd
hboot: performing tkill
hboot: tkill
hboot: booting...
hboot: fork /opt/lam/bin/lamd
[1] 8980 lamd -H 131.155.113.140 -P 32920 -n 0 -o 0 -d
hboot: attempting to execute
dli_inet (sfh_sock_open_clt_inet_stm): Connection refused
=======================================================================

Any ideas?

Thanks,

Wouter.