LAM/MPI General User's Mailing List Archives

From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2003-09-11 08:34:27


I have a few recommendations:

- If possible, upgrade to 7.0. 7.0 includes a *lot* more debugging
  output during the lamboot process, specifically to handle situations
  like this where the 6.5 series didn't really indicate what went
  wrong.

- If you can't upgrade to 7.0, check the following:

  - Ensure that /tmp is writable

  - Ensure that the LAM you have installed is compiled for that
    machine's OS (I'm not familiar with SuSE, but I'm guessing that
    something compiled for SuSE 7.3 may not run properly on a SuSE 8.1
    machine, and vice versa)

  - Ensure that you have the same version of LAM installed on both
    machines.
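
Here is a rough Python sketch of two of the checks above; it is not a
LAM tool. It assumes Python is available on each machine, and the
function names (tmp_is_writable, lamd_paths) are made up for this
example. The first function tries to create a file in /tmp. The second
pulls the lamd paths out of saved lamboot output; different paths on
the two nodes only hint at separate installs and do not prove a
version mismatch.

```python
import re
import tempfile


def tmp_is_writable(path="/tmp"):
    """Return True if a file can actually be created in `path`."""
    try:
        # Creating (and auto-deleting) a real file is a stronger test
        # than only looking at the directory's permission bits.
        with tempfile.NamedTemporaryFile(dir=path):
            return True
    except OSError:
        return False


def lamd_paths(boot_output):
    """Collect the distinct lamd paths from 'hboot: found ...' lines."""
    return sorted(set(re.findall(r"hboot: found (\S+)", boot_output)))


if __name__ == "__main__":
    print("/tmp writable:", tmp_is_writable())
```

Run it on each node. To compare installs, save the lamboot output to a
file and pass its contents to lamd_paths(); more than one path in the
result means the nodes are starting lamd from different locations.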

-- 
{+} Jeff Squyres
{+} jsquyres_at_[hidden]
{+} http://www.lam-mpi.org/

On Thu, 11 Sep 2003, Wouter Brok wrote:
> Hi,
>
> Here is the output of `lamboot -h lamhosts'. Does anyone have a
> suggestion on how to proceed with this?
>
>
> =======================================================================
> hboot: process schema = "/usr/local/etc/lam-conf.lam"
> hboot: found /usr/local/bin/lamd
> hboot: performing tkill
> hboot: tkill
> hboot: booting...
> hboot: fork /usr/local/bin/lamd
> [1]  12846 lamd -H 131.155.111.159 -P 21081 -n 0 -o 0 -d
> lamboot: attempting to execute "ssh -x 131.155.113.140 -n echo $SHELL"
> lamboot: got remote shell /bin/bash
> lamboot: attempting to execute "ssh -x 131.155.113.140 -n hboot -t -c lam-conf.lam -d -s -I "-H 131.155.111.159 -P 21081 -n 1 -o 0    ""
> hboot: process schema = "/etc/opt/lam/lam-conf.lam"
>
> LAM 6.5.6/MPI 2 C++/ROMIO - University of Notre Dame
>
> lamboot: boot schema file: lamhosts
> lamboot: opening hostfile lamhosts
> lamboot: found the following hosts:
> lamboot:   n0 131.155.111.159
> lamboot:   n1 131.155.113.140
> lamboot: resolved hosts:
> lamboot:   n0 131.155.111.159 --> 131.155.111.159
> lamboot:   n1 131.155.113.140 --> 131.155.113.140
> lamboot: found 2 host node(s)
> lamboot: origin node is 0 (131.155.111.159)
> lamboot: attempting to execute "hboot -t -c lam-conf.lam -d -I " -H 131.155.111.159 -P 21081 -n 0 -o 0     ""
> hboot: found /opt/lam/bin/lamd
> hboot: performing tkill
> hboot: tkill
> hboot: booting...
> hboot: fork /opt/lam/bin/lamd
> [1]   8819 lamd -H 131.155.111.159 -P 21081 -n 1 -o 0 -d
> -----------------------------------------------------------------------------
> lamboot encountered some error (see above) during the boot process,
> and will now attempt to kill all nodes that it was previously able to
> boot (if any).
>
> Please wait for LAM to finish; if you interrupt this process, you may
> have LAM daemons still running on remote nodes.
> -----------------------------------------------------------------------------
>
> LAM 6.5.6/MPI 2 C++/ROMIO - University of Notre Dame
>
> Executing tkill on n0 (131.155.111.159)...
> Executing tkill on n1 (131.155.113.140)...
> wipe ...
> lamboot did NOT complete successfully
> =======================================================================
>
>
> As you can see I have two hosts in my lamhosts file. LAMRSH is set to
> "ssh -x". Furthermore, one (n0) is a SuSE 7.3 machine, the other (n1) a
> SuSE 8.1 machine.
>
> As a test I ran lamboot with a hostfile containing only the IP address
> of the local machine. On the first machine (n0 above) this worked just
> fine; doing the same on the second machine, I got:
>
>
> =======================================================================
> LAM 6.5.6/MPI 2 C++/ROMIO - University of Notre Dame
>
> lamboot: boot schema file: lamhosts
> lamboot: opening hostfile lamhosts
> lamboot: found the following hosts:
> lamboot:   n0 131.155.113.140
> lamboot: resolved hosts:
> lamboot:   n0 131.155.113.140 --> 131.155.113.140
> lamboot: found 1 host node(s)
> lamboot: origin node is 0 (131.155.113.140)
> lamboot: attempting to execute "hboot -t -c lam-conf.lam -d -I " -H 131.155.113.140 -P 32920 -n 0 -o 0     ""
> hboot: process schema = "/etc/opt/lam/lam-conf.lam"
> hboot: found /opt/lam/bin/lamd
> hboot: performing tkill
> hboot: tkill
> hboot: booting...
> hboot: fork /opt/lam/bin/lamd
> [1]   8964 lamd -H 131.155.113.140 -P 32920 -n 0 -o 0 -d
> hboot: attempting to execute
> -----------------------------------------------------------------------------
> lamboot encountered some error (see above) during the boot process,
> and will now attempt to kill all nodes that it was previously able to
> boot (if any).
>
> Please wait for LAM to finish; if you interrupt this process, you may
> have LAM daemons still running on remote nodes.
> -----------------------------------------------------------------------------
> lamwipe ...
> -----------------------------------------------------------------------------
> *** Oops -- cannot find the help that you're supposed to get.
> *** Using the following help file:
> ***
> ***    /etc/opt/lam/lam-helpfile
> ***
> *** You were supposed to get help on the program "boot"
> *** about the topic "wipe-fail"
> *** But it doesn't seem to be in that file.
> ***
> *** Sorry!
> -----------------------------------------------------------------------------
> =======================================================================
>
>
> Subsequently trying
>
>   hboot -t -c lam-conf.lam -d -I " -H 131.155.113.140 -P 32920 -n 0 -o 0     "
>
> gave me:
>
> =======================================================================
> hboot: process schema = "/etc/opt/lam/lam-conf.lam"
> hboot: found /opt/lam/bin/lamd
> hboot: performing tkill
> hboot: tkill
> hboot: booting...
> hboot: fork /opt/lam/bin/lamd
> [1]   8980 lamd -H 131.155.113.140 -P 32920 -n 0 -o 0 -d
> hboot: attempting to execute
> dli_inet (sfh_sock_open_clt_inet_stm): Connection refused
> =======================================================================
>
>
> Any ideas?
>
> Thanks,
>
> Wouter.