The first thing I tried was
$ lamboot -v hostfile
where hostfile contains
computer_name cpu=4
Omitting hostfile seems to imply localhost, which is fine.
$ echo $PATH
.:~/bin:~/lam/bin:/opt/intel_cc_80/bin:/opt/intel_fc_80/bin:/usr/kerberos/bin:/usr/local/bin:/bin:/usr/bin:/usr/X11R6/bin
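One thing worth checking in the PATH above: the entries ~/bin and ~/lam/bin contain a literal ~. A shell may expand that when it looks up a command, but a program that walks PATH itself typically treats ~ as an ordinary directory name. Whether this is why hboot picks /usr/bin/lamd is an assumption, not something confirmed in this thread. The Python sketch below (the function names are hypothetical, chosen for illustration) compares a literal PATH search with one that expands ~ first.

```python
import os


def unexpanded_tilde_entries(path):
    """Return the PATH entries that still start with a literal '~'."""
    return [entry for entry in path.split(os.pathsep) if entry.startswith("~")]


def find_in_path(name, path, expand_tilde=False):
    """Return the first executable called `name` found in `path`, or None.

    With expand_tilde=False the entries are used exactly as written, which is
    an assumption about how a non-shell program might search PATH.
    """
    for entry in path.split(os.pathsep):
        if not entry:
            entry = "."  # an empty PATH entry means the current directory
        if expand_tilde:
            entry = os.path.expanduser(entry)
        candidate = os.path.join(entry, name)
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate
    return None


if __name__ == "__main__":
    path = os.environ.get("PATH", "")
    print("entries with a literal ~:", unexpanded_tilde_entries(path))
    print("literal search :", find_in_path("lamd", path))
    print("with ~ expanded:", find_in_path("lamd", path, expand_tilde=True))
```

If the two searches disagree, writing the entry as $HOME/lam/bin instead of ~/lam/bin would be one way to rule this out.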
$ lamboot -d hostfile
n0<25967> ssi:boot: Opening
n0<25967> ssi:boot: opening module globus
n0<25967> ssi:boot: initializing module globus
n0<25967> ssi:boot:globus: globus-job-run not found, globus boot will not run
n0<25967> ssi:boot: module not available: globus
n0<25967> ssi:boot: opening module rsh
n0<25967> ssi:boot: initializing module rsh
n0<25967> ssi:boot:rsh: module initializing
n0<25967> ssi:boot:rsh:agent: rsh
n0<25967> ssi:boot:rsh:username: <same>
n0<25967> ssi:boot:rsh:verbose: 1000
n0<25967> ssi:boot:rsh:algorithm: linear
n0<25967> ssi:boot:rsh:priority: 10
n0<25967> ssi:boot: module available: rsh, priority: 10
n0<25967> ssi:boot: finalizing module globus
n0<25967> ssi:boot:globus: finalizing
n0<25967> ssi:boot: closing module globus
n0<25967> ssi:boot: Selected boot module rsh
LAM 7.0.4/MPI 2 C++/ROMIO - Indiana University
n0<25967> ssi:boot:base: looking for boot schema in following directories:
n0<25967> ssi:boot:base: <current directory>
n0<25967> ssi:boot:base: $TROLLIUSHOME/etc
n0<25967> ssi:boot:base: $LAMHOME/etc
n0<25967> ssi:boot:base: /home/ik20/lam/etc
n0<25967> ssi:boot:base: looking for boot schema file:
n0<25967> ssi:boot:base: hostfile
n0<25967> ssi:boot:base: found boot schema: hostfile
n0<25967> ssi:boot:rsh: found the following hosts:
n0<25967> ssi:boot:rsh: n0 tca1 (cpu=4)
n0<25967> ssi:boot:rsh: resolved hosts:
n0<25967> ssi:boot:rsh: n0 tca1 --> 193.62.112.34 (origin)
n0<25967> ssi:boot:rsh: starting RTE procs
n0<25967> ssi:boot:base:linear: starting
n0<25967> ssi:boot:base:server: opening server TCP socket
n0<25967> ssi:boot:base:server: opened port 1330
n0<25967> ssi:boot:base:linear: booting n0 (tca1)
n0<25967> ssi:boot:rsh: starting lamd on (tca1)
n0<25967> ssi:boot:rsh: starting on n0 (tca1): hboot -t -c lam-conf.lamd -d -I -H 193.62.112.34 -P 1330 -n 0 -o 0
n0<25967> ssi:boot:rsh: launching locally
hboot: process schema = "lam-conf.lamd"
hboot: found /usr/bin/lamd
hboot: performing tkill
hboot: tkill
hboot: booting...
hboot: fork /usr/bin/lamd
[1] 25970 lamd -H 193.62.112.34 -P 1330 -n 0 -o 0 -d
hboot: attempting to execute
n0<25967> ssi:boot:rsh: successfully launched on n0 (tca1)
n0<25967> ssi:boot:base:server: expecting connection from finite list
n0<25967> ssi:boot:base:server: got connection from 193.62.112.34
n0<25967> ssi:boot:base:server: this connection is expected (n0)
-----------------------------------------------------------------------------
The lamboot agent failed to read a message over a socket from the
newly-booted process. This should not happen (especially since TCP is
a guaranteed protocol).
Please check your network connectivity and ensure that messages can be
passed reliably over TCP. Additionally, ensure that the host where
the newly-booted process was launched is healthy and still available
on the network.
-----------------------------------------------------------------------------
n0<25967> ssi:boot:base:server: failed to connect to remote lamd!
n0<25967> ssi:boot:base:server: closing server socket
n0<25967> ssi:boot:base:linear: aborted!
-----------------------------------------------------------------------------
lamboot encountered some error (see above) during the boot process,
and will now attempt to kill all nodes that it was previously able to
boot (if any).
Please wait for LAM to finish; if you interrupt this process, you may
have LAM daemons still running on remote nodes.
-----------------------------------------------------------------------------
lamboot: wipe -- nothing to do
lamboot did NOT complete successfully
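The error text above suggests checking that messages can be passed reliably over TCP. A minimal Python sketch of such a check follows. It only opens a listening socket on one address, connects to it, and confirms that a short message arrives. It does not reproduce the lamboot/lamd handshake, and the function name and defaults are hypothetical.

```python
import socket
import threading


def loopback_round_trip(payload=b"ping", host="127.0.0.1", timeout=5.0):
    """Send `payload` to a local TCP server and return what the server received.

    Returns None if the server side did not receive anything in time.
    """
    received = []
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as server:
        server.bind((host, 0))  # port 0 lets the OS pick a free port
        server.listen(1)
        server.settimeout(timeout)
        port = server.getsockname()[1]

        def serve():
            # Accept one connection and read one short message from it.
            conn, _ = server.accept()
            with conn:
                conn.settimeout(timeout)
                received.append(conn.recv(1024))

        worker = threading.Thread(target=serve)
        worker.start()
        with socket.create_connection((host, port), timeout=timeout) as client:
            client.sendall(payload)
        worker.join()
    return received[0] if received else None


if __name__ == "__main__":
    print(loopback_round_trip())
```

Passing host="193.62.112.34" (the address resolved for tca1 in the log) instead of the default would exercise the same interface lamboot used, assuming that address is configured on the local machine.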
>
> Can you send across the following:
>
> - The command you invoke for lamboot - how many nodes you are booting on?
> It seems you are just booting on the current node with "lamboot -d" w/o
> any hostfile. Just wanted to confirm this.
>
> - The complete output of "lamboot -d"
>
> - The value of your path environment variable
>
> -Vishal
>
> On Tue, 6 Apr 2004, I Kozin wrote:
>
> #
> # Hello,
> #
> # here is the problem:
> # we've got a 4 processor Intel Itanium2 box and want to
> # use LAM (shared memory environment only).
> #
> # LAM 6.5 is already installed, but it was built
> # using gcc (v2.95) and I cannot link a code compiled using
> # Intel Fortran 8.0 with the existing LAM (MPI function
> # names are not resolved).
> #
> # This is a known problem according to the LAM FAQ
> # and the solution is to rebuild LAM. OK, I downloaded
> # LAM 7.0.4 and compiled it. Now, I don't want to remove
> # the old LAM because it might be useful if someone wants
> # to use gcc. Instead I decided to install LAM locally
> # in my home directory. I prepended the new LAM path to the
> # PATH variable so that it overrides the old one.
> # I also pointed LAMHOME to the local dir (just in case).
> #
> # I did not see any problems during make and install,
> # but when I run lamboot it returns an error.
> # Although laminfo points to the local dir,
> #
> # "lamboot -d" shows
> # ...
> # hboot: found /usr/bin/lamd
> #
> # which it should not. ["which lamd" points to my local dir as well]
> #
> # and after that
> #
> # hboot: performing tkill
> # hboot: tkill
> # hboot: booting...
> # hboot: fork /usr/bin/lamd
> # [1] 25211 lamd -H 127.0.0.1 -P 1324 -n 0 -o 0 -d
> # hboot: attempting to execute
> # n0<25208> ssi:boot:rsh: successfully launched on n0 (localhost)
> # n0<25208> ssi:boot:base:server: expecting connection from finite list
> # n0<25208> ssi:boot:base:server: got connection from 127.0.0.1
> # n0<25208> ssi:boot:base:server: this connection is expected (n0)
> # -----------------------------------------------------------------------------
> # The lamboot agent failed to read a message over a socket from the
> # newly-booted process. This should not happen (especially since TCP is
> # a guaranteed protocol).
> #
> # Please check your network connectivity and ensure that messages can be
> # passed reliably over TCP. Additionally, ensure that the host where
> # the newly-booted process was launched is healthy and still available
> # on the network.
> # -----------------------------------------------------------------------------
> # n0<25208> ssi:boot:base:server: failed to connect to remote lamd!
> # n0<25208> ssi:boot:base:server: closing server socket
> # n0<25208> ssi:boot:base:linear: aborted!
> #
> # What is going on?
> # Your help is greatly appreciated!
> #
> # Igor
> #
> # config.log, make.log and make-install.log can be sent on request.
> # _______________________________________________
> # This list is archived at http://www.lam-mpi.org/MailArchives/lam/
> #