Hello Everyone,
I installed lam-mpi-7.1.2 successfully. The 'recon' command
passes, but 'lamboot -d' fails. The cluster has 5 nodes
(c0101, c0102, c0103, c0104, c0105), all of which already resolve
via /etc/hosts. I'm sure lam-bhost.def is fine, and lam-conf.lamd
is left at its default. I can 'rsh' to any node without a password.
The output of 'lamboot -d' on node 'c0101' is below. Thanks a
lot!
[alookdo@c0101 alookdo]$ lamboot -d
n-1<27432> ssi:boot:open: opening
n-1<27432> ssi:boot:open: opening boot module globus
n-1<27432> ssi:boot:open: opened boot module globus
n-1<27432> ssi:boot:open: opening boot module rsh
n-1<27432> ssi:boot:open: opened boot module rsh
n-1<27432> ssi:boot:open: opening boot module slurm
n-1<27432> ssi:boot:open: opened boot module slurm
n-1<27432> ssi:boot:select: initializing boot module slurm
n-1<27432> ssi:boot:slurm: not running under SLURM
n-1<27432> ssi:boot:select: boot module not available: slurm
n-1<27432> ssi:boot:select: initializing boot module globus
n-1<27432> ssi:boot:globus: globus-job-run not found, globus boot will not run
n-1<27432> ssi:boot:select: boot module not available: globus
n-1<27432> ssi:boot:select: initializing boot module rsh
n-1<27432> ssi:boot:rsh: module initializing
n-1<27432> ssi:boot:rsh:agent: rsh
n-1<27432> ssi:boot:rsh:username: <same>
n-1<27432> ssi:boot:rsh:verbose: 1000
n-1<27432> ssi:boot:rsh:algorithm: linear
n-1<27432> ssi:boot:rsh:no_n: 0
n-1<27432> ssi:boot:rsh:no_profile: 0
n-1<27432> ssi:boot:rsh:fast: 0
n-1<27432> ssi:boot:rsh:ignore_stderr: 0
n-1<27432> ssi:boot:rsh:priority: 75
n-1<27432> ssi:boot:select: boot module available: rsh, priority: 75
n-1<27432> ssi:boot:select: finalizing boot module slurm
n-1<27432> ssi:boot:slurm: finalizing
n-1<27432> ssi:boot:select: closing boot module slurm
n-1<27432> ssi:boot:select: finalizing boot module globus
n-1<27432> ssi:boot:globus: finalizing
n-1<27432> ssi:boot:select: closing boot module globus
n-1<27432> ssi:boot:select: selected boot module rsh
n-1<27432> ssi:boot:base: looking for boot schema in following directories:
n-1<27432> ssi:boot:base: <current directory>
n-1<27432> ssi:boot:base: $TROLLIUSHOME/etc
n-1<27432> ssi:boot:base: $LAMHOME/etc
n-1<27432> ssi:boot:base: /export/home/alookdo/lam/etc
n-1<27432> ssi:boot:base: looking for boot schema file:
n-1<27432> ssi:boot:base: lam-bhost.def
n-1<27432> ssi:boot:base: found boot schema: /export/home/alookdo/lam/etc/lam-bhost.def
n-1<27432> ssi:boot:rsh: found the following hosts:
n-1<27432> ssi:boot:rsh: n0 c0101 (cpu=1)
n-1<27432> ssi:boot:rsh: n1 c0102 (cpu=1)
n-1<27432> ssi:boot:rsh: n2 c0103 (cpu=1)
n-1<27432> ssi:boot:rsh: n3 c0104 (cpu=1)
n-1<27432> ssi:boot:rsh: n4 c0105 (cpu=1)
n-1<27432> ssi:boot:rsh: resolved hosts:
n-1<27432> ssi:boot:rsh: n0 c0101 --> 192.168.1.1 (origin)
n-1<27432> ssi:boot:rsh: n1 c0102 --> 192.168.1.2
n-1<27432> ssi:boot:rsh: n2 c0103 --> 192.168.1.3
n-1<27432> ssi:boot:rsh: n3 c0104 --> 192.168.1.4
n-1<27432> ssi:boot:rsh: n4 c0105 --> 192.168.1.5
n-1<27432> ssi:boot:rsh: starting RTE procs
n-1<27432> ssi:boot:base:linear: starting
n-1<27432> ssi:boot:base:server: opening server TCP socket
n-1<27432> ssi:boot:base:server: opened port 33865
n-1<27432> ssi:boot:base:linear: booting n0 (c0101)
n-1<27432> ssi:boot:rsh: starting lamd on (c0101)
n-1<27432> ssi:boot:rsh: starting on n0 (c0101): hboot -t -c lam-conf.lamd-d -I -H 192.168.1.1 -P 33865 -n 0 -o 0
n-1<27432> ssi:boot:rsh: launching locally
n-1<27432> ssi:boot:rsh: successfully launched on n0 (c0101)
n-1<27432> ssi:boot:base:server: expecting connection from finite list
n-1<27435> ssi:boot:open: opening
n-1<27435> ssi:boot:open: opening boot module globus
n-1<27435> ssi:boot:open: opened boot module globus
n-1<27435> ssi:boot:open: opening boot module rsh
n-1<27435> ssi:boot:open: opened boot module rsh
n-1<27435> ssi:boot:open: opening boot module slurm
n-1<27435> ssi:boot:open: opened boot module slurm
n-1<27435> ssi:boot:select: initializing boot module slurm
n-1<27435> ssi:boot:slurm: not running under SLURM
n-1<27435> ssi:boot:select: boot module not available: slurm
n-1<27435> ssi:boot:select: initializing boot module globus
n-1<27435> ssi:boot:globus: globus-job-run not found, globus boot will not run
n-1<27435> ssi:boot:select: boot module not available: globus
n-1<27435> ssi:boot:select: initializing boot module rsh
n-1<27435> ssi:boot:rsh: module initializing
n-1<27435> ssi:boot:rsh:agent: rsh
n-1<27435> ssi:boot:rsh:username: <same>
n-1<27435> ssi:boot:rsh:verbose: 1000
n-1<27435> ssi:boot:rsh:algorithm: linear
n-1<27435> ssi:boot:rsh:no_n: 0
n-1<27435> ssi:boot:rsh:no_profile: 0
n-1<27435> ssi:boot:rsh:fast: 0
n-1<27435> ssi:boot:rsh:ignore_stderr: 0
n-1<27435> ssi:boot:rsh:priority: 75
n-1<27435> ssi:boot:select: boot module available: rsh, priority: 75
n-1<27435> ssi:boot:select: finalizing boot module slurm
n-1<27435> ssi:boot:slurm: finalizing
n-1<27435> ssi:boot:select: closing boot module slurm
n-1<27435> ssi:boot:select: finalizing boot module globus
n-1<27435> ssi:boot:globus: finalizing
n-1<27435> ssi:boot:select: closing boot module globus
n-1<27435> ssi:boot:select: selected boot module rsh
n-1<27435> ssi:boot:send_lamd: getting node ID from command line
n-1<27435> ssi:boot:send_lamd: getting agent haddr from command line
n-1<27435> ssi:boot:send_lamd: getting agent port from command line
n-1<27435> ssi:boot:send_lamd: getting node ID from command line
n-1<27435> ssi:boot:send_lamd: connecting to 192.168.1.1:33865, node id 0
n-1<27435> ssi:boot:send_lamd: sending dli_port 32908
n-1<27432> ssi:boot:base:server: got connection from 192.168.1.1
n-1<27432> ssi:boot:base:server: this connection is expected (n0)
n-1<27432> ssi:boot:base:server: remote lamd is at 192.168.1.1:32908
n-1<27432> ssi:boot:base:linear: booting n1 (c0102)
n-1<27432> ssi:boot:rsh: starting lamd on (c0102)
n-1<27432> ssi:boot:rsh: starting on n1 (c0102): hboot -t -c lam-conf.lamd-d -s -I "-H 192.168.1.1 -P 33865 -n 1 -o 0"
n-1<27432> ssi:boot:rsh: launching remotely
n-1<27432> ssi:boot:rsh: attempting to execute: rsh c0102 -n 'echo $SHELL'
n-1<27432> ssi:boot:rsh: remote shell /bin/bash
n-1<27432> ssi:boot:rsh: attempting to execute: rsh c0102 -n hboot -t -c lam-conf.lamd -d -s -I '"-H 192.168.1.1 -P 33865 -n 1 -o 0"'
n-1<27432> ssi:boot:rsh: successfully launched on n1 (c0102)
n-1<27432> ssi:boot:base:server: expecting connection from finite list
n-1<27432> ssi:boot:base:server: got connection from 192.168.1.2
n-1<27432> ssi:boot:base:server: this connection is expected (n1)
-----------------------------------------------------------------------------
The lamboot agent failed to read a message over a socket from the
newly-booted process. This should not happen (especially since TCP is
a guaranteed protocol).
*** PLEASE READ THIS ENTIRE MESSAGE, FOLLOW ITS SUGGESTIONS, AND
*** CONSULT THE "BOOTING LAM" SECTION OF THE LAM/MPI FAQ
*** (http://www.lam-mpi.org/faq/) BEFORE POSTING TO THE LAM/MPI USER'S
*** MAILING LIST.
You should probably check the following:
- Network connectivity: Ensure that messages can be passed reliably
over TCP using random ports.
- Environment / PATH settings: Ensure that you are running the same
version of LAM/MPI on all nodes. Sometimes premature disconnects
(and therefore this error message) may be caused if mismatched
versions of LAM are used on different nodes.
- Node health: Ensure that the host where the newly-booted process was
launched is healthy and still available on the network.
-----------------------------------------------------------------------------
n-1<27432> ssi:boot:base:server: failed to connect to remote lamd!
n-1<27432> ssi:boot:base:server: closing server socket
n-1<27432> ssi:boot:base:linear: aborted!
n-1<27438> ssi:boot:open: opening
n-1<27438> ssi:boot:open: opening boot module globus
n-1<27438> ssi:boot:open: opened boot module globus
n-1<27438> ssi:boot:open: opening boot module rsh
n-1<27438> ssi:boot:open: opened boot module rsh
n-1<27438> ssi:boot:open: opening boot module slurm
n-1<27438> ssi:boot:open: opened boot module slurm
n-1<27438> ssi:boot:select: initializing boot module slurm
n-1<27438> ssi:boot:slurm: not running under SLURM
n-1<27438> ssi:boot:select: boot module not available: slurm
n-1<27438> ssi:boot:select: initializing boot module globus
n-1<27438> ssi:boot:globus: globus-job-run not found, globus boot will not run
n-1<27438> ssi:boot:select: boot module not available: globus
n-1<27438> ssi:boot:select: initializing boot module rsh
n-1<27438> ssi:boot:rsh: module initializing
n-1<27438> ssi:boot:rsh:agent: rsh
n-1<27438> ssi:boot:rsh:username: <same>
n-1<27438> ssi:boot:rsh:verbose: 1000
n-1<27438> ssi:boot:rsh:algorithm: linear
n-1<27438> ssi:boot:rsh:no_n: 0
n-1<27438> ssi:boot:rsh:no_profile: 0
n-1<27438> ssi:boot:rsh:fast: 0
n-1<27438> ssi:boot:rsh:ignore_stderr: 0
n-1<27438> ssi:boot:rsh:priority: 75
n-1<27438> ssi:boot:select: boot module available: rsh, priority: 75
n-1<27438> ssi:boot:select: finalizing boot module slurm
n-1<27438> ssi:boot:slurm: finalizing
n-1<27438> ssi:boot:select: closing boot module slurm
n-1<27438> ssi:boot:select: finalizing boot module globus
n-1<27438> ssi:boot:globus: finalizing
n-1<27438> ssi:boot:select: closing boot module globus
n-1<27438> ssi:boot:select: selected boot module rsh
n-1<27438> ssi:boot:base: looking for boot schema in following directories:
n-1<27438> ssi:boot:base: <current directory>
n-1<27438> ssi:boot:base: $TROLLIUSHOME/etc
n-1<27438> ssi:boot:base: $LAMHOME/etc
n-1<27438> ssi:boot:base: /export/home/alookdo/lam/etc
n-1<27438> ssi:boot:base: looking for boot schema file:
n-1<27438> ssi:boot:base: lam-bhost.def
n-1<27438> ssi:boot:base: found boot schema: /export/home/alookdo/lam/etc/lam-bhost.def
n-1<27438> ssi:boot:rsh: found the following hosts:
n-1<27438> ssi:boot:rsh: n0 c0101 (cpu=1)
n-1<27438> ssi:boot:rsh: n1 c0102 (cpu=1)
n-1<27438> ssi:boot:rsh: n2 c0103 (cpu=1)
n-1<27438> ssi:boot:rsh: n3 c0104 (cpu=1)
n-1<27438> ssi:boot:rsh: n4 c0105 (cpu=1)
n-1<27438> ssi:boot:rsh: resolved hosts:
n-1<27438> ssi:boot:rsh: n0 c0101 --> 192.168.1.1 (origin)
n-1<27438> ssi:boot:rsh: n1 c0102 --> 192.168.1.2
n-1<27438> ssi:boot:rsh: n2 c0103 --> 192.168.1.3
n-1<27438> ssi:boot:rsh: n3 c0104 --> 192.168.1.4
n-1<27438> ssi:boot:rsh: n4 c0105 --> 192.168.1.5
n-1<27438> ssi:boot:rsh: starting RTE procs
n-1<27438> ssi:boot:base:linear: starting
n-1<27438> ssi:boot:base:linear: booting n0 (c0101)
n-1<27438> ssi:boot:rsh: starting wipe on (c0101)
n-1<27438> ssi:boot:rsh: starting on n0 (c0101): tkill -d
n-1<27438> ssi:boot:rsh: launching locally
n-1<27438> ssi:boot:rsh: successfully launched on n0 (c0101)
n-1<27438> ssi:boot:base:linear: booting n1 (c0102)
n-1<27438> ssi:boot:rsh: starting wipe on (c0102)
n-1<27438> ssi:boot:rsh: starting on n1 (c0102): tkill -d
n-1<27438> ssi:boot:rsh: launching remotely
n-1<27438> ssi:boot:rsh: attempting to execute: rsh c0102 -n 'echo $SHELL'
n-1<27438> ssi:boot:rsh: remote shell /bin/bash
n-1<27438> ssi:boot:rsh: attempting to execute: rsh c0102 -n tkill -d
n-1<27438> ssi:boot:rsh: successfully launched on n1 (c0102)
n-1<27438> ssi:boot:base:linear: booting n2 (c0103)
n-1<27438> ssi:boot:rsh: starting wipe on (c0103)
n-1<27438> ssi:boot:rsh: starting on n2 (c0103): tkill -d
n-1<27438> ssi:boot:rsh: launching remotely
n-1<27438> ssi:boot:rsh: attempting to execute: rsh c0103 -n 'echo $SHELL'
n-1<27438> ssi:boot:rsh: remote shell /bin/bash
n-1<27438> ssi:boot:rsh: attempting to execute: rsh c0103 -n tkill -d
n-1<27438> ssi:boot:rsh: successfully launched on n2 (c0103)
n-1<27438> ssi:boot:base:linear: booting n3 (c0104)
n-1<27438> ssi:boot:rsh: starting wipe on (c0104)
n-1<27438> ssi:boot:rsh: starting on n3 (c0104): tkill -d
n-1<27438> ssi:boot:rsh: launching remotely
n-1<27438> ssi:boot:rsh: attempting to execute: rsh c0104 -n 'echo $SHELL'
n-1<27438> ssi:boot:rsh: remote shell /bin/bash
n-1<27438> ssi:boot:rsh: attempting to execute: rsh c0104 -n tkill -d
n-1<27438> ssi:boot:rsh: successfully launched on n3 (c0104)
n-1<27438> ssi:boot:base:linear: booting n4 (c0105)
n-1<27438> ssi:boot:rsh: starting wipe on (c0105)
n-1<27438> ssi:boot:rsh: starting on n4 (c0105): tkill -d
n-1<27438> ssi:boot:rsh: launching remotely
n-1<27438> ssi:boot:rsh: attempting to execute: rsh c0105 -n 'echo $SHELL'
n-1<27438> ssi:boot:rsh: remote shell /bin/bash
n-1<27438> ssi:boot:rsh: attempting to execute: rsh c0105 -n tkill -d
n-1<27438> ssi:boot:rsh: successfully launched on n4 (c0105)
n-1<27438> ssi:boot:base:linear: finished
n-1<27438> ssi:boot:rsh: all RTE procs started
n-1<27438> ssi:boot:rsh: finalizing
n-1<27438> ssi:boot: Closing
lamboot did NOT complete successfully
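
Since the log is long, here is a small Python sketch for finding which node was being booted when the linear boot aborted. It assumes only the line formats visible in the output above ("booting nX (host)" and "aborted!"). The function name failed_node and the sample text are made up for illustration and are not part of LAM/MPI.

```python
import re

# Matches lines such as:
#   n-1<27432> ssi:boot:base:linear: booting n1 (c0102)
BOOTING = re.compile(r"ssi:boot:base:linear: booting (n\d+) \((\S+)\)")
ABORTED = "ssi:boot:base:linear: aborted!"


def failed_node(log_text):
    """Return (node, host) that was being booted when the abort was logged.

    Returns None if the log contains no abort line.
    """
    current = None
    for line in log_text.splitlines():
        match = BOOTING.search(line)
        if match:
            # Remember the most recent node the linear algorithm started on.
            current = (match.group(1), match.group(2))
        elif ABORTED in line:
            return current
    return None


# A shortened excerpt in the same format as the log above.
sample = """\
n-1<27432> ssi:boot:base:linear: booting n0 (c0101)
n-1<27432> ssi:boot:base:linear: booting n1 (c0102)
n-1<27432> ssi:boot:base:server: failed to connect to remote lamd!
n-1<27432> ssi:boot:base:linear: aborted!
"""

print(failed_node(sample))  # ('n1', 'c0102')
```

On the excerpt it reports n1 (c0102), which matches the full log: the abort comes right after the connection from 192.168.1.2 is accepted.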