Hello again,
My previous mail was not precise. After redefining lam-conf.lamd as
follows:
/export/home/alookdo/lam/bin/lamd $inet_topo $debug $session_prefix $session_suffix
'lamboot -d' produces the output below. Thanks again, and sorry for the
previous post.
[alookdo_at_c0101 alookdo]$ lamboot -d
n-1<8165> ssi:boot:open: opening
n-1<8165> ssi:boot:open: opening boot module globus
n-1<8165> ssi:boot:open: opened boot module globus
n-1<8165> ssi:boot:open: opening boot module rsh
n-1<8165> ssi:boot:open: opened boot module rsh
n-1<8165> ssi:boot:open: opening boot module slurm
n-1<8165> ssi:boot:open: opened boot module slurm
n-1<8165> ssi:boot:select: initializing boot module slurm
n-1<8165> ssi:boot:slurm: not running under SLURM
n-1<8165> ssi:boot:select: boot module not available: slurm
n-1<8165> ssi:boot:select: initializing boot module globus
n-1<8165> ssi:boot:globus: globus-job-run not found, globus boot will not run
n-1<8165> ssi:boot:select: boot module not available: globus
n-1<8165> ssi:boot:select: initializing boot module rsh
n-1<8165> ssi:boot:rsh: module initializing
n-1<8165> ssi:boot:rsh:agent: rsh
n-1<8165> ssi:boot:rsh:username: <same>
n-1<8165> ssi:boot:rsh:verbose: 1000
n-1<8165> ssi:boot:rsh:algorithm: linear
n-1<8165> ssi:boot:rsh:no_n: 0
n-1<8165> ssi:boot:rsh:no_profile: 0
n-1<8165> ssi:boot:rsh:fast: 0
n-1<8165> ssi:boot:rsh:ignore_stderr: 0
n-1<8165> ssi:boot:rsh:priority: 75
n-1<8165> ssi:boot:select: boot module available: rsh, priority: 75
n-1<8165> ssi:boot:select: finalizing boot module slurm
n-1<8165> ssi:boot:slurm: finalizing
n-1<8165> ssi:boot:select: closing boot module slurm
n-1<8165> ssi:boot:select: finalizing boot module globus
n-1<8165> ssi:boot:globus: finalizing
n-1<8165> ssi:boot:select: closing boot module globus
n-1<8165> ssi:boot:select: selected boot module rsh
LAM 7.1.2/MPI 2 C++ - Indiana University
n-1<8165> ssi:boot:base: looking for boot schema in following directories:
n-1<8165> ssi:boot:base: <current directory>
n-1<8165> ssi:boot:base: $TROLLIUSHOME/etc
n-1<8165> ssi:boot:base: $LAMHOME/etc
n-1<8165> ssi:boot:base: /export/home/alookdo/lam/etc
n-1<8165> ssi:boot:base: looking for boot schema file:
n-1<8165> ssi:boot:base: lam-bhost.def
n-1<8165> ssi:boot:base: found boot schema: /export/home/alookdo/lam/etc/lam-bhost.def
n-1<8165> ssi:boot:rsh: found the following hosts:
n-1<8165> ssi:boot:rsh: n0 c0101 (cpu=1)
n-1<8165> ssi:boot:rsh: n1 c0102 (cpu=1)
n-1<8165> ssi:boot:rsh: n2 c0103 (cpu=1)
n-1<8165> ssi:boot:rsh: resolved hosts:
n-1<8165> ssi:boot:rsh: n0 c0101 --> 192.168.1.1 (origin)
n-1<8165> ssi:boot:rsh: n1 c0102 --> 192.168.1.2
n-1<8165> ssi:boot:rsh: n2 c0103 --> 192.168.1.3
n-1<8165> ssi:boot:rsh: starting RTE procs
n-1<8165> ssi:boot:base:linear: starting
n-1<8165> ssi:boot:base:server: opening server TCP socket
n-1<8165> ssi:boot:base:server: opened port 33973
n-1<8165> ssi:boot:base:linear: booting n0 (c0101)
n-1<8165> ssi:boot:rsh: starting lamd on (c0101)
n-1<8165> ssi:boot:rsh: starting on n0 (c0101): hboot -t -c lam-conf.lamd -d -I -H 192.168.1.1 -P 33973 -n 0 -o 0
n-1<8165> ssi:boot:rsh: launching locally
hboot: performing tkill
hboot: tkill -d
tkill: setting prefix to (null)
tkill: setting suffix to (null)
tkill: got killname back: /tmp/lam-alookdo_at_c0101/lam-killfile
tkill: f_kill = "/tmp/lam-alookdo_at_c0101/lam-killfile"
tkill: nothing to kill: "/tmp/lam-alookdo_at_c0101/lam-killfile"
hboot: booting...
hboot: fork /export/home/alookdo/lam/bin/lamd
hboot: attempting to execute
[1] 8168 lamd -H 192.168.1.1 -P 33973 -n 0 -o 0 -d
n-1<8165> ssi:boot:rsh: successfully launched on n0 (c0101)
n-1<8165> ssi:boot:base:server: expecting connection from finite list
n-1<8168> ssi:boot:open: opening
n-1<8168> ssi:boot:open: opening boot module globus
n-1<8168> ssi:boot:open: opened boot module globus
n-1<8168> ssi:boot:open: opening boot module rsh
n-1<8168> ssi:boot:open: opened boot module rsh
n-1<8168> ssi:boot:open: opening boot module slurm
n-1<8168> ssi:boot:open: opened boot module slurm
n-1<8168> ssi:boot:select: initializing boot module slurm
n-1<8168> ssi:boot:slurm: not running under SLURM
n-1<8168> ssi:boot:select: boot module not available: slurm
n-1<8168> ssi:boot:select: initializing boot module globus
n-1<8168> ssi:boot:globus: globus-job-run not found, globus boot will not run
n-1<8168> ssi:boot:select: boot module not available: globus
n-1<8168> ssi:boot:select: initializing boot module rsh
n-1<8168> ssi:boot:rsh: module initializing
n-1<8168> ssi:boot:rsh:agent: rsh
n-1<8168> ssi:boot:rsh:username: <same>
n-1<8168> ssi:boot:rsh:verbose: 1000
n-1<8168> ssi:boot:rsh:algorithm: linear
n-1<8168> ssi:boot:rsh:no_n: 0
n-1<8168> ssi:boot:rsh:no_profile: 0
n-1<8168> ssi:boot:rsh:fast: 0
n-1<8168> ssi:boot:rsh:ignore_stderr: 0
n-1<8168> ssi:boot:rsh:priority: 75
n-1<8168> ssi:boot:select: boot module available: rsh, priority: 75
n-1<8168> ssi:boot:select: finalizing boot module slurm
n-1<8168> ssi:boot:slurm: finalizing
n-1<8168> ssi:boot:select: closing boot module slurm
n-1<8168> ssi:boot:select: finalizing boot module globus
n-1<8168> ssi:boot:globus: finalizing
n-1<8168> ssi:boot:select: closing boot module globus
n-1<8168> ssi:boot:select: selected boot module rsh
n-1<8168> ssi:boot:send_lamd: getting node ID from command line
n-1<8168> ssi:boot:send_lamd: getting agent haddr from command line
n-1<8168> ssi:boot:send_lamd: getting agent port from command line
n-1<8168> ssi:boot:send_lamd: getting node ID from command line
n-1<8168> ssi:boot:send_lamd: connecting to 192.168.1.1:33973, node id 0
n-1<8168> ssi:boot:send_lamd: sending dli_port 32919
n-1<8165> ssi:boot:base:server: got connection from 192.168.1.1
n-1<8165> ssi:boot:base:server: this connection is expected (n0)
n-1<8165> ssi:boot:base:server: remote lamd is at 192.168.1.1:32919
n-1<8165> ssi:boot:base:linear: booting n1 (c0102)
n-1<8165> ssi:boot:rsh: starting lamd on (c0102)
n-1<8165> ssi:boot:rsh: starting on n1 (c0102): hboot -t -c lam-conf.lamd -d -s -I "-H 192.168.1.1 -P 33973 -n 1 -o 0"
n-1<8165> ssi:boot:rsh: launching remotely
n-1<8165> ssi:boot:rsh: attempting to execute: rsh c0102 -n 'echo $SHELL'
n-1<8165> ssi:boot:rsh: remote shell /bin/bash
n-1<8165> ssi:boot:rsh: attempting to execute: rsh c0102 -n hboot -t -c lam-conf.lamd -d -s -I '"-H 192.168.1.1 -P 33973 -n 1 -o 0"'
hboot: process schema = "lam-conf.lamd"
hboot: found /export/home/alookdo/lam/bin/lamd
hboot: performing tkill
hboot: tkill
hboot: booting...
hboot: fork /export/home/alookdo/lam/bin/lamd
[1] 9500 lamd -H 192.168.1.1 -P 33973 -n 1 -o 0 -d
n-1<8165> ssi:boot:rsh: successfully launched on n1 (c0102)
n-1<8165> ssi:boot:base:server: expecting connection from finite list
-----------------------------------------------------------------------------
The lamboot agent timed out while waiting for the newly-booted process
to call back and indicated that it had successfully booted.

*** PLEASE READ THIS ENTIRE MESSAGE, FOLLOW ITS SUGGESTIONS, AND
*** CONSULT THE "BOOTING LAM" SECTION OF THE LAM/MPI FAQ
*** (http://www.lam-mpi.org/faq/) BEFORE POSTING TO THE LAM/MPI USER'S
*** MAILING LIST.

As far as LAM could tell, the remote process started properly, but
then never called back. Possible reasons that this may happen:

    - There are network filters between the lamboot agent host and
      the remote host such that communication on random TCP ports
      is blocked
    - Network routing from the remote host to the local host isn't
      properly configured (this is uncommon)

You can check these things by watching the output from "lamboot -d".

1. On the command line for hboot, there are two important parameters:
   one is the IP address of where the lamboot agent was invoked, the
   other is the port number that the lamboot agent is expecting the
   newly-booted process to call back on (this will be a random
   integer).

2. Manually login to the remote machine and try to telnet to the port
   indicated on the hboot command line. For example,

       telnet <ipnumber> <portnumber>

   If all goes well, you should get a "Connection refused" error. If
   you get any other kind of error, it could indicate either of the
   two conditions above. Consult with your system/network
   administrator.
-----------------------------------------------------------------------------
n-1<8165> ssi:boot:base:server: failed to connect to remote lamd!
n-1<8165> ssi:boot:base:server: closing server socket
n-1<8165> ssi:boot:base:linear: aborted!
n-1<8253> ssi:boot:open: opening
n-1<8253> ssi:boot:open: opening boot module globus
n-1<8253> ssi:boot:open: opened boot module globus
n-1<8253> ssi:boot:open: opening boot module rsh
n-1<8253> ssi:boot:open: opened boot module rsh
n-1<8253> ssi:boot:open: opening boot module slurm
n-1<8253> ssi:boot:open: opened boot module slurm
n-1<8253> ssi:boot:select: initializing boot module slurm
n-1<8253> ssi:boot:slurm: not running under SLURM
n-1<8253> ssi:boot:select: boot module not available: slurm
n-1<8253> ssi:boot:select: initializing boot module globus
n-1<8253> ssi:boot:globus: globus-job-run not found, globus boot will not run
n-1<8253> ssi:boot:select: boot module not available: globus
n-1<8253> ssi:boot:select: initializing boot module rsh
n-1<8253> ssi:boot:rsh: module initializing
n-1<8253> ssi:boot:rsh:agent: rsh
n-1<8253> ssi:boot:rsh:username: <same>
n-1<8253> ssi:boot:rsh:verbose: 1000
n-1<8253> ssi:boot:rsh:algorithm: linear
n-1<8253> ssi:boot:rsh:no_n: 0
n-1<8253> ssi:boot:rsh:no_profile: 0
n-1<8253> ssi:boot:rsh:fast: 0
n-1<8253> ssi:boot:rsh:ignore_stderr: 0
n-1<8253> ssi:boot:rsh:priority: 75
n-1<8253> ssi:boot:select: boot module available: rsh, priority: 75
n-1<8253> ssi:boot:select: finalizing boot module slurm
n-1<8253> ssi:boot:slurm: finalizing
n-1<8253> ssi:boot:select: closing boot module slurm
n-1<8253> ssi:boot:select: finalizing boot module globus
n-1<8253> ssi:boot:globus: finalizing
n-1<8253> ssi:boot:select: closing boot module globus
n-1<8253> ssi:boot:select: selected boot module rsh
n-1<8253> ssi:boot:base: looking for boot schema in following directories:
n-1<8253> ssi:boot:base: <current directory>
n-1<8253> ssi:boot:base: $TROLLIUSHOME/etc
n-1<8253> ssi:boot:base: $LAMHOME/etc
n-1<8253> ssi:boot:base: /export/home/alookdo/lam/etc
n-1<8253> ssi:boot:base: looking for boot schema file:
n-1<8253> ssi:boot:base: lam-bhost.def
n-1<8253> ssi:boot:base: found boot schema: /export/home/alookdo/lam/etc/lam-bhost.def
n-1<8253> ssi:boot:rsh: found the following hosts:
n-1<8253> ssi:boot:rsh: n0 c0101 (cpu=1)
n-1<8253> ssi:boot:rsh: n1 c0102 (cpu=1)
n-1<8253> ssi:boot:rsh: n2 c0103 (cpu=1)
n-1<8253> ssi:boot:rsh: resolved hosts:
n-1<8253> ssi:boot:rsh: n0 c0101 --> 192.168.1.1 (origin)
n-1<8253> ssi:boot:rsh: n1 c0102 --> 192.168.1.2
n-1<8253> ssi:boot:rsh: n2 c0103 --> 192.168.1.3
n-1<8253> ssi:boot:rsh: starting RTE procs
n-1<8253> ssi:boot:base:linear: starting
n-1<8253> ssi:boot:base:linear: booting n0 (c0101)
n-1<8253> ssi:boot:rsh: starting wipe on (c0101)
n-1<8253> ssi:boot:rsh: starting on n0 (c0101): tkill -d
n-1<8253> ssi:boot:rsh: launching locally
tkill: setting prefix to (null)
tkill: setting suffix to (null)
tkill: got killname back: /tmp/lam-alookdo_at_c0101/lam-killfile
tkill: f_kill = "/tmp/lam-alookdo_at_c0101/lam-killfile"
tkill: killing LAM...
tkill: killing PID (SIGHUP) 8168 ...
tkill: killed
tkill: removing socket file ...
tkill: socket file: /tmp/lam-alookdo_at_c0101/lam-kernel-socketd
tkill: removing IO daemon socket file ...
tkill: IO daemon socket file: /tmp/lam-alookdo_at_c0101/lam-io-socket
tkill: all finished
n-1<8253> ssi:boot:rsh: successfully launched on n0 (c0101)
n-1<8253> ssi:boot:base:linear: booting n1 (c0102)
n-1<8253> ssi:boot:rsh: starting wipe on (c0102)
n-1<8253> ssi:boot:rsh: starting on n1 (c0102): tkill -d
n-1<8253> ssi:boot:rsh: launching remotely
n-1<8253> ssi:boot:rsh: attempting to execute: rsh c0102 -n 'echo $SHELL'
n-1<8253> ssi:boot:rsh: remote shell /bin/bash
n-1<8253> ssi:boot:rsh: attempting to execute: rsh c0102 -n tkill -d
tkill: removing socket file ...
tkill: socket file: /tmp/lam-alookdo_at_c0102/lam-sd
tkill: removing IO daemon socket file ...
tkill: IO daemon socket file: /tmp/lam-alookdo_at_c0102/lam-sio
tkill: f_kill = "/tmp/lam-alookdo_at_c0102/lam"
tkill: nothing to kill: "/tmp/lam-alookdo_at_c0102/lam"
n-1<8253> ssi:boot:rsh: successfully launched on n1 (c0102)
n-1<8253> ssi:boot:base:linear: booting n2 (c0103)
n-1<8253> ssi:boot:rsh: starting wipe on (c0103)
n-1<8253> ssi:boot:rsh: starting on n2 (c0103): tkill -d
n-1<8253> ssi:boot:rsh: launching remotely
n-1<8253> ssi:boot:rsh: attempting to execute: rsh c0103 -n 'echo $SHELL'
n-1<8253> ssi:boot:rsh: remote shell /bin/bash
n-1<8253> ssi:boot:rsh: attempting to execute: rsh c0103 -n tkill -d
tkill: removing socket file ...
tkill: socket file: /tmp/lam-alookdo_at_c0103/lam-sd
tkill: removing IO daemon socket file ...
tkill: IO daemon socket file: /tmp/lam-alookdo_at_c0103/lam-sio
tkill: f_kill = "/tmp/lam-alookdo_at_c0103/lam"
tkill: nothing to kill: "/tmp/lam-alookdo_at_c0103/lam"
n-1<8253> ssi:boot:rsh: successfully launched on n2 (c0103)
n-1<8253> ssi:boot:base:linear: finished
n-1<8253> ssi:boot:rsh: all RTE procs started
n-1<8253> ssi:boot:rsh: finalizing
n-1<8253> ssi:boot: Closing
lamboot did NOT complete successfully
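
For reference, the check that the lamboot message suggests (read the -H
address and -P port from the hboot command line, then try to connect to
them from the remote node) could be scripted roughly as below. This is
only a sketch: the names parse_callback and probe are my own and are not
part of LAM, and it assumes Python 3 is available on the remote node. As
the message says, "refused" is the expected result once lamboot has
exited, since it shows the remote node can reach the origin; "timeout"
or another error would point to a filter or a routing problem.

```python
import re
import socket

# Matches the "-H <address> -P <port>" pair that lamboot passes to hboot/lamd.
_CALLBACK_RE = re.compile(r"-H\s+(\d{1,3}(?:\.\d{1,3}){3})\s+-P\s+(\d+)")


def parse_callback(log_line):
    """Return (host, port) from an hboot command line, or None if absent."""
    match = _CALLBACK_RE.search(log_line)
    if match is None:
        return None
    return match.group(1), int(match.group(2))


def probe(host, port, timeout=5.0):
    """Try one TCP connection and classify the outcome as a short string."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "connected"   # something is listening (lamboot still waiting)
    except ConnectionRefusedError:
        return "refused"         # host reachable, nothing listening on the port
    except socket.timeout:
        return "timeout"         # no answer at all, e.g. packets dropped
    except OSError as exc:
        return "error: %s" % exc  # e.g. no route to host
```

Run on c0102, parse_callback applied to the hboot line from the log above
gives ("192.168.1.1", 33973), and probe("192.168.1.1", 33973) then reports
how that port looks from the remote side.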