The important lines in your output are these:
-----
n-1<10600> ssi:boot:rsh: successfully launched on n1 (168.158.222.152)
n-1<10600> ssi:boot:base:server: expecting connection from finite list
--------------------------------------------------------------
The lamboot agent timed out while waiting for the newly-booted process
to call back and indicated that it had successfully booted.
-----
What this means is that LAM was able to launch its process remotely,
but then never received an expected callback on a specific TCP socket.
Do you have a firewall between these two machines? LAM uses
random TCP and UDP ports to communicate -- it needs to be able to
open connections / send UDP packets on arbitrary [non-privileged]
ports. Check for port blocking; if you don't find any, let us know
and we can investigate further.
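
If telnet isn't handy on the remote node, the same check can be scripted.
What follows is only a rough sketch, not part of LAM: the function name
probe_tcp and the result labels are made up for illustration, and the
address and port in the usage note are simply the -H and -P values from
your own hboot command line (the port changes on every lamboot run).
Run it on the remote node (n1), pointed back at the machine where you
ran lamboot.

```python
import socket
import sys


def probe_tcp(host, port, timeout=5.0):
    """Try one TCP connection to host:port and classify the outcome.

    Returns one of:
      "open"      - something accepted the connection
      "refused"   - the host answered, but nothing is listening on the port
      "filtered"  - no answer before the timeout (packets probably dropped)
      "error: .." - any other failure (e.g. no route to host)
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        return "refused"
    except socket.timeout:
        return "filtered"
    except OSError as exc:
        return "error: %s" % exc


if __name__ == "__main__":
    # Usage: python probe.py <ipnumber> <portnumber>
    host, port = sys.argv[1], int(sys.argv[2])
    print("%s:%d -> %s" % (host, port, probe_tcp(host, port)))
```

For example, "python probe.py 168.158.222.80 32856" run on n1. My reading
of the results: "open" (while lamboot is still waiting) or "refused"
(after lamboot has given up) both suggest the path is clear, which matches
the "Connection refused" that the lamboot help message tells you to
expect. "filtered" or an error points toward port blocking or routing
trouble.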
On Feb 10, 2006, at 4:48 PM, pele_smk wrote:
> I'm having a problem starting lamboot. I'm using ssh keys and can
> ssh to the machines without incident. But lamboot still will not
> execute.
> I have ssh keys set up on the machines and am using the command
>
> $lamboot -d machinefile.all
>
> When I watch /var/log/messages on the guest machine I receive:
>
> sshd session opened for user parallel
> sshd session closed for user parallel
> sshd session opened for user parallel
> lamd started (7.1.1) uid 501, gid 501
> lamd kernel initialized
> sshd session closed for user parallel
>
> While watching /var/log/messages on the host machine I receive:
> lamd: started (7.1.1), uid 503, gid 503
> lamd: kernel: initialized
>
>
> When I execute $lamboot -d machinefile.all I receive:
>
> n-1<10600> ssi:boot:open: opening
> n-1<10600> ssi:boot:open: opening boot module globus
> n-1<10600> ssi:boot:open: opened boot module globus
> n-1<10600> ssi:boot:open: opening boot module rsh
> n-1<10600> ssi:boot:open: opened boot module rsh
> n-1<10600> ssi:boot:open: opening boot module slurm
> n-1<10600> ssi:boot:open: opened boot module slurm
> n-1<10600> ssi:boot:select: initializing boot module globus
> n-1<10600> ssi:boot:globus: globus-job-run not found, globus boot
> will not run
> n-1<10600> ssi:boot:select: boot module not available: globus
> n-1<10600> ssi:boot:select: initializing boot module rsh
> n-1<10600> ssi:boot:rsh: module initializing
> n-1<10600> ssi:boot:rsh:agent: /usr/bin/ssh -x -a
> n-1<10600> ssi:boot:rsh:username: <same>
> n-1<10600> ssi:boot:rsh:verbose: 1000
> n-1<10600> ssi:boot:rsh:algorithm: linear
> n-1<10600> ssi:boot:rsh:no_n: 0
> n-1<10600> ssi:boot:rsh:no_profile: 0
> n-1<10600> ssi:boot:rsh:fast: 0
> n-1<10600> ssi:boot:rsh:ignore_stderr: 0
> n-1<10600> ssi:boot:rsh:priority: 10
> n-1<10600> ssi:boot:select: boot module available: rsh, priority: 10
> n-1<10600> ssi:boot:select: initializing boot module slurm
> n-1<10600> ssi:boot:slurm: not running under SLURM
> n-1<10600> ssi:boot:select: boot module not available: slurm
> n-1<10600> ssi:boot:select: finalizing boot module globus
> n-1<10600> ssi:boot:globus: finalizing
> n-1<10600> ssi:boot:select: closing boot module globus
> n-1<10600> ssi:boot:select: finalizing boot module slurm
> n-1<10600> ssi:boot:slurm: finalizing
> n-1<10600> ssi:boot:select: closing boot module slurm
> n-1<10600> ssi:boot:select: selected boot module rsh
>
> LAM 7.1.1/MPI 2 C++/ROMIO - Indiana University
>
> n-1<10600> ssi:boot:base: looking for boot schema in following
> directories:
> n-1<10600> ssi:boot:base: <current directory>
> n-1<10600> ssi:boot:base: $TROLLIUSHOME/etc
> n-1<10600> ssi:boot:base: $LAMHOME/etc
> n-1<10600> ssi:boot:base: /etc/lam
> n-1<10600> ssi:boot:base: looking for boot schema file:
> n-1<10600> ssi:boot:base: machinefile.all
> n-1<10600> ssi:boot:base: found boot schema: machinefile.all
> n-1<10600> ssi:boot:rsh: found the following hosts:
> n-1<10600> ssi:boot:rsh: n0 168.158.222.80 (cpu=1)
> n-1<10600> ssi:boot:rsh: n1 168.158.222.152 (cpu=1)
> n-1<10600> ssi:boot:rsh: resolved hosts:
> n-1<10600> ssi:boot:rsh: n0 168.158.222.80 --> 168.158.222.80
> (origin)
> n-1<10600> ssi:boot:rsh: n1 168.158.222.152 --> 168.158.222.152
> n-1<10600> ssi:boot:rsh: starting RTE procs
> n-1<10600> ssi:boot:base:linear: starting
> n-1<10600> ssi:boot:base:server: opening server TCP socket
> n-1<10600> ssi:boot:base:server: opened port 32856
> n-1<10600> ssi:boot:base:linear: booting n0 (168.158.222.80)
> n-1<10600> ssi:boot:rsh: starting lamd on (168.158.222.80)
> n-1<10600> ssi:boot:rsh: starting on n0 (168.158.222.80): hboot -t -
> c lam-conf.lamd -d -I -H 168.158.222.80 -P 32856 -n 0 -o 0
> n-1<10600> ssi:boot:rsh: launching locally
> hboot: performing tkill
> hboot: tkill -d
> tkill: setting prefix to (null)
> tkill: setting suffix to (null)
> tkill: got killname back: /tmp/lam-parallel_at_Solar1202.localdomain/
> lam-killfile
> tkill: removing socket file ...
> tkill: socket file: /tmp/lam-parallel_at_Solar1202.localdomain/lam-
> kernel-socketd
> tkill: removing IO daemon socket file ...
> tkill: IO daemon socket file: /tmp/lam-
> parallel_at_Solar1202.localdomain/lam-io-socket
> tkill: f_kill = "/tmp/lam-parallel_at_Solar1202.localdomain/lam-killfile"
> tkill: nothing to kill: "/tmp/lam-parallel_at_Solar1202.localdomain/
> lam-killfile"
> hboot: booting...
> hboot: fork /usr/bin/lamd
> hboot: attempting to execute
> n-1<10603> ssi:boot:open: opening
> n-1<10603> ssi:boot:open: opening boot module globus
> n-1<10603> ssi:boot:open: opened boot module globus
> n-1<10603> ssi:boot:open: opening boot module rsh
> n-1<10603> ssi:boot:open: opened boot module rsh
> n-1<10603> ssi:boot:open: opening boot module slurm
> n-1<10603> ssi:boot:open: opened boot module slurm
> n-1<10603> ssi:boot:select: initializing boot module globus
> n-1<10603> ssi:boot:globus: globus-job-run not found, globus boot
> will not run
> n-1<10603> ssi:boot:select: boot module not available: globus
> n-1<10603> ssi:boot:select: initializing boot module rsh
> n-1<10603> ssi:boot:rsh: module initializing
> n-1<10603> ssi:boot:rsh:agent: /usr/bin/ssh -x -a
> n-1<10603> ssi:boot:rsh:username: <same>
> n-1<10603> ssi:boot:rsh:verbose: 1000
> n-1<10603> ssi:boot:rsh:algorithm: linear
> n-1<10603> ssi:boot:rsh:no_n: 0
> n-1<10603> ssi:boot:rsh:no_profile: 0
> n-1<10603> ssi:boot:rsh:fast: 0
> n-1<10603> ssi:boot:rsh:ignore_stderr: 0
> n-1<10603> ssi:boot:rsh:priority: 10
> n-1<10603> ssi:boot:select: boot module available: rsh, priority: 10
> n-1<10603> ssi:boot:select: initializing boot module slurm
> n-1<10603> ssi:boot:slurm: not running under SLURM
> n-1<10603> ssi:boot:select: boot module not available: slurm
> n-1<10603> ssi:boot:select: finalizing boot module globus
> n-1<10603> ssi:boot:globus: finalizing
> n-1<10603> ssi:boot:select: closing boot module globus
> n-1<10603> ssi:boot:select: finalizing boot module slurm
> n-1<10603> ssi:boot:slurm: finalizing
> n-1<10603> ssi:boot:select: closing boot module slurm
> n-1<10603> ssi:boot:select: selected boot module rsh
> n-1<10603> ssi:boot:send_lamd: getting node ID from command line
> n-1<10603> ssi:boot:send_lamd: getting agent haddr from command line
> n-1<10603> ssi:boot:send_lamd: getting agent port from command line
> n-1<10603> ssi:boot:send_lamd: getting node ID from command line
> n-1<10603> ssi:boot:send_lamd: connecting to 168.158.222.80:32856,
> node id 0
> n-1<10603> ssi:boot:send_lamd: sending dli_port 33037
> [1] 10603 lamd -H 168.158.222.80 -P 32856 -n 0 -o 0 -d
> n-1<10600> ssi:boot:rsh: successfully launched on n0 (168.158.222.80)
> n-1<10600> ssi:boot:base:server: expecting connection from finite list
> n-1<10600> ssi:boot:base:server: got connection from 168.158.222.80
> n-1<10600> ssi:boot:base:server: this connection is expected (n0)
> n-1<10600> ssi:boot:base:server: remote lamd is at
> 168.158.222.80:33037
> n-1<10600> ssi:boot:base:linear: booting n1 (168.158.222.152)
> n-1<10600> ssi:boot:rsh: starting lamd on (168.158.222.152)
> n-1<10600> ssi:boot:rsh: starting on n1 (168.158.222.152): hboot -t
> -c lam-conf.lamd -d -s -I "-H 168.158.222.80 -P 32856 -n 1 -o 0"
> n-1<10600> ssi:boot:rsh: launching remotely
> n-1<10600> ssi:boot:rsh: attempting to execute: /usr/bin/ssh -x -a
> 168.158.222.152 -n 'echo $SHELL'
> n-1<10600> ssi:boot:rsh: remote shell /bin/bash
> n-1<10600> ssi:boot:rsh: attempting to execute: /usr/bin/ssh -x -a
> 168.158.222.152 -n hboot -t -c lam-conf.lamd -d -s -I '"-H
> 168.158.222.80 -P 32856 -n 1 -o 0"'
> tkill: setting prefix to (null)
> tkill: setting suffix to (null)
> tkill: got killname back: /tmp/lam-parallel_at_Solar1204.localdomain/
> lam-killfile
> tkill: removing socket file ...
> tkill: socket file: /tmp/lam-parallel_at_Solar1204.localdomain/lam-
> kernel-socketd
> tkill: removing IO daemon socket file ...
> tkill: IO daemon socket file: /tmp/lam-
> parallel_at_Solar1204.localdomain/lam-io-socket
> tkill: f_kill = "/tmp/lam-parallel_at_Solar1204.localdomain/lam-killfile"
> tkill: nothing to kill: "/tmp/lam-parallel_at_Solar1204.localdomain/
> lam-killfile"
> hboot: performing tkill
> hboot: tkill -d
> hboot: booting...
> hboot: fork /usr/bin/lamd
> [1] 15798 lamd -H 168.158.222.80 -P 32856 -n 1 -o 0 -d
> n-1<10600> ssi:boot:rsh: successfully launched on n1 (168.158.222.152)
> n-1<10600> ssi:boot:base:server: expecting connection from finite list
>
> ----------------------------------------------------------------------
> -------
> The lamboot agent timed out while waiting for the newly-booted process
> to call back and indicated that it had successfully booted.
>
> *** PLEASE READ THIS ENTIRE MESSAGE, FOLLOW ITS SUGGESTIONS, AND
> *** CONSULT THE "BOOTING LAM" SECTION OF THE LAM/MPI FAQ
> *** (http://www.lam-mpi.org/faq/) BEFORE POSTING TO THE LAM/MPI USER'S
> *** MAILING LIST.
>
> As far as LAM could tell, the remote process started properly, but
> then never called back. Possible reasons that this may happen:
>
> - There are network filters between the lamboot agent host and
> the remote host such that communication on random TCP ports
> is blocked
> - Network routing from the remote host to the local host isn't
> properly configured (this is uncommon)
>
> You can check these things by watching the output from "lamboot -d".
>
> 1. On the command line for hboot, there are two important parameters:
> one is the IP address of where the lamboot agent was invoked, the
> other is the port number that the lamboot agent is expecting the
> newly-booted process to call back on (this will be a random
> integer).
>
> 2. Manually login to the remote machine and try to telnet to the port
> indicated on the hboot command line. For example,
> telnet <ipnumber> <portnumber>
> If all goes well, you should get a "Connection refused" error. If
> you get any other kind of error, it could indicate either of the
> two conditions above. Consult with your system/network
> administrator.
> ----------------------------------------------------------------------
> -------
> n-1<10600> ssi:boot:base:server: failed to connect to remote lamd!
> n-1<10600> ssi:boot:base:server: closing server socket
> n-1<10600> ssi:boot:base:linear: aborted!
> n-1<10616> ssi:boot:open: opening
> n-1<10616> ssi:boot:open: opening boot module globus
> n-1<10616> ssi:boot:open: opened boot module globus
> n-1<10616> ssi:boot:open: opening boot module rsh
> n-1<10616> ssi:boot:open: opened boot module rsh
> n-1<10616> ssi:boot:open: opening boot module slurm
> n-1<10616> ssi:boot:open: opened boot module slurm
> n-1<10616> ssi:boot:select: initializing boot module globus
> n-1<10616> ssi:boot:globus: globus-job-run not found, globus boot
> will not run
> n-1<10616> ssi:boot:select: boot module not available: globus
> n-1<10616> ssi:boot:select: initializing boot module rsh
> n-1<10616> ssi:boot:rsh: module initializing
> n-1<10616> ssi:boot:rsh:agent: /usr/bin/ssh -x -a
> n-1<10616> ssi:boot:rsh:username: <same>
> n-1<10616> ssi:boot:rsh:verbose: 1000
> n-1<10616> ssi:boot:rsh:algorithm: linear
> n-1<10616> ssi:boot:rsh:no_n: 0
> n-1<10616> ssi:boot:rsh:no_profile: 0
> n-1<10616> ssi:boot:rsh:fast: 0
> n-1<10616> ssi:boot:rsh:ignore_stderr: 0
> n-1<10616> ssi:boot:rsh:priority: 10
> n-1<10616> ssi:boot:select: boot module available: rsh, priority: 10
> n-1<10616> ssi:boot:select: initializing boot module slurm
> n-1<10616> ssi:boot:slurm: not running under SLURM
> n-1<10616> ssi:boot:select: boot module not available: slurm
> n-1<10616> ssi:boot:select: finalizing boot module globus
> n-1<10616> ssi:boot:globus: finalizing
> n-1<10616> ssi:boot:select: closing boot module globus
> n-1<10616> ssi:boot:select: finalizing boot module slurm
> n-1<10616> ssi:boot:slurm: finalizing
> n-1<10616> ssi:boot:select: closing boot module slurm
> n-1<10616> ssi:boot:select: selected boot module rsh
> n-1<10616> ssi:boot:base: looking for boot schema in following
> directories:
> n-1<10616> ssi:boot:base: <current directory>
> n-1<10616> ssi:boot:base: $TROLLIUSHOME/etc
> n-1<10616> ssi:boot:base: $LAMHOME/etc
> n-1<10616> ssi:boot:base: /etc/lam
> n-1<10616> ssi:boot:base: looking for boot schema file:
> n-1<10616> ssi:boot:base: machinefile.all
> n-1<10616> ssi:boot:base: found boot schema: machinefile.all
> n-1<10616> ssi:boot:rsh: found the following hosts:
> n-1<10616> ssi:boot:rsh: n0 168.158.222.80 (cpu=1)
> n-1<10616> ssi:boot:rsh: n1 168.158.222.152 (cpu=1)
> n-1<10616> ssi:boot:rsh: resolved hosts:
> n-1<10616> ssi:boot:rsh: n0 168.158.222.80 --> 168.158.222.80
> (origin)
> n-1<10616> ssi:boot:rsh: n1 168.158.222.152 --> 168.158.222.152
> n-1<10616> ssi:boot:rsh: starting RTE procs
> n-1<10616> ssi:boot:base:linear: starting
> n-1<10616> ssi:boot:base:linear: booting n0 (168.158.222.80)
> n-1<10616> ssi:boot:rsh: starting wipe on (168.158.222.80)
> n-1<10616> ssi:boot:rsh: starting on n0 (168.158.222.80): tkill -d
> n-1<10616> ssi:boot:rsh: launching locally
> tkill: setting prefix to (null)
> tkill: setting suffix to (null)
> tkill: got killname back: /tmp/lam-parallel_at_Solar1202.localdomain/
> lam-killfile
> tkill: removing socket file ...
> tkill: socket file: /tmp/lam-parallel_at_Solar1202.localdomain/lam-
> kernel-socketd
> tkill: removing IO daemon socket file ...
> tkill: IO daemon socket file: /tmp/lam-
> parallel_at_Solar1202.localdomain/lam-io-socket
> tkill: f_kill = "/tmp/lam-parallel_at_Solar1202.localdomain/lam-killfile"
> tkill: killing LAM...
> tkill: killing PID (SIGHUP) 10603 ...
> tkill: killed
> tkill: all finished
> n-1<10616> ssi:boot:rsh: successfully launched on n0 (168.158.222.80)
> n-1<10616> ssi:boot:base:linear: booting n1 (168.158.222.152)
> n-1<10616> ssi:boot:rsh: starting wipe on (168.158.222.152)
> n-1<10616> ssi:boot:rsh: starting on n1 (168.158.222.152): tkill -d
> n-1<10616> ssi:boot:rsh: launching remotely
> n-1<10616> ssi:boot:rsh: attempting to execute: /usr/bin/ssh -x -a
> 168.158.222.152 -n 'echo $SHELL'
> n-1<10616> ssi:boot:rsh: remote shell /bin/bash
> n-1<10616> ssi:boot:rsh: attempting to execute: /usr/bin/ssh -x -a
> 168.158.222.152 -n tkill -d
> tkill: setting prefix to (null)
> tkill: setting suffix to (null)
> tkill: got killname back: /tmp/lam-parallel_at_Solar1204.localdomain/
> lam-killfile
> tkill: removing socket file ...
> tkill: socket file: /tmp/lam-parallel_at_Solar1204.localdomain/lam-
> kernel-socketd
> tkill: removing IO daemon socket file ...
> tkill: IO daemon socket file: /tmp/lam-
> parallel_at_Solar1204.localdomain/lam-io-socket
> tkill: f_kill = "/tmp/lam-parallel_at_Solar1204.localdomain/lam-killfile"
> tkill: killing LAM...
> tkill: killing PID (SIGHUP) 15812 ...
> tkill: killed
> tkill: all finished
> n-1<10616> ssi:boot:rsh: successfully launched on n1 (168.158.222.152)
> n-1<10616> ssi:boot:base:linear: finished
> n-1<10616> ssi:boot:rsh: all RTE procs started
> n-1<10616> ssi:boot:rsh: finalizing
> n-1<10616> ssi:boot: Closing
> lamboot did NOT complete successfully
> [parallel_at_Solar1202 ~]$
>
> _______________________________________________
> This list is archived at http://www.lam-mpi.org/MailArchives/lam/
--
{+} Jeff Squyres
{+} The Open MPI Project
{+} http://www.open-mpi.org/