LAM/MPI General User's Mailing List Archives

From: pele_smk (pelesmk_at_[hidden])
Date: 2006-02-13 16:20:43


I moved the test pair of computers to a new network without any port
blocking, but I still receive the same error.
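
The manual telnet check suggested in the quoted lamboot output can also be scripted. The following is a minimal sketch, not part of LAM: the function names `find_callback_endpoint` and `probe_tcp` are hypothetical, and it assumes the `-H <ip> -P <port>` pair on the hboot command line is the lamboot agent's address and callback port, as the error text describes. It would be run on the remote node (n1) after lamboot has exited.

```python
import errno
import re
import socket


def find_callback_endpoint(lamboot_output):
    """Return (host, port) from the first '-H <ip> -P <port>' pair, or None.

    Assumes the raw, unquoted output of "lamboot -d".
    """
    match = re.search(r"-H\s+(\S+)\s+-P\s+(\d+)", lamboot_output)
    if match is None:
        return None
    return match.group(1), int(match.group(2))


def probe_tcp(host, port, timeout=5.0):
    """Classify one TCP connect attempt.

    Returns 'open', 'refused', 'timeout', or 'error: <code>'.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.settimeout(timeout)
    try:
        sock.connect((host, port))
        return "open"
    except ConnectionRefusedError:
        # The host answered but nothing is listening: the path is reachable.
        return "refused"
    except socket.timeout:
        # No answer at all: often a sign of packet filtering or bad routing.
        return "timeout"
    except OSError as exc:
        return "error: %s" % errno.errorcode.get(exc.errno, str(exc))
    finally:
        sock.close()


# Example use on the remote node (the port changes on every lamboot run):
#   endpoint = find_callback_endpoint(open("lamboot.log").read())
#   if endpoint:
#       print(probe_tcp(*endpoint))
```

Once lamboot has exited, "refused" corresponds to the "Connection refused" result that the error text calls the good case. "timeout" or another error would point toward filtering or routing. This only exercises TCP; it says nothing about the UDP traffic LAM also needs.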

On 2/12/06, Jeff Squyres <jsquyres_at_[hidden]> wrote:
>
> The important lines in your output are these:
>
> -----
> n-1<10600> ssi:boot:rsh: successfully launched on n1 (168.158.222.152)
> n-1<10600> ssi:boot:base:server: expecting connection from finite list
>
> --------------------------------------------------------------
> The lamboot agent timed out while waiting for the newly-booted process
> to call back and indicated that it had successfully booted.
> -----
>
> What this means is that LAM was able to launch its process remotely,
> but then never received an expected callback on a specific TCP socket.
>
> Do you have any firewalling between these two machines? LAM uses
> random TCP and UDP ports to communicate -- it needs to be able to
> open connections / send UDP packets on any arbitrary [non-privileged]
> ports. Check for port blocking; let us know if you don't have any
> and we can investigate further.
>
>
>
> On Feb 10, 2006, at 4:48 PM, pele_smk wrote:
>
> > I'm having a problem starting lamboot. I'm using ssh keys and can
> > ssh to the machines without incident. But lamboot still will not
> > execute.
> > I have ssh keys set up on the machines and am using the command
> >
> > $ lamboot -d machinefile.all
> >
> > When I watch /var/log/messages on the guest machine, I receive:
> >
> > sshd session opened for user parallel
> > sshd session closed for user parallel
> > sshd session opened for user parallel
> > lamd started (7.1.1) uid 501, gid 501
> > lamd kernel initialized
> > sshd session closed for user parallel
> >
> > While watching /var/log/messages on the host machine, I receive:
> > lamd: started (7.1.1), uid 503, gid 503
> > lamd: kernel: initialized
> >
> >
> > When I execute $ lamboot -d machinefile.all, I receive:
> >
> > n-1<10600> ssi:boot:open: opening
> > n-1<10600> ssi:boot:open: opening boot module globus
> > n-1<10600> ssi:boot:open: opened boot module globus
> > n-1<10600> ssi:boot:open: opening boot module rsh
> > n-1<10600> ssi:boot:open: opened boot module rsh
> > n-1<10600> ssi:boot:open: opening boot module slurm
> > n-1<10600> ssi:boot:open: opened boot module slurm
> > n-1<10600> ssi:boot:select: initializing boot module globus
> > n-1<10600> ssi:boot:globus: globus-job-run not found, globus boot
> > will not run
> > n-1<10600> ssi:boot:select: boot module not available: globus
> > n-1<10600> ssi:boot:select: initializing boot module rsh
> > n-1<10600> ssi:boot:rsh: module initializing
> > n-1<10600> ssi:boot:rsh:agent: /usr/bin/ssh -x -a
> > n-1<10600> ssi:boot:rsh:username: <same>
> > n-1<10600> ssi:boot:rsh:verbose: 1000
> > n-1<10600> ssi:boot:rsh:algorithm: linear
> > n-1<10600> ssi:boot:rsh:no_n: 0
> > n-1<10600> ssi:boot:rsh:no_profile: 0
> > n-1<10600> ssi:boot:rsh:fast: 0
> > n-1<10600> ssi:boot:rsh:ignore_stderr: 0
> > n-1<10600> ssi:boot:rsh:priority: 10
> > n-1<10600> ssi:boot:select: boot module available: rsh, priority: 10
> > n-1<10600> ssi:boot:select: initializing boot module slurm
> > n-1<10600> ssi:boot:slurm: not running under SLURM
> > n-1<10600> ssi:boot:select: boot module not available: slurm
> > n-1<10600> ssi:boot:select: finalizing boot module globus
> > n-1<10600> ssi:boot:globus: finalizing
> > n-1<10600> ssi:boot:select: closing boot module globus
> > n-1<10600> ssi:boot:select: finalizing boot module slurm
> > n-1<10600> ssi:boot:slurm: finalizing
> > n-1<10600> ssi:boot:select: closing boot module slurm
> > n-1<10600> ssi:boot:select: selected boot module rsh
> >
> > LAM 7.1.1/MPI 2 C++/ROMIO - Indiana University
> >
> > n-1<10600> ssi:boot:base: looking for boot schema in following
> > directories:
> > n-1<10600> ssi:boot:base: <current directory>
> > n-1<10600> ssi:boot:base: $TROLLIUSHOME/etc
> > n-1<10600> ssi:boot:base: $LAMHOME/etc
> > n-1<10600> ssi:boot:base: /etc/lam
> > n-1<10600> ssi:boot:base: looking for boot schema file:
> > n-1<10600> ssi:boot:base: machinefile.all
> > n-1<10600> ssi:boot:base: found boot schema: machinefile.all
> > n-1<10600> ssi:boot:rsh: found the following hosts:
> > n-1<10600> ssi:boot:rsh: n0 168.158.222.80 (cpu=1)
> > n-1<10600> ssi:boot:rsh: n1 168.158.222.152 (cpu=1)
> > n-1<10600> ssi:boot:rsh: resolved hosts:
> > n-1<10600> ssi:boot:rsh: n0 168.158.222.80 --> 168.158.222.80
> > (origin)
> > n-1<10600> ssi:boot:rsh: n1 168.158.222.152 --> 168.158.222.152
> > n-1<10600> ssi:boot:rsh: starting RTE procs
> > n-1<10600> ssi:boot:base:linear: starting
> > n-1<10600> ssi:boot:base:server: opening server TCP socket
> > n-1<10600> ssi:boot:base:server: opened port 32856
> > n-1<10600> ssi:boot:base:linear: booting n0 (168.158.222.80)
> > n-1<10600> ssi:boot:rsh: starting lamd on (168.158.222.80)
> > n-1<10600> ssi:boot:rsh: starting on n0 (168.158.222.80): hboot -t -
> > c lam-conf.lamd -d -I -H 168.158.222.80 -P 32856 -n 0 -o 0
> > n-1<10600> ssi:boot:rsh: launching locally
> > hboot: performing tkill
> > hboot: tkill -d
> > tkill: setting prefix to (null)
> > tkill: setting suffix to (null)
> > tkill: got killname back: /tmp/lam-parallel_at_Solar1202.localdomain/
> > lam-killfile
> > tkill: removing socket file ...
> > tkill: socket file: /tmp/lam-parallel_at_Solar1202.localdomain/lam-
> > kernel-socketd
> > tkill: removing IO daemon socket file ...
> > tkill: IO daemon socket file: /tmp/lam-
> > parallel_at_Solar1202.localdomain/lam-io-socket
> > tkill: f_kill = "/tmp/lam-parallel_at_Solar1202.localdomain/lam-killfile"
> > tkill: nothing to kill: "/tmp/lam-parallel_at_Solar1202.localdomain/
> > lam-killfile"
> > hboot: booting...
> > hboot: fork /usr/bin/lamd
> > hboot: attempting to execute
> > n-1<10603> ssi:boot:open: opening
> > n-1<10603> ssi:boot:open: opening boot module globus
> > n-1<10603> ssi:boot:open: opened boot module globus
> > n-1<10603> ssi:boot:open: opening boot module rsh
> > n-1<10603> ssi:boot:open: opened boot module rsh
> > n-1<10603> ssi:boot:open: opening boot module slurm
> > n-1<10603> ssi:boot:open: opened boot module slurm
> > n-1<10603> ssi:boot:select: initializing boot module globus
> > n-1<10603> ssi:boot:globus: globus-job-run not found, globus boot
> > will not run
> > n-1<10603> ssi:boot:select: boot module not available: globus
> > n-1<10603> ssi:boot:select: initializing boot module rsh
> > n-1<10603> ssi:boot:rsh: module initializing
> > n-1<10603> ssi:boot:rsh:agent: /usr/bin/ssh -x -a
> > n-1<10603> ssi:boot:rsh:username: <same>
> > n-1<10603> ssi:boot:rsh:verbose: 1000
> > n-1<10603> ssi:boot:rsh:algorithm: linear
> > n-1<10603> ssi:boot:rsh:no_n: 0
> > n-1<10603> ssi:boot:rsh:no_profile: 0
> > n-1<10603> ssi:boot:rsh:fast: 0
> > n-1<10603> ssi:boot:rsh:ignore_stderr: 0
> > n-1<10603> ssi:boot:rsh:priority: 10
> > n-1<10603> ssi:boot:select: boot module available: rsh, priority: 10
> > n-1<10603> ssi:boot:select: initializing boot module slurm
> > n-1<10603> ssi:boot:slurm: not running under SLURM
> > n-1<10603> ssi:boot:select: boot module not available: slurm
> > n-1<10603> ssi:boot:select: finalizing boot module globus
> > n-1<10603> ssi:boot:globus: finalizing
> > n-1<10603> ssi:boot:select: closing boot module globus
> > n-1<10603> ssi:boot:select: finalizing boot module slurm
> > n-1<10603> ssi:boot:slurm: finalizing
> > n-1<10603> ssi:boot:select: closing boot module slurm
> > n-1<10603> ssi:boot:select: selected boot module rsh
> > n-1<10603> ssi:boot:send_lamd: getting node ID from command line
> > n-1<10603> ssi:boot:send_lamd: getting agent haddr from command line
> > n-1<10603> ssi:boot:send_lamd: getting agent port from command line
> > n-1<10603> ssi:boot:send_lamd: getting node ID from command line
> > n-1<10603> ssi:boot:send_lamd: connecting to 168.158.222.80:32856,
> > node id 0
> > n-1<10603> ssi:boot:send_lamd: sending dli_port 33037
> > [1] 10603 lamd -H 168.158.222.80 -P 32856 -n 0 -o 0 -d
> > n-1<10600> ssi:boot:rsh: successfully launched on n0 (168.158.222.80)
> > n-1<10600> ssi:boot:base:server: expecting connection from finite list
> > n-1<10600> ssi:boot:base:server: got connection from 168.158.222.80
> > n-1<10600> ssi:boot:base:server: this connection is expected (n0)
> > n-1<10600> ssi:boot:base:server: remote lamd is at
> > 168.158.222.80:33037
> > n-1<10600> ssi:boot:base:linear: booting n1 (168.158.222.152)
> > n-1<10600> ssi:boot:rsh: starting lamd on (168.158.222.152)
> > n-1<10600> ssi:boot:rsh: starting on n1 (168.158.222.152): hboot -t
> > -c lam-conf.lamd -d -s -I "-H 168.158.222.80 -P 32856 -n 1 -o 0"
> > n-1<10600> ssi:boot:rsh: launching remotely
> > n-1<10600> ssi:boot:rsh: attempting to execute: /usr/bin/ssh -x -a
> > 168.158.222.152 -n 'echo $SHELL'
> > n-1<10600> ssi:boot:rsh: remote shell /bin/bash
> > n-1<10600> ssi:boot:rsh: attempting to execute: /usr/bin/ssh -x -a
> > 168.158.222.152 -n hboot -t -c lam-conf.lamd -d -s -I '"-H
> > 168.158.222.80 -P 32856 -n 1 -o 0"'
> > tkill: setting prefix to (null)
> > tkill: setting suffix to (null)
> > tkill: got killname back: /tmp/lam-parallel_at_Solar1204.localdomain/
> > lam-killfile
> > tkill: removing socket file ...
> > tkill: socket file: /tmp/lam-parallel_at_Solar1204.localdomain/lam-
> > kernel-socketd
> > tkill: removing IO daemon socket file ...
> > tkill: IO daemon socket file: /tmp/lam-
> > parallel_at_Solar1204.localdomain/lam-io-socket
> > tkill: f_kill = "/tmp/lam-parallel_at_Solar1204.localdomain/lam-killfile"
> > tkill: nothing to kill: "/tmp/lam-parallel_at_Solar1204.localdomain/
> > lam-killfile"
> > hboot: performing tkill
> > hboot: tkill -d
> > hboot: booting...
> > hboot: fork /usr/bin/lamd
> > [1] 15798 lamd -H 168.158.222.80 -P 32856 -n 1 -o 0 -d
> > n-1<10600> ssi:boot:rsh: successfully launched on n1 (168.158.222.152)
> > n-1<10600> ssi:boot:base:server: expecting connection from finite list
> >
> > ----------------------------------------------------------------------
> > -------
> > The lamboot agent timed out while waiting for the newly-booted process
> > to call back and indicated that it had successfully booted.
> >
> > *** PLEASE READ THIS ENTIRE MESSAGE, FOLLOW ITS SUGGESTIONS, AND
> > *** CONSULT THE "BOOTING LAM" SECTION OF THE LAM/MPI FAQ
> > *** (http://www.lam-mpi.org/faq/) BEFORE POSTING TO THE LAM/MPI USER'S
> > *** MAILING LIST.
> >
> > As far as LAM could tell, the remote process started properly, but
> > then never called back. Possible reasons that this may happen:
> >
> > - There are network filters between the lamboot agent host and
> > the remote host such that communication on random TCP ports
> > is blocked
> > - Network routing from the remote host to the local host isn't
> > properly configured (this is uncommon)
> >
> > You can check these things by watching the output from "lamboot -d".
> >
> > 1. On the command line for hboot, there are two important parameters:
> > one is the IP address of where the lamboot agent was invoked, the
> > other is the port number that the lamboot agent is expecting the
> > newly-booted process to call back on (this will be a random
> > integer).
> >
> > 2. Manually login to the remote machine and try to telnet to the port
> > indicated on the hboot command line. For example,
> > telnet <ipnumber> <portnumber>
> > If all goes well, you should get a "Connection refused" error. If
> > you get any other kind of error, it could indicate either of the
> > two conditions above. Consult with your system/network
> > administrator.
> > ----------------------------------------------------------------------
> > -------
> > n-1<10600> ssi:boot:base:server: failed to connect to remote lamd!
> > n-1<10600> ssi:boot:base:server: closing server socket
> > n-1<10600> ssi:boot:base:linear: aborted!
> > n-1<10616> ssi:boot:open: opening
> > n-1<10616> ssi:boot:open: opening boot module globus
> > n-1<10616> ssi:boot:open: opened boot module globus
> > n-1<10616> ssi:boot:open: opening boot module rsh
> > n-1<10616> ssi:boot:open: opened boot module rsh
> > n-1<10616> ssi:boot:open: opening boot module slurm
> > n-1<10616> ssi:boot:open: opened boot module slurm
> > n-1<10616> ssi:boot:select: initializing boot module globus
> > n-1<10616> ssi:boot:globus: globus-job-run not found, globus boot
> > will not run
> > n-1<10616> ssi:boot:select: boot module not available: globus
> > n-1<10616> ssi:boot:select: initializing boot module rsh
> > n-1<10616> ssi:boot:rsh: module initializing
> > n-1<10616> ssi:boot:rsh:agent: /usr/bin/ssh -x -a
> > n-1<10616> ssi:boot:rsh:username: <same>
> > n-1<10616> ssi:boot:rsh:verbose: 1000
> > n-1<10616> ssi:boot:rsh:algorithm: linear
> > n-1<10616> ssi:boot:rsh:no_n: 0
> > n-1<10616> ssi:boot:rsh:no_profile: 0
> > n-1<10616> ssi:boot:rsh:fast: 0
> > n-1<10616> ssi:boot:rsh:ignore_stderr: 0
> > n-1<10616> ssi:boot:rsh:priority: 10
> > n-1<10616> ssi:boot:select: boot module available: rsh, priority: 10
> > n-1<10616> ssi:boot:select: initializing boot module slurm
> > n-1<10616> ssi:boot:slurm: not running under SLURM
> > n-1<10616> ssi:boot:select: boot module not available: slurm
> > n-1<10616> ssi:boot:select: finalizing boot module globus
> > n-1<10616> ssi:boot:globus: finalizing
> > n-1<10616> ssi:boot:select: closing boot module globus
> > n-1<10616> ssi:boot:select: finalizing boot module slurm
> > n-1<10616> ssi:boot:slurm: finalizing
> > n-1<10616> ssi:boot:select: closing boot module slurm
> > n-1<10616> ssi:boot:select: selected boot module rsh
> > n-1<10616> ssi:boot:base: looking for boot schema in following
> > directories:
> > n-1<10616> ssi:boot:base: <current directory>
> > n-1<10616> ssi:boot:base: $TROLLIUSHOME/etc
> > n-1<10616> ssi:boot:base: $LAMHOME/etc
> > n-1<10616> ssi:boot:base: /etc/lam
> > n-1<10616> ssi:boot:base: looking for boot schema file:
> > n-1<10616> ssi:boot:base: machinefile.all
> > n-1<10616> ssi:boot:base: found boot schema: machinefile.all
> > n-1<10616> ssi:boot:rsh: found the following hosts:
> > n-1<10616> ssi:boot:rsh: n0 168.158.222.80 (cpu=1)
> > n-1<10616> ssi:boot:rsh: n1 168.158.222.152 (cpu=1)
> > n-1<10616> ssi:boot:rsh: resolved hosts:
> > n-1<10616> ssi:boot:rsh: n0 168.158.222.80 --> 168.158.222.80
> > (origin)
> > n-1<10616> ssi:boot:rsh: n1 168.158.222.152 --> 168.158.222.152
> > n-1<10616> ssi:boot:rsh: starting RTE procs
> > n-1<10616> ssi:boot:base:linear: starting
> > n-1<10616> ssi:boot:base:linear: booting n0 (168.158.222.80)
> > n-1<10616> ssi:boot:rsh: starting wipe on (168.158.222.80)
> > n-1<10616> ssi:boot:rsh: starting on n0 (168.158.222.80): tkill -d
> > n-1<10616> ssi:boot:rsh: launching locally
> > tkill: setting prefix to (null)
> > tkill: setting suffix to (null)
> > tkill: got killname back: /tmp/lam-parallel_at_Solar1202.localdomain/
> > lam-killfile
> > tkill: removing socket file ...
> > tkill: socket file: /tmp/lam-parallel_at_Solar1202.localdomain/lam-
> > kernel-socketd
> > tkill: removing IO daemon socket file ...
> > tkill: IO daemon socket file: /tmp/lam-
> > parallel_at_Solar1202.localdomain/lam-io-socket
> > tkill: f_kill = "/tmp/lam-parallel_at_Solar1202.localdomain/lam-killfile"
> > tkill: killing LAM...
> > tkill: killing PID (SIGHUP) 10603 ...
> > tkill: killed
> > tkill: all finished
> > n-1<10616> ssi:boot:rsh: successfully launched on n0 (168.158.222.80)
> > n-1<10616> ssi:boot:base:linear: booting n1 (168.158.222.152)
> > n-1<10616> ssi:boot:rsh: starting wipe on (168.158.222.152)
> > n-1<10616> ssi:boot:rsh: starting on n1 (168.158.222.152): tkill -d
> > n-1<10616> ssi:boot:rsh: launching remotely
> > n-1<10616> ssi:boot:rsh: attempting to execute: /usr/bin/ssh -x -a
> > 168.158.222.152 -n 'echo $SHELL'
> > n-1<10616> ssi:boot:rsh: remote shell /bin/bash
> > n-1<10616> ssi:boot:rsh: attempting to execute: /usr/bin/ssh -x -a
> > 168.158.222.152 -n tkill -d
> > tkill: setting prefix to (null)
> > tkill: setting suffix to (null)
> > tkill: got killname back: /tmp/lam-parallel_at_Solar1204.localdomain/
> > lam-killfile
> > tkill: removing socket file ...
> > tkill: socket file: /tmp/lam-parallel_at_Solar1204.localdomain/lam-
> > kernel-socketd
> > tkill: removing IO daemon socket file ...
> > tkill: IO daemon socket file: /tmp/lam-
> > parallel_at_Solar1204.localdomain/lam-io-socket
> > tkill: f_kill = "/tmp/lam-parallel_at_Solar1204.localdomain/lam-killfile"
> > tkill: killing LAM...
> > tkill: killing PID (SIGHUP) 15812 ...
> > tkill: killed
> > tkill: all finished
> > n-1<10616> ssi:boot:rsh: successfully launched on n1 (168.158.222.152)
> > n-1<10616> ssi:boot:base:linear: finished
> > n-1<10616> ssi:boot:rsh: all RTE procs started
> > n-1<10616> ssi:boot:rsh: finalizing
> > n-1<10616> ssi:boot: Closing
> > lamboot did NOT complete successfully
> > [parallel_at_Solar1202 ~]$
> >
> > _______________________________________________
> > This list is archived at http://www.lam-mpi.org/MailArchives/lam/
>
>
> --
> {+} Jeff Squyres
> {+} The Open MPI Project
> {+} http://www.open-mpi.org/