LAM/MPI General User's Mailing List Archives

From: Jejo Koola (jdkoola_at_[hidden])
Date: 2005-07-14 10:01:41


There was a typo in my email. The lamboot command I tried was:

 lamboot -d -ssi boot_rsh_ignore_stderr 1
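
Since the open question in this thread is whether ssh writes anything to stderr when lamboot runs it, one way to check by hand is to discard stdout and capture only stderr. The following is a minimal sketch, not part of LAM: `check_stderr` is a hypothetical helper name, and it assumes a POSIX shell on the local machine.

```shell
# check_stderr CMD [ARGS...]
# Hypothetical helper (not a LAM command): runs CMD, throws away its
# stdout, and reports whether anything was written to stderr.
check_stderr() {
    # "2>&1" first points stderr at the capture pipe, then ">/dev/null"
    # sends stdout away, so only stderr ends up in $err.
    err=$("$@" 2>&1 >/dev/null)
    if [ -n "$err" ]; then
        printf 'stderr was not empty:\n%s\n' "$err"
        return 1
    fi
    echo "stderr was empty"
}

# Example call (needs a reachable host, so it is left commented out):
# check_stderr ssh -x athena.musc.edu -n 'echo $SHELL'
```

This only approximates what lamboot does when it launches the remote agent, so an empty result here does not rule out other causes.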

Jejo

On 7/14/05, Jejo Koola <jdkoola_at_[hidden]> wrote:
> Hi Jeff,
>
> Thanks for your reply. To answer your questions:
>
> 1. Yes, I captured all of the output generated by the command: lamboot
> -d. I assume that it will output all stderr messages on the host and
> remote nodes by default.
>
> 2. In looking through the output, I do not find any error messages.
>
> 3. I did find it odd that it tried to execute ssh 'echo $SHELL' twice.
> You will see it if you look at the output of lamboot. And it also
> printed out lamboot's verbose error/help message twice. Is it
> supposed to try that twice, or is that indicative of something wrong?
>
> 4. I ran lamboot with the boot_rsh_ignore_stderr ssi option, via the
> following command:
>
> lamboot -d -si boot_rsh_ignore_stderr 1
>
> It still failed, and the output was identical (except of course the
> port numbers were different this time around, and this time around the
> ignore_stderr option was set to 1)
>
> Thanks for your insights.
>
> Jejo
>
>
> On 7/13/05, Jeff Squyres <jsquyres_at_[hidden]> wrote:
> > In the output that you show below, did you capture the stderr as well?
> > I don't see anything that would cause an error in your output --
> > typically, when something like this happens, an error message was
> > emitted by ssh (or something related) and it was sent to stderr.
> >
> > Did anything like that occur? (sometimes the error message is hard to
> > find -- it may be embedded somewhere in the middle of the lamboot
> > output)
> >
> >
> > On Jul 11, 2005, at 3:27 PM, Jejo Koola wrote:
> >
> > > Sorry for beating a dead horse, but I've tried everything I can think
> > > of.
> > >
> > > Trying to lamboot between a linux machine:
> > >
> > > Linux sophia 2.4.20-20.8smp #1 SMP Mon Aug 18 14:39:22 EDT 2003 i686
> > > i686 i386 GNU/Linux
> > >
> > > and
> > >
> > > a sun-solaris machine:
> > >
> > > Linux sophia 2.4.20-20.8smp #1 SMP Mon Aug 18 14:39:22 EDT 2003 i686
> > > i686 i386 GNU/Linux
> > >
> > > They are both running:
> > >
> > > LAM 7.1.1/MPI 2 C++/ROMIO
> > >
> > > Naturally, I have followed everything in the manual, FAQ, and tried to
> > > scour the list. Essentially, lamboot fails at the command:
> > >
> > > ssh -x athena -n 'echo $SHELL'
> > >
> > > when trying to lamboot from the linux machine, and vice-versa when
> > > trying to lamboot from the solaris machine. ssh used on the command
> > > line works just fine from either machine. From the linux machine,
> > > here is the result of using ssh on the command line:
> > >
> > > -bash-2.05b$ ssh -x athena.musc.edu -n 'echo $SHELL'
> > > /bin/csh
> > > -bash-2.05b$
> > >
> > > Here is a further example of ssh on the command line from the linux
> > > machine:
> > >
> > > -bash-2.05b$ ssh -x athena.musc.edu -n 'tkill -N'
> > > -bash-2.05b$
> > >
> > > Is there something else that can go wrong when lamboot tries to use
> > > ssh versus just using ssh on the command line? Thanks for any
> > > insight.
> > >
> > > Regards,
> > >
> > > Jejo Koola
> > >
> > > Here is the output of lamboot -d:
> > >
> > > -bash-2.05b$ lamboot -d
> > > n-1<31283> ssi:boot:open: opening
> > > n-1<31283> ssi:boot:open: opening boot module globus
> > > n-1<31283> ssi:boot:open: opened boot module globus
> > > n-1<31283> ssi:boot:open: opening boot module rsh
> > > n-1<31283> ssi:boot:open: opened boot module rsh
> > > n-1<31283> ssi:boot:open: opening boot module slurm
> > > n-1<31283> ssi:boot:open: opened boot module slurm
> > > n-1<31283> ssi:boot:select: initializing boot module globus
> > > n-1<31283> ssi:boot:globus: globus-job-run not found, globus boot will
> > > not run
> > > n-1<31283> ssi:boot:select: boot module not available: globus
> > > n-1<31283> ssi:boot:select: initializing boot module rsh
> > > n-1<31283> ssi:boot:rsh: module initializing
> > > n-1<31283> ssi:boot:rsh:agent: ssh -x
> > > n-1<31283> ssi:boot:rsh:username: <same>
> > > n-1<31283> ssi:boot:rsh:verbose: 1000
> > > n-1<31283> ssi:boot:rsh:algorithm: linear
> > > n-1<31283> ssi:boot:rsh:no_n: 0
> > > n-1<31283> ssi:boot:rsh:no_profile: 0
> > > n-1<31283> ssi:boot:rsh:fast: 0
> > > n-1<31283> ssi:boot:rsh:ignore_stderr: 0
> > > n-1<31283> ssi:boot:rsh:priority: 10
> > > n-1<31283> ssi:boot:select: boot module available: rsh, priority: 10
> > > n-1<31283> ssi:boot:select: initializing boot module slurm
> > > n-1<31283> ssi:boot:slurm: not running under SLURM
> > > n-1<31283> ssi:boot:select: boot module not available: slurm
> > > n-1<31283> ssi:boot:select: finalizing boot module globus
> > > n-1<31283> ssi:boot:globus: finalizing
> > > n-1<31283> ssi:boot:select: closing boot module globus
> > > n-1<31283> ssi:boot:select: finalizing boot module slurm
> > > n-1<31283> ssi:boot:slurm: finalizing
> > > n-1<31283> ssi:boot:select: closing boot module slurm
> > > n-1<31283> ssi:boot:select: selected boot module rsh
> > >
> > > LAM 7.1.1/MPI 2 C++/ROMIO - Indiana University
> > >
> > > n-1<31283> ssi:boot:base: looking for boot schema in following
> > > directories:
> > > n-1<31283> ssi:boot:base: <current directory>
> > > n-1<31283> ssi:boot:base: $TROLLIUSHOME/etc
> > > n-1<31283> ssi:boot:base: $LAMHOME/etc
> > > n-1<31283> ssi:boot:base: /home/koola/lammpi-7.1.1/etc
> > > n-1<31283> ssi:boot:base: looking for boot schema file:
> > > n-1<31283> ssi:boot:base: lam-bhost.def
> > > n-1<31283> ssi:boot:base: found boot schema:
> > > /home/koola/lammpi-7.1.1/etc/lam-bhost.def
> > > n-1<31283> ssi:boot:rsh: found the following hosts:
> > > n-1<31283> ssi:boot:rsh: n0 sophia.musc.edu (cpu=1)
> > > n-1<31283> ssi:boot:rsh: n1 athena.musc.edu (cpu=1)
> > > n-1<31283> ssi:boot:rsh: resolved hosts:
> > > n-1<31283> ssi:boot:rsh: n0 sophia.musc.edu --> 128.23.19.53 (origin)
> > > n-1<31283> ssi:boot:rsh: n1 athena.musc.edu --> 128.23.19.23
> > > n-1<31283> ssi:boot:rsh: starting RTE procs
> > > n-1<31283> ssi:boot:base:linear: starting
> > > n-1<31283> ssi:boot:base:server: opening server TCP socket
> > > n-1<31283> ssi:boot:base:server: opened port 33594
> > > n-1<31283> ssi:boot:base:linear: booting n0 (sophia.musc.edu)
> > > n-1<31283> ssi:boot:rsh: starting lamd on (sophia.musc.edu)
> > > n-1<31283> ssi:boot:rsh: starting on n0 (sophia.musc.edu): hboot -t -c
> > > lam-conf.lamd -d -I -H 128.23.19.53 -P 33594 -n 0 -o 0
> > > n-1<31283> ssi:boot:rsh: launching locally
> > > hboot: performing tkill
> > > hboot: tkill -d
> > > tkill: setting prefix to (null)
> > > tkill: setting suffix to (null)
> > > tkill: got killname back: /tmp/lam-koola_at_sophia/lam-killfile
> > > tkill: removing socket file ...
> > > tkill: socket file: /tmp/lam-koola_at_sophia/lam-kernel-socketd
> > > tkill: removing IO daemon socket file ...
> > > tkill: IO daemon socket file: /tmp/lam-koola_at_sophia/lam-io-socket
> > > tkill: f_kill = "/tmp/lam-koola_at_sophia/lam-killfile"
> > > tkill: nothing to kill: "/tmp/lam-koola_at_sophia/lam-killfile"
> > > hboot: booting...
> > > hboot: fork /home/koola/bin/lammpi-7.1.1/bin/lamd
> > > [1] 31286 lamd -H 128.23.19.53 -P 33594 -n 0 -o 0 -d
> > > hboot: attempting to execute
> > > n-1<31283> ssi:boot:rsh: successfully launched on n0 (sophia.musc.edu)
> > > n-1<31283> ssi:boot:base:server: expecting connection from finite list
> > > n-1<31286> ssi:boot:open: opening
> > > n-1<31286> ssi:boot:open: opening boot module globus
> > > n-1<31286> ssi:boot:open: opened boot module globus
> > > n-1<31286> ssi:boot:open: opening boot module rsh
> > > n-1<31286> ssi:boot:open: opened boot module rsh
> > > n-1<31286> ssi:boot:open: opening boot module slurm
> > > n-1<31286> ssi:boot:open: opened boot module slurm
> > > n-1<31286> ssi:boot:select: initializing boot module globus
> > > n-1<31286> ssi:boot:globus: globus-job-run not found, globus boot will
> > > not run
> > > n-1<31286> ssi:boot:select: boot module not available: globus
> > > n-1<31286> ssi:boot:select: initializing boot module rsh
> > > n-1<31286> ssi:boot:rsh: module initializing
> > > n-1<31286> ssi:boot:rsh:agent: ssh -x
> > > n-1<31286> ssi:boot:rsh:username: <same>
> > > n-1<31286> ssi:boot:rsh:verbose: 1000
> > > n-1<31286> ssi:boot:rsh:algorithm: linear
> > > n-1<31286> ssi:boot:rsh:no_n: 0
> > > n-1<31286> ssi:boot:rsh:no_profile: 0
> > > n-1<31286> ssi:boot:rsh:fast: 0
> > > n-1<31286> ssi:boot:rsh:ignore_stderr: 0
> > > n-1<31286> ssi:boot:rsh:priority: 10
> > > n-1<31286> ssi:boot:select: boot module available: rsh, priority: 10
> > > n-1<31286> ssi:boot:select: initializing boot module slurm
> > > n-1<31286> ssi:boot:slurm: not running under SLURM
> > > n-1<31286> ssi:boot:select: boot module not available: slurm
> > > n-1<31286> ssi:boot:select: finalizing boot module globus
> > > n-1<31286> ssi:boot:globus: finalizing
> > > n-1<31286> ssi:boot:select: closing boot module globus
> > > n-1<31286> ssi:boot:select: finalizing boot module slurm
> > > n-1<31286> ssi:boot:slurm: finalizing
> > > n-1<31286> ssi:boot:select: closing boot module slurm
> > > n-1<31286> ssi:boot:select: selected boot module rsh
> > > n-1<31286> ssi:boot:send_lamd: getting node ID from command line
> > > n-1<31286> ssi:boot:send_lamd: getting agent haddr from command line
> > > n-1<31286> ssi:boot:send_lamd: getting agent port from command line
> > > n-1<31286> ssi:boot:send_lamd: getting node ID from command line
> > > n-1<31286> ssi:boot:send_lamd: connecting to 128.23.19.53:33594, node
> > > id 0
> > > n-1<31286> ssi:boot:send_lamd: sending dli_port 32807
> > > n-1<31283> ssi:boot:base:server: got connection from 128.23.19.53
> > > n-1<31283> ssi:boot:base:server: this connection is expected (n0)
> > > n-1<31283> ssi:boot:base:server: remote lamd is at 128.23.19.53:32807
> > > n-1<31283> ssi:boot:base:linear: booting n1 (athena.musc.edu)
> > > n-1<31283> ssi:boot:rsh: starting lamd on (athena.musc.edu)
> > > n-1<31283> ssi:boot:rsh: starting on n1 (athena.musc.edu): hboot -t -c
> > > lam-conf.lamd -d -s -I "-H 128.23.19.53 -P 33594 -n 1 -o 0"
> > > n-1<31283> ssi:boot:rsh: launching remotely
> > > n-1<31283> ssi:boot:rsh: attempting to execute: ssh -x athena.musc.edu
> > > -n 'echo $SHELL'
> > > -----------------------------------------------------------------------
> > > ------
> > > LAM failed to execute a process on the remote node "athena.musc.edu".
> > > LAM was not trying to invoke any LAM-specific commands yet -- we were
> > > simply trying to determine what shell was being used on the remote
> > > host.
> > >
> > > LAM tried to use the remote agent command "ssh"
> > > to invoke "echo $SHELL" on the remote node.
> > >
> > > *** PLEASE READ THIS ENTIRE MESSAGE, FOLLOW ITS SUGGESTIONS, AND
> > > *** CONSULT THE "BOOTING LAM" SECTION OF THE LAM/MPI FAQ
> > > *** (http://www.lam-mpi.org/faq/) BEFORE POSTING TO THE LAM/MPI USER'S
> > > *** MAILING LIST.
> > >
> > > This usually indicates an authentication problem with the remote
> > > agent, some other configuration type of error in your .cshrc or
> > > .profile file, or you were unable to executable a command on the
> > > remote node for some other reason. The following is a list of items
> > > that you should check on the remote node:
> > >
> > > - You have an account and can login to the remote machine
> > > - Incorrect permissions on your home directory (should
> > > probably be 0755)
> > > - Incorrect permissions on your $HOME/.rhosts file (if you are
> > > using rsh -- they should probably be 0644)
> > > - You have an entry in the remote $HOME/.rhosts file (if you
> > > are using rsh) for the machine and username that you are
> > > running from
> > > - Your .cshrc/.profile must not print anything out to the
> > > standard error
> > > - Your .cshrc/.profile should set a correct TERM type
> > > - Your .cshrc/.profile should set the SHELL environment
> > > variable to your default shell
> > >
> > > Try invoking the following command at the unix command line:
> > >
> > > ssh -x athena.musc.edu -n 'echo $SHELL'
> > >
> > > You will need to configure your local setup such that you will *not*
> > > be prompted for a password to invoke this command on the remote node.
> > > No output should be printed from the remote node before the output of
> > > the command is displayed.
> > >
> > > When you can get this command to execute successfully by hand, LAM
> > > will probably be able to function properly.
> > > -----------------------------------------------------------------------
> > > ------
> > > n-1<31283> ssi:boot:base:linear: Failed to boot n1 (athena.musc.edu)
> > > n-1<31283> ssi:boot:base:server: closing server socket
> > > n-1<31283> ssi:boot:base:linear: aborted!
> > > n-1<31288> ssi:boot:open: opening
> > > n-1<31288> ssi:boot:open: opening boot module globus
> > > n-1<31288> ssi:boot:open: opened boot module globus
> > > n-1<31288> ssi:boot:open: opening boot module rsh
> > > n-1<31288> ssi:boot:open: opened boot module rsh
> > > n-1<31288> ssi:boot:open: opening boot module slurm
> > > n-1<31288> ssi:boot:open: opened boot module slurm
> > > n-1<31288> ssi:boot:select: initializing boot module globus
> > > n-1<31288> ssi:boot:globus: globus-job-run not found, globus boot will
> > > not run
> > > n-1<31288> ssi:boot:select: boot module not available: globus
> > > n-1<31288> ssi:boot:select: initializing boot module rsh
> > > n-1<31288> ssi:boot:rsh: module initializing
> > > n-1<31288> ssi:boot:rsh:agent: ssh -x
> > > n-1<31288> ssi:boot:rsh:username: <same>
> > > n-1<31288> ssi:boot:rsh:verbose: 1000
> > > n-1<31288> ssi:boot:rsh:algorithm: linear
> > > n-1<31288> ssi:boot:rsh:no_n: 0
> > > n-1<31288> ssi:boot:rsh:no_profile: 0
> > > n-1<31288> ssi:boot:rsh:fast: 0
> > > n-1<31288> ssi:boot:rsh:ignore_stderr: 0
> > > n-1<31288> ssi:boot:rsh:priority: 10
> > > n-1<31288> ssi:boot:select: boot module available: rsh, priority: 10
> > > n-1<31288> ssi:boot:select: initializing boot module slurm
> > > n-1<31288> ssi:boot:slurm: not running under SLURM
> > > n-1<31288> ssi:boot:select: boot module not available: slurm
> > > n-1<31288> ssi:boot:select: finalizing boot module globus
> > > n-1<31288> ssi:boot:globus: finalizing
> > > n-1<31288> ssi:boot:select: closing boot module globus
> > > n-1<31288> ssi:boot:select: finalizing boot module slurm
> > > n-1<31288> ssi:boot:slurm: finalizing
> > > n-1<31288> ssi:boot:select: closing boot module slurm
> > > n-1<31288> ssi:boot:select: selected boot module rsh
> > > n-1<31288> ssi:boot:base: looking for boot schema in following
> > > directories:
> > > n-1<31288> ssi:boot:base: <current directory>
> > > n-1<31288> ssi:boot:base: $TROLLIUSHOME/etc
> > > n-1<31288> ssi:boot:base: $LAMHOME/etc
> > > n-1<31288> ssi:boot:base: /home/koola/lammpi-7.1.1/etc
> > > n-1<31288> ssi:boot:base: looking for boot schema file:
> > > n-1<31288> ssi:boot:base: lam-bhost.def
> > > n-1<31288> ssi:boot:base: found boot schema:
> > > /home/koola/lammpi-7.1.1/etc/lam-bhost.def
> > > n-1<31288> ssi:boot:rsh: found the following hosts:
> > > n-1<31288> ssi:boot:rsh: n0 sophia.musc.edu (cpu=1)
> > > n-1<31288> ssi:boot:rsh: n1 athena.musc.edu (cpu=1)
> > > n-1<31288> ssi:boot:rsh: resolved hosts:
> > > n-1<31288> ssi:boot:rsh: n0 sophia.musc.edu --> 128.23.19.53 (origin)
> > > n-1<31288> ssi:boot:rsh: n1 athena.musc.edu --> 128.23.19.23
> > > n-1<31288> ssi:boot:rsh: starting RTE procs
> > > n-1<31288> ssi:boot:base:linear: starting
> > > n-1<31288> ssi:boot:base:linear: booting n0 (sophia.musc.edu)
> > > n-1<31288> ssi:boot:rsh: starting wipe on (sophia.musc.edu)
> > > n-1<31288> ssi:boot:rsh: starting on n0 (sophia.musc.edu): tkill -d
> > > n-1<31288> ssi:boot:rsh: launching locally
> > > tkill: setting prefix to (null)
> > > tkill: setting suffix to (null)
> > > tkill: got killname back: /tmp/lam-koola_at_sophia/lam-killfile
> > > tkill: removing socket file ...
> > > tkill: socket file: /tmp/lam-koola_at_sophia/lam-kernel-socketd
> > > tkill: removing IO daemon socket file ...
> > > tkill: IO daemon socket file: /tmp/lam-koola_at_sophia/lam-io-socket
> > > tkill: f_kill = "/tmp/lam-koola_at_sophia/lam-killfile"
> > > tkill: killing LAM...
> > > tkill: killing PID (SIGHUP) 31286 ...
> > > tkill: killed
> > > tkill: all finished
> > > n-1<31288> ssi:boot:rsh: successfully launched on n0 (sophia.musc.edu)
> > > n-1<31288> ssi:boot:base:linear: booting n1 (athena.musc.edu)
> > > n-1<31288> ssi:boot:rsh: starting wipe on (athena.musc.edu)
> > > n-1<31288> ssi:boot:rsh: starting on n1 (athena.musc.edu): tkill -d
> > > n-1<31288> ssi:boot:rsh: launching remotely
> > > n-1<31288> ssi:boot:rsh: attempting to execute: ssh -x athena.musc.edu
> > > -n 'echo $SHELL'
> > > -----------------------------------------------------------------------
> > > ------
> > > LAM failed to execute a process on the remote node "athena.musc.edu".
> > > LAM was not trying to invoke any LAM-specific commands yet -- we were
> > > simply trying to determine what shell was being used on the remote
> > > host.
> > >
> > > LAM tried to use the remote agent command "ssh"
> > > to invoke "echo $SHELL" on the remote node.
> > >
> > > *** PLEASE READ THIS ENTIRE MESSAGE, FOLLOW ITS SUGGESTIONS, AND
> > > *** CONSULT THE "BOOTING LAM" SECTION OF THE LAM/MPI FAQ
> > > *** (http://www.lam-mpi.org/faq/) BEFORE POSTING TO THE LAM/MPI USER'S
> > > *** MAILING LIST.
> > >
> > > This usually indicates an authentication problem with the remote
> > > agent, some other configuration type of error in your .cshrc or
> > > .profile file, or you were unable to executable a command on the
> > > remote node for some other reason. The following is a list of items
> > > that you should check on the remote node:
> > >
> > > - You have an account and can login to the remote machine
> > > - Incorrect permissions on your home directory (should
> > > probably be 0755)
> > > - Incorrect permissions on your $HOME/.rhosts file (if you are
> > > using rsh -- they should probably be 0644)
> > > - You have an entry in the remote $HOME/.rhosts file (if you
> > > are using rsh) for the machine and username that you are
> > > running from
> > > - Your .cshrc/.profile must not print anything out to the
> > > standard error
> > > - Your .cshrc/.profile should set a correct TERM type
> > > - Your .cshrc/.profile should set the SHELL environment
> > > variable to your default shell
> > >
> > > Try invoking the following command at the unix command line:
> > >
> > > ssh -x athena.musc.edu -n 'echo $SHELL'
> > >
> > > You will need to configure your local setup such that you will *not*
> > > be prompted for a password to invoke this command on the remote node.
> > > No output should be printed from the remote node before the output of
> > > the command is displayed.
> > >
> > > When you can get this command to execute successfully by hand, LAM
> > > will probably be able to function properly.
> > > -----------------------------------------------------------------------
> > > ------
> > > n-1<31288> ssi:boot:base:linear: Failed to boot n1 (athena.musc.edu)
> > > n-1<31288> ssi:boot:base:linear: aborted!
> > > lamboot did NOT complete successfully
> > >
> > > _______________________________________________
> > > This list is archived at http://www.lam-mpi.org/MailArchives/lam/
> > >
> >
> > --
> > {+} Jeff Squyres
> > {+} jsquyres_at_[hidden]
> > {+} http://www.lam-mpi.org/
> >
> >
>