LAM/MPI General User's Mailing List Archives

From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2005-07-13 21:45:10


In the output that you show below, did you capture the stderr as well?
I don't see anything that would cause an error in your output --
typically, when something like this happens, an error message was
emitted by ssh (or something related) and it was sent to stderr.

Did anything like that occur? (sometimes the error message is hard to
find -- it may be embedded somewhere in the middle of the lamboot
output)
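
One way to make such a message easier to spot is to send stdout and stderr
to separate files. The following is only a rough sketch, assuming a
Bourne-style shell such as the bash shown in your prompt; the function name
capture_streams and the output file names are made up for illustration and
are not part of LAM.

```shell
#!/bin/sh
# Run a command with stdout and stderr saved to separate files, so that
# any error text is not buried in the middle of the debug output.
# capture_streams and the <prefix>.out / <prefix>.err names are
# illustrative choices, not anything provided by LAM.
capture_streams() {
    prefix=$1
    shift
    # Run the remaining arguments as the command, one file per stream.
    "$@" >"$prefix.out" 2>"$prefix.err"
    status=$?
    echo "exit status: $status"
    echo "stderr bytes: $(wc -c <"$prefix.err" | tr -d ' ')"
    return "$status"
}

# Example usage, with the commands from the quoted message:
#   capture_streams lamboot lamboot -d
#   capture_streams ssh-test ssh -x athena.musc.edu -n 'echo $SHELL'
```

If the .err file from the ssh test is not empty, its contents are the
first thing to look at.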

On Jul 11, 2005, at 3:27 PM, Jejo Koola wrote:

> Sorry for beating a dead horse, but I've tried everything I can think
> of.
>
> Trying to lamboot between a linux machine:
>
> Linux sophia 2.4.20-20.8smp #1 SMP Mon Aug 18 14:39:22 EDT 2003 i686
> i686 i386 GNU/Linux
>
> and
>
> a sun-solaris machine:
>
> Linux sophia 2.4.20-20.8smp #1 SMP Mon Aug 18 14:39:22 EDT 2003 i686
> i686 i386 GNU/Linux
>
> They are both running:
>
> LAM 7.1.1/MPI 2 C++/ROMIO
>
> Naturally, I have followed everything in the manual, FAQ, and tried to
> scour the list. Essentially, lamboot fails at the command:
>
> ssh -x athena -n 'echo $SHELL'
>
> when trying to lamboot from the linux machine, and vice-versa when
> trying to lamboot from the solaris machine. ssh used on the command
> line works just fine from either machine. From the linux machine,
> here is the result of using ssh on the command line:
>
> -bash-2.05b$ ssh -x athena.musc.edu -n 'echo $SHELL'
> /bin/csh
> -bash-2.05b$
>
> Here is a further example of ssh on the command line from the linux
> machine:
>
> -bash-2.05b$ ssh -x athena.musc.edu -n 'tkill -N'
> -bash-2.05b$
>
> Is there something else that can go wrong when lamboot tries to use
> ssh versus just using ssh on the command line? Thanks for any
> insight.
>
> Regards,
>
> Jejo Koola
>
> Here is the output of lamboot -d:
>
> -bash-2.05b$ lamboot -d
> n-1<31283> ssi:boot:open: opening
> n-1<31283> ssi:boot:open: opening boot module globus
> n-1<31283> ssi:boot:open: opened boot module globus
> n-1<31283> ssi:boot:open: opening boot module rsh
> n-1<31283> ssi:boot:open: opened boot module rsh
> n-1<31283> ssi:boot:open: opening boot module slurm
> n-1<31283> ssi:boot:open: opened boot module slurm
> n-1<31283> ssi:boot:select: initializing boot module globus
> n-1<31283> ssi:boot:globus: globus-job-run not found, globus boot will
> not run
> n-1<31283> ssi:boot:select: boot module not available: globus
> n-1<31283> ssi:boot:select: initializing boot module rsh
> n-1<31283> ssi:boot:rsh: module initializing
> n-1<31283> ssi:boot:rsh:agent: ssh -x
> n-1<31283> ssi:boot:rsh:username: <same>
> n-1<31283> ssi:boot:rsh:verbose: 1000
> n-1<31283> ssi:boot:rsh:algorithm: linear
> n-1<31283> ssi:boot:rsh:no_n: 0
> n-1<31283> ssi:boot:rsh:no_profile: 0
> n-1<31283> ssi:boot:rsh:fast: 0
> n-1<31283> ssi:boot:rsh:ignore_stderr: 0
> n-1<31283> ssi:boot:rsh:priority: 10
> n-1<31283> ssi:boot:select: boot module available: rsh, priority: 10
> n-1<31283> ssi:boot:select: initializing boot module slurm
> n-1<31283> ssi:boot:slurm: not running under SLURM
> n-1<31283> ssi:boot:select: boot module not available: slurm
> n-1<31283> ssi:boot:select: finalizing boot module globus
> n-1<31283> ssi:boot:globus: finalizing
> n-1<31283> ssi:boot:select: closing boot module globus
> n-1<31283> ssi:boot:select: finalizing boot module slurm
> n-1<31283> ssi:boot:slurm: finalizing
> n-1<31283> ssi:boot:select: closing boot module slurm
> n-1<31283> ssi:boot:select: selected boot module rsh
>
> LAM 7.1.1/MPI 2 C++/ROMIO - Indiana University
>
> n-1<31283> ssi:boot:base: looking for boot schema in following
> directories:
> n-1<31283> ssi:boot:base: <current directory>
> n-1<31283> ssi:boot:base: $TROLLIUSHOME/etc
> n-1<31283> ssi:boot:base: $LAMHOME/etc
> n-1<31283> ssi:boot:base: /home/koola/lammpi-7.1.1/etc
> n-1<31283> ssi:boot:base: looking for boot schema file:
> n-1<31283> ssi:boot:base: lam-bhost.def
> n-1<31283> ssi:boot:base: found boot schema:
> /home/koola/lammpi-7.1.1/etc/lam-bhost.def
> n-1<31283> ssi:boot:rsh: found the following hosts:
> n-1<31283> ssi:boot:rsh: n0 sophia.musc.edu (cpu=1)
> n-1<31283> ssi:boot:rsh: n1 athena.musc.edu (cpu=1)
> n-1<31283> ssi:boot:rsh: resolved hosts:
> n-1<31283> ssi:boot:rsh: n0 sophia.musc.edu --> 128.23.19.53 (origin)
> n-1<31283> ssi:boot:rsh: n1 athena.musc.edu --> 128.23.19.23
> n-1<31283> ssi:boot:rsh: starting RTE procs
> n-1<31283> ssi:boot:base:linear: starting
> n-1<31283> ssi:boot:base:server: opening server TCP socket
> n-1<31283> ssi:boot:base:server: opened port 33594
> n-1<31283> ssi:boot:base:linear: booting n0 (sophia.musc.edu)
> n-1<31283> ssi:boot:rsh: starting lamd on (sophia.musc.edu)
> n-1<31283> ssi:boot:rsh: starting on n0 (sophia.musc.edu): hboot -t -c
> lam-conf.lamd -d -I -H 128.23.19.53 -P 33594 -n 0 -o 0
> n-1<31283> ssi:boot:rsh: launching locally
> hboot: performing tkill
> hboot: tkill -d
> tkill: setting prefix to (null)
> tkill: setting suffix to (null)
> tkill: got killname back: /tmp/lam-koola_at_sophia/lam-killfile
> tkill: removing socket file ...
> tkill: socket file: /tmp/lam-koola_at_sophia/lam-kernel-socketd
> tkill: removing IO daemon socket file ...
> tkill: IO daemon socket file: /tmp/lam-koola_at_sophia/lam-io-socket
> tkill: f_kill = "/tmp/lam-koola_at_sophia/lam-killfile"
> tkill: nothing to kill: "/tmp/lam-koola_at_sophia/lam-killfile"
> hboot: booting...
> hboot: fork /home/koola/bin/lammpi-7.1.1/bin/lamd
> [1] 31286 lamd -H 128.23.19.53 -P 33594 -n 0 -o 0 -d
> hboot: attempting to execute
> n-1<31283> ssi:boot:rsh: successfully launched on n0 (sophia.musc.edu)
> n-1<31283> ssi:boot:base:server: expecting connection from finite list
> n-1<31286> ssi:boot:open: opening
> n-1<31286> ssi:boot:open: opening boot module globus
> n-1<31286> ssi:boot:open: opened boot module globus
> n-1<31286> ssi:boot:open: opening boot module rsh
> n-1<31286> ssi:boot:open: opened boot module rsh
> n-1<31286> ssi:boot:open: opening boot module slurm
> n-1<31286> ssi:boot:open: opened boot module slurm
> n-1<31286> ssi:boot:select: initializing boot module globus
> n-1<31286> ssi:boot:globus: globus-job-run not found, globus boot will
> not run
> n-1<31286> ssi:boot:select: boot module not available: globus
> n-1<31286> ssi:boot:select: initializing boot module rsh
> n-1<31286> ssi:boot:rsh: module initializing
> n-1<31286> ssi:boot:rsh:agent: ssh -x
> n-1<31286> ssi:boot:rsh:username: <same>
> n-1<31286> ssi:boot:rsh:verbose: 1000
> n-1<31286> ssi:boot:rsh:algorithm: linear
> n-1<31286> ssi:boot:rsh:no_n: 0
> n-1<31286> ssi:boot:rsh:no_profile: 0
> n-1<31286> ssi:boot:rsh:fast: 0
> n-1<31286> ssi:boot:rsh:ignore_stderr: 0
> n-1<31286> ssi:boot:rsh:priority: 10
> n-1<31286> ssi:boot:select: boot module available: rsh, priority: 10
> n-1<31286> ssi:boot:select: initializing boot module slurm
> n-1<31286> ssi:boot:slurm: not running under SLURM
> n-1<31286> ssi:boot:select: boot module not available: slurm
> n-1<31286> ssi:boot:select: finalizing boot module globus
> n-1<31286> ssi:boot:globus: finalizing
> n-1<31286> ssi:boot:select: closing boot module globus
> n-1<31286> ssi:boot:select: finalizing boot module slurm
> n-1<31286> ssi:boot:slurm: finalizing
> n-1<31286> ssi:boot:select: closing boot module slurm
> n-1<31286> ssi:boot:select: selected boot module rsh
> n-1<31286> ssi:boot:send_lamd: getting node ID from command line
> n-1<31286> ssi:boot:send_lamd: getting agent haddr from command line
> n-1<31286> ssi:boot:send_lamd: getting agent port from command line
> n-1<31286> ssi:boot:send_lamd: getting node ID from command line
> n-1<31286> ssi:boot:send_lamd: connecting to 128.23.19.53:33594, node
> id 0
> n-1<31286> ssi:boot:send_lamd: sending dli_port 32807
> n-1<31283> ssi:boot:base:server: got connection from 128.23.19.53
> n-1<31283> ssi:boot:base:server: this connection is expected (n0)
> n-1<31283> ssi:boot:base:server: remote lamd is at 128.23.19.53:32807
> n-1<31283> ssi:boot:base:linear: booting n1 (athena.musc.edu)
> n-1<31283> ssi:boot:rsh: starting lamd on (athena.musc.edu)
> n-1<31283> ssi:boot:rsh: starting on n1 (athena.musc.edu): hboot -t -c
> lam-conf.lamd -d -s -I "-H 128.23.19.53 -P 33594 -n 1 -o 0"
> n-1<31283> ssi:boot:rsh: launching remotely
> n-1<31283> ssi:boot:rsh: attempting to execute: ssh -x athena.musc.edu
> -n 'echo $SHELL'
> -----------------------------------------------------------------------
> ------
> LAM failed to execute a process on the remote node "athena.musc.edu".
> LAM was not trying to invoke any LAM-specific commands yet -- we were
> simply trying to determine what shell was being used on the remote
> host.
>
> LAM tried to use the remote agent command "ssh"
> to invoke "echo $SHELL" on the remote node.
>
> *** PLEASE READ THIS ENTIRE MESSAGE, FOLLOW ITS SUGGESTIONS, AND
> *** CONSULT THE "BOOTING LAM" SECTION OF THE LAM/MPI FAQ
> *** (http://www.lam-mpi.org/faq/) BEFORE POSTING TO THE LAM/MPI USER'S
> *** MAILING LIST.
>
> This usually indicates an authentication problem with the remote
> agent, some other configuration type of error in your .cshrc or
> .profile file, or you were unable to executable a command on the
> remote node for some other reason. The following is a list of items
> that you should check on the remote node:
>
> - You have an account and can login to the remote machine
> - Incorrect permissions on your home directory (should
> probably be 0755)
> - Incorrect permissions on your $HOME/.rhosts file (if you are
> using rsh -- they should probably be 0644)
> - You have an entry in the remote $HOME/.rhosts file (if you
> are using rsh) for the machine and username that you are
> running from
> - Your .cshrc/.profile must not print anything out to the
> standard error
> - Your .cshrc/.profile should set a correct TERM type
> - Your .cshrc/.profile should set the SHELL environment
> variable to your default shell
>
> Try invoking the following command at the unix command line:
>
> ssh -x athena.musc.edu -n 'echo $SHELL'
>
> You will need to configure your local setup such that you will *not*
> be prompted for a password to invoke this command on the remote node.
> No output should be printed from the remote node before the output of
> the command is displayed.
>
> When you can get this command to execute successfully by hand, LAM
> will probably be able to function properly.
> -----------------------------------------------------------------------
> ------
> n-1<31283> ssi:boot:base:linear: Failed to boot n1 (athena.musc.edu)
> n-1<31283> ssi:boot:base:server: closing server socket
> n-1<31283> ssi:boot:base:linear: aborted!
> n-1<31288> ssi:boot:open: opening
> n-1<31288> ssi:boot:open: opening boot module globus
> n-1<31288> ssi:boot:open: opened boot module globus
> n-1<31288> ssi:boot:open: opening boot module rsh
> n-1<31288> ssi:boot:open: opened boot module rsh
> n-1<31288> ssi:boot:open: opening boot module slurm
> n-1<31288> ssi:boot:open: opened boot module slurm
> n-1<31288> ssi:boot:select: initializing boot module globus
> n-1<31288> ssi:boot:globus: globus-job-run not found, globus boot will
> not run
> n-1<31288> ssi:boot:select: boot module not available: globus
> n-1<31288> ssi:boot:select: initializing boot module rsh
> n-1<31288> ssi:boot:rsh: module initializing
> n-1<31288> ssi:boot:rsh:agent: ssh -x
> n-1<31288> ssi:boot:rsh:username: <same>
> n-1<31288> ssi:boot:rsh:verbose: 1000
> n-1<31288> ssi:boot:rsh:algorithm: linear
> n-1<31288> ssi:boot:rsh:no_n: 0
> n-1<31288> ssi:boot:rsh:no_profile: 0
> n-1<31288> ssi:boot:rsh:fast: 0
> n-1<31288> ssi:boot:rsh:ignore_stderr: 0
> n-1<31288> ssi:boot:rsh:priority: 10
> n-1<31288> ssi:boot:select: boot module available: rsh, priority: 10
> n-1<31288> ssi:boot:select: initializing boot module slurm
> n-1<31288> ssi:boot:slurm: not running under SLURM
> n-1<31288> ssi:boot:select: boot module not available: slurm
> n-1<31288> ssi:boot:select: finalizing boot module globus
> n-1<31288> ssi:boot:globus: finalizing
> n-1<31288> ssi:boot:select: closing boot module globus
> n-1<31288> ssi:boot:select: finalizing boot module slurm
> n-1<31288> ssi:boot:slurm: finalizing
> n-1<31288> ssi:boot:select: closing boot module slurm
> n-1<31288> ssi:boot:select: selected boot module rsh
> n-1<31288> ssi:boot:base: looking for boot schema in following
> directories:
> n-1<31288> ssi:boot:base: <current directory>
> n-1<31288> ssi:boot:base: $TROLLIUSHOME/etc
> n-1<31288> ssi:boot:base: $LAMHOME/etc
> n-1<31288> ssi:boot:base: /home/koola/lammpi-7.1.1/etc
> n-1<31288> ssi:boot:base: looking for boot schema file:
> n-1<31288> ssi:boot:base: lam-bhost.def
> n-1<31288> ssi:boot:base: found boot schema:
> /home/koola/lammpi-7.1.1/etc/lam-bhost.def
> n-1<31288> ssi:boot:rsh: found the following hosts:
> n-1<31288> ssi:boot:rsh: n0 sophia.musc.edu (cpu=1)
> n-1<31288> ssi:boot:rsh: n1 athena.musc.edu (cpu=1)
> n-1<31288> ssi:boot:rsh: resolved hosts:
> n-1<31288> ssi:boot:rsh: n0 sophia.musc.edu --> 128.23.19.53 (origin)
> n-1<31288> ssi:boot:rsh: n1 athena.musc.edu --> 128.23.19.23
> n-1<31288> ssi:boot:rsh: starting RTE procs
> n-1<31288> ssi:boot:base:linear: starting
> n-1<31288> ssi:boot:base:linear: booting n0 (sophia.musc.edu)
> n-1<31288> ssi:boot:rsh: starting wipe on (sophia.musc.edu)
> n-1<31288> ssi:boot:rsh: starting on n0 (sophia.musc.edu): tkill -d
> n-1<31288> ssi:boot:rsh: launching locally
> tkill: setting prefix to (null)
> tkill: setting suffix to (null)
> tkill: got killname back: /tmp/lam-koola_at_sophia/lam-killfile
> tkill: removing socket file ...
> tkill: socket file: /tmp/lam-koola_at_sophia/lam-kernel-socketd
> tkill: removing IO daemon socket file ...
> tkill: IO daemon socket file: /tmp/lam-koola_at_sophia/lam-io-socket
> tkill: f_kill = "/tmp/lam-koola_at_sophia/lam-killfile"
> tkill: killing LAM...
> tkill: killing PID (SIGHUP) 31286 ...
> tkill: killed
> tkill: all finished
> n-1<31288> ssi:boot:rsh: successfully launched on n0 (sophia.musc.edu)
> n-1<31288> ssi:boot:base:linear: booting n1 (athena.musc.edu)
> n-1<31288> ssi:boot:rsh: starting wipe on (athena.musc.edu)
> n-1<31288> ssi:boot:rsh: starting on n1 (athena.musc.edu): tkill -d
> n-1<31288> ssi:boot:rsh: launching remotely
> n-1<31288> ssi:boot:rsh: attempting to execute: ssh -x athena.musc.edu
> -n 'echo $SHELL'
> -----------------------------------------------------------------------
> ------
> LAM failed to execute a process on the remote node "athena.musc.edu".
> LAM was not trying to invoke any LAM-specific commands yet -- we were
> simply trying to determine what shell was being used on the remote
> host.
>
> LAM tried to use the remote agent command "ssh"
> to invoke "echo $SHELL" on the remote node.
>
> *** PLEASE READ THIS ENTIRE MESSAGE, FOLLOW ITS SUGGESTIONS, AND
> *** CONSULT THE "BOOTING LAM" SECTION OF THE LAM/MPI FAQ
> *** (http://www.lam-mpi.org/faq/) BEFORE POSTING TO THE LAM/MPI USER'S
> *** MAILING LIST.
>
> This usually indicates an authentication problem with the remote
> agent, some other configuration type of error in your .cshrc or
> .profile file, or you were unable to executable a command on the
> remote node for some other reason. The following is a list of items
> that you should check on the remote node:
>
> - You have an account and can login to the remote machine
> - Incorrect permissions on your home directory (should
> probably be 0755)
> - Incorrect permissions on your $HOME/.rhosts file (if you are
> using rsh -- they should probably be 0644)
> - You have an entry in the remote $HOME/.rhosts file (if you
> are using rsh) for the machine and username that you are
> running from
> - Your .cshrc/.profile must not print anything out to the
> standard error
> - Your .cshrc/.profile should set a correct TERM type
> - Your .cshrc/.profile should set the SHELL environment
> variable to your default shell
>
> Try invoking the following command at the unix command line:
>
> ssh -x athena.musc.edu -n 'echo $SHELL'
>
> You will need to configure your local setup such that you will *not*
> be prompted for a password to invoke this command on the remote node.
> No output should be printed from the remote node before the output of
> the command is displayed.
>
> When you can get this command to execute successfully by hand, LAM
> will probably be able to function properly.
> -----------------------------------------------------------------------
> ------
> n-1<31288> ssi:boot:base:linear: Failed to boot n1 (athena.musc.edu)
> n-1<31288> ssi:boot:base:linear: aborted!
> lamboot did NOT complete successfully
>
> _______________________________________________
> This list is archived at http://www.lam-mpi.org/MailArchives/lam/
>

-- 
{+} Jeff Squyres
{+} jsquyres_at_[hidden]
{+} http://www.lam-mpi.org/