LAM/MPI General User's Mailing List Archives

From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2004-11-22 17:23:04


Yes, this is a known problem in 7.1.1 (the main issue is the missing
space before "]") and has been fixed in the 7.1.2 beta.
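
To illustrate why the missing space matters, here is a minimal sketch
that should run in any POSIX shell. The scratch directory, the
PROFILE_SOURCED variable, and the *_status variable names are made up
for this illustration and are not part of LAM; only the guard
expression is taken from the lamboot output quoted below.

```shell
# Work in a scratch directory so no real .profile is touched (assumption:
# mktemp is available, as it is on most current systems).
tmpdir=$(mktemp -d)
echo 'PROFILE_SOURCED=yes' > "$tmpdir/.profile"

# Form emitted by 7.1.1: "]" is glued to the file name, so "[" never
# sees a separate closing "]" argument and fails with a "missing ]" error.
broken_status=0
( cd "$tmpdir" && sh -c '[ -e ./.profile]' ) 2>/dev/null || broken_status=$?

# Corrected form: "]" is its own word, so the test runs normally.
fixed_status=0
( cd "$tmpdir" && sh -c '[ -e ./.profile ]' ) || fixed_status=$?

# The whole guard, correctly spaced and quoted as one string:
# source ./.profile only if it exists, then run the next command.
profile_sourced=$(cd "$tmpdir" && \
    sh -c '( ! [ -e ./.profile ] || . ./.profile; echo "$PROFILE_SOURCED" )')

echo "broken: $broken_status, fixed: $fixed_status, sourced: $profile_sourced"
rm -rf "$tmpdir"
```

The broken form exits non-zero before the profile is ever sourced,
which is consistent with the remote $PATH never being set and the
"hboot: not found" / "tkill: not found" messages in the log.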

Sorry for the confusion. :-(

On Nov 22, 2004, at 5:12 PM, Jeroen Kleijer wrote:

>
> Hi all,
>
> Perhaps you can help me out because I'm at a loss around here.
> I'm compiling lam-mpi (7.1.1) from source with the PGI compilers and
> configured it with the following options:
> ./configure --prefix=/cadappl/lam/7.1.1-32.sge --with-rsh="ssh -Y"
> The build and installation process go well, but whenever I try a
> 'lamboot' it fails with the following messages:
>
> LAM failed to execute a LAM binary on the remote node
> "nlcftcs12.ehv.cft.philips.com".
> Since LAM was already able to determine your remote shell as "hboot",
> it is probable that this is not an authentication problem.
>
> ---
> *** PLEASE READ THIS ENTIRE MESSAGE, FOLLOW ITS SUGGESTIONS, AND
> *** CONSULT THE "BOOTING LAM" SECTION OF THE LAM/MPI FAQ
> *** (http://www.lam-mpi.org/faq/) BEFORE POSTING TO THE LAM/MPI USER'S
> *** MAILING LIST.
>
> LAM tried to use the remote agent command "ssh"
> to invoke the following command:
>
> ssh -Y nlcftcs12.ehv.cft.philips.com -n '( ! [ -e ./.profile]
> || . ./.profile;' tkill -d )
>
> ---
>
> I know that the ssh command should read as follows:
> ssh -Y nlcftcs12.ehv.cft.philips.com -n \
> '( ! [ -e ./.profile ]; . ./.profile ;
> /cadappl/lam/7.1.1-32.sge/bin/tkill )'
>
> The placement of the "'" after the sourcing of my .profile is wrong and
> the ']' should be ended with a ';'.
>
> However, I can't find where this is configured; it seems to be
> hardcoded.
>
> The problem does _not_ appear in the 7.1.2 beta, and lamboot seems to
> work perfectly with that version (no differences in configure options
> except the prefix).
>
> I've attached the output of lamboot -d lam-hostfile.txt.
>
> Does anyone know what the difference might be between 7.1.1 and 7.1.2
> that explains this behaviour?
>
> Thanks in advance,
>
> Jeroen Kleijer
>
> -----------------------------
>
> nly00281_at_nlcftcs11> lamboot -d
> /cadappl/lam/7.1.1-32.sge/etc/lam-hostmap.txt \
> -prefix /cadappl/lam/7.1.1-32.sge
> n-1<18947> ssi:boot:open: opening
> n-1<18947> ssi:boot:open: opening boot module globus
> n-1<18947> ssi:boot:open: opened boot module globus
> n-1<18947> ssi:boot:open: opening boot module rsh
> n-1<18947> ssi:boot:open: opened boot module rsh
> n-1<18947> ssi:boot:open: opening boot module slurm
> n-1<18947> ssi:boot:open: opened boot module slurm
> n-1<18947> ssi:boot:select: initializing boot module slurm
> n-1<18947> ssi:boot:slurm: not running under SLURM
> n-1<18947> ssi:boot:select: boot module not available: slurm
> n-1<18947> ssi:boot:select: initializing boot module rsh
> n-1<18947> ssi:boot:rsh: module initializing
> n-1<18947> ssi:boot:rsh:agent: ssh -Y
> n-1<18947> ssi:boot:rsh:username: <same>
> n-1<18947> ssi:boot:rsh:verbose: 1000
> n-1<18947> ssi:boot:rsh:algorithm: linear
> n-1<18947> ssi:boot:rsh:no_n: 0
> n-1<18947> ssi:boot:rsh:no_profile: 0
> n-1<18947> ssi:boot:rsh:fast: 0
> n-1<18947> ssi:boot:rsh:ignore_stderr: 0
> n-1<18947> ssi:boot:rsh:priority: 10
> n-1<18947> ssi:boot:select: boot module available: rsh, priority: 10
> n-1<18947> ssi:boot:select: initializing boot module globus
> n-1<18947> ssi:boot:globus: globus-job-run not found, globus boot will
> not run
> n-1<18947> ssi:boot:select: boot module not available: globus
> n-1<18947> ssi:boot:select: finalizing boot module slurm
> n-1<18947> ssi:boot:slurm: finalizing
> n-1<18947> ssi:boot:select: closing boot module slurm
> n-1<18947> ssi:boot:select: finalizing boot module globus
> n-1<18947> ssi:boot:globus: finalizing
> n-1<18947> ssi:boot:select: closing boot module globus
> n-1<18947> ssi:boot:select: selected boot module rsh
>
> LAM 7.1.1/MPI 2 C++/ROMIO - Indiana University
>
> n-1<18947> ssi:boot:base: looking for boot schema in following
> directories:
> n-1<18947> ssi:boot:base: <current directory>
> n-1<18947> ssi:boot:base: $TROLLIUSHOME/etc
> n-1<18947> ssi:boot:base: $LAMHOME/etc
> n-1<18947> ssi:boot:base: /cadappl/lam/7.1.1-32.sge/etc
> n-1<18947> ssi:boot:base: looking for boot schema file:
> n-1<18947> ssi:boot:base:
> /cadappl/lam/7.1.1-32.sge/etc/lam-hostmap.txt
> n-1<18947> ssi:boot:base: found boot schema:
> /cadappl/lam/7.1.1-32.sge/etc/lam-hostmap.txt
> n-1<18947> ssi:boot:rsh: found the following hosts:
> n-1<18947> ssi:boot:rsh: n0 nlcftcs11.ehv.cft.philips.com (cpu=1)
> n-1<18947> ssi:boot:rsh: n1 nlcftcs12.ehv.cft.philips.com (cpu=1)
> n-1<18947> ssi:boot:rsh: n2 nlcftcs13.ehv.cft.philips.com (cpu=1)
> n-1<18947> ssi:boot:rsh: n3 nlcftcs14.ehv.cft.philips.com (cpu=1)
> n-1<18947> ssi:boot:rsh: resolved hosts:
> n-1<18947> ssi:boot:rsh: n0 nlcftcs11.ehv.cft.philips.com -->
> 130.144.81.142 (origin)
> n-1<18947> ssi:boot:rsh: n1 nlcftcs12.ehv.cft.philips.com -->
> 130.144.81.150
> n-1<18947> ssi:boot:rsh: n2 nlcftcs13.ehv.cft.philips.com -->
> 130.144.81.156
> n-1<18947> ssi:boot:rsh: n3 nlcftcs14.ehv.cft.philips.com -->
> 130.144.81.162
> n-1<18947> ssi:boot:rsh: starting RTE procs
> n-1<18947> ssi:boot:base:linear: starting
> n-1<18947> ssi:boot:base:server: opening server TCP socket
> n-1<18947> ssi:boot:base:server: opened port 32972
> n-1<18947> ssi:boot:base:linear: booting n0
> (nlcftcs11.ehv.cft.philips.com)
> n-1<18947> ssi:boot:rsh: starting lamd on
> (nlcftcs11.ehv.cft.philips.com)
> n-1<18947> ssi:boot:rsh: starting on n0
> (nlcftcs11.ehv.cft.philips.com): hboot -t -c lam-conf.lamd -d -I -H
> 130.144.81.142 -P 32972 -n 0 -o 0
> n-1<18947> ssi:boot:rsh: launching locally
> hboot: performing tkill
> hboot: tkill -d
> tkill: setting prefix to (null)
> tkill: setting suffix to (null)
> tkill: got killname back: /tmp/lam-nly00281_at_nlcftcs11/lam-killfile
> tkill: removing socket file ...
> tkill: socket file: /tmp/lam-nly00281_at_nlcftcs11/lam-kernel-socketd
> tkill: removing IO daemon socket file ...
> tkill: IO daemon socket file: /tmp/lam-nly00281_at_nlcftcs11/lam-io-socket
> tkill: f_kill = "/tmp/lam-nly00281_at_nlcftcs11/lam-killfile"
> tkill: nothing to kill: "/tmp/lam-nly00281_at_nlcftcs11/lam-killfile"
> hboot: booting...
> hboot: fork /cadappl/lam/7.1.1-32.sge/bin/lamd
> hboot: attempting to execute
> [1] 18950 lamd -H 130.144.81.142 -P 32972 -n 0 -o 0 -d
> n-1<18947> ssi:boot:rsh: successfully launched on n0
> (nlcftcs11.ehv.cft.philips.com)
> n-1<18947> ssi:boot:base:server: expecting connection from finite list
> n-1<18950> ssi:boot:open: opening
> n-1<18950> ssi:boot:open: opening boot module globus
> n-1<18950> ssi:boot:open: opened boot module globus
> n-1<18950> ssi:boot:open: opening boot module rsh
> n-1<18950> ssi:boot:open: opened boot module rsh
> n-1<18950> ssi:boot:open: opening boot module slurm
> n-1<18950> ssi:boot:open: opened boot module slurm
> n-1<18950> ssi:boot:select: initializing boot module slurm
> n-1<18950> ssi:boot:slurm: not running under SLURM
> n-1<18950> ssi:boot:select: boot module not available: slurm
> n-1<18950> ssi:boot:select: initializing boot module rsh
> n-1<18950> ssi:boot:rsh: module initializing
> n-1<18950> ssi:boot:rsh:agent: ssh -Y
> n-1<18950> ssi:boot:rsh:username: <same>
> n-1<18950> ssi:boot:rsh:verbose: 1000
> n-1<18950> ssi:boot:rsh:algorithm: linear
> n-1<18950> ssi:boot:rsh:no_n: 0
> n-1<18950> ssi:boot:rsh:no_profile: 0
> n-1<18950> ssi:boot:rsh:fast: 0
> n-1<18950> ssi:boot:rsh:ignore_stderr: 0
> n-1<18950> ssi:boot:rsh:priority: 10
> n-1<18950> ssi:boot:select: boot module available: rsh, priority: 10
> n-1<18950> ssi:boot:select: initializing boot module globus
> n-1<18950> ssi:boot:globus: globus-job-run not found, globus boot will
> not run
> n-1<18950> ssi:boot:select: boot module not available: globus
> n-1<18950> ssi:boot:select: finalizing boot module slurm
> n-1<18950> ssi:boot:slurm: finalizing
> n-1<18950> ssi:boot:select: closing boot module slurm
> n-1<18950> ssi:boot:select: finalizing boot module globus
> n-1<18950> ssi:boot:globus: finalizing
> n-1<18950> ssi:boot:select: closing boot module globus
> n-1<18950> ssi:boot:select: selected boot module rsh
> n-1<18950> ssi:boot:send_lamd: getting node ID from command line
> n-1<18950> ssi:boot:send_lamd: getting agent haddr from command line
> n-1<18950> ssi:boot:send_lamd: getting agent port from command line
> n-1<18950> ssi:boot:send_lamd: getting node ID from command line
> n-1<18950> ssi:boot:send_lamd: connecting to 130.144.81.142:32972,
> node id 0
> n-1<18947> ssi:boot:base:server: got connection from 130.144.81.142
> n-1<18947> ssi:boot:base:server: this connection is expected (n0)
> n-1<18950> ssi:boot:send_lamd: sending dli_port 33092
> n-1<18947> ssi:boot:base:server: remote lamd is at 130.144.81.142:33092
> n-1<18947> ssi:boot:base:linear: booting n1
> (nlcftcs12.ehv.cft.philips.com)
> n-1<18947> ssi:boot:rsh: starting lamd on
> (nlcftcs12.ehv.cft.philips.com)
> n-1<18947> ssi:boot:rsh: starting on n1
> (nlcftcs12.ehv.cft.philips.com): hboot -t -c lam-conf.lamd -d -s -I
> "-H 130.144.81.142 -P 32972 -n 1 -o 0"
> n-1<18947> ssi:boot:rsh: launching remotely
> n-1<18947> ssi:boot:rsh: attempting to execute: ssh -Y
> nlcftcs12.ehv.cft.philips.com -n 'echo $SHELL'
> n-1<18947> ssi:boot:rsh: remote shell /bin/ksh
> n-1<18947> ssi:boot:rsh: attempting to execute: ssh -Y
> nlcftcs12.ehv.cft.philips.com -n '( ! [ -e ./.profile] || .
> ./.profile;' hboot -t -c lam-conf.lamd -d -s -I '"-H 130.144.81.142 -P
> 32972 -n 1 -o 0"' )
> ERROR: LAM/MPI unexpectedly received the following on stderr:
> ksh: [: missing ]
> ksh: hboot: not found
> -----------------------------------------------------------------------
> ------
> LAM failed to execute a LAM binary on the remote node
> "nlcftcs12.ehv.cft.philips.com".
> Since LAM was already able to determine your remote shell as "hboot",
> it is probable that this is not an authentication problem.
>
> *** PLEASE READ THIS ENTIRE MESSAGE, FOLLOW ITS SUGGESTIONS, AND
> *** CONSULT THE "BOOTING LAM" SECTION OF THE LAM/MPI FAQ
> *** (http://www.lam-mpi.org/faq/) BEFORE POSTING TO THE LAM/MPI USER'S
> *** MAILING LIST.
>
> LAM tried to use the remote agent command "ssh"
> to invoke the following command:
>
> ssh -Y nlcftcs12.ehv.cft.philips.com -n '( ! [ -e ./.profile]
> || . ./.profile;' hboot -t -c lam-conf.lamd -d -s -I '"-H
> 130.144.81.142 -P 32972 -n 1 -o 0"' )
>
> This can indicate several things. You should check the following:
>
> - The LAM binaries are in your $PATH
> - You can run the LAM binaries
> - The $PATH variable is set properly before your
> .cshrc/.profile exits
>
> Try to invoke the command listed above manually at a Unix prompt.
>
> You will need to configure your local setup such that you will *not*
> be prompted for a password to invoke this command on the remote node.
> No output should be printed from the remote node before the output of
> the command is displayed.
>
> When you can get this command to execute successfully by hand, LAM
> will probably be able to function properly.
> -----------------------------------------------------------------------
> ------
> n-1<18947> ssi:boot:base:linear: Failed to boot n1
> (nlcftcs12.ehv.cft.philips.com)
> n-1<18947> ssi:boot:base:server: closing server socket
> n-1<18947> ssi:boot:base:linear: aborted!
> n-1<18957> ssi:boot:open: opening
> n-1<18957> ssi:boot:open: opening boot module globus
> n-1<18957> ssi:boot:open: opened boot module globus
> n-1<18957> ssi:boot:open: opening boot module rsh
> n-1<18957> ssi:boot:open: opened boot module rsh
> n-1<18957> ssi:boot:open: opening boot module slurm
> n-1<18957> ssi:boot:open: opened boot module slurm
> n-1<18957> ssi:boot:select: initializing boot module slurm
> n-1<18957> ssi:boot:slurm: not running under SLURM
> n-1<18957> ssi:boot:select: boot module not available: slurm
> n-1<18957> ssi:boot:select: initializing boot module rsh
> n-1<18957> ssi:boot:rsh: module initializing
> n-1<18957> ssi:boot:rsh:agent: ssh -Y
> n-1<18957> ssi:boot:rsh:username: <same>
> n-1<18957> ssi:boot:rsh:verbose: 1000
> n-1<18957> ssi:boot:rsh:algorithm: linear
> n-1<18957> ssi:boot:rsh:no_n: 0
> n-1<18957> ssi:boot:rsh:no_profile: 0
> n-1<18957> ssi:boot:rsh:fast: 0
> n-1<18957> ssi:boot:rsh:ignore_stderr: 0
> n-1<18957> ssi:boot:rsh:priority: 10
> n-1<18957> ssi:boot:select: boot module available: rsh, priority: 10
> n-1<18957> ssi:boot:select: initializing boot module globus
> n-1<18957> ssi:boot:globus: globus-job-run not found, globus boot will
> not run
> n-1<18957> ssi:boot:select: boot module not available: globus
> n-1<18957> ssi:boot:select: finalizing boot module slurm
> n-1<18957> ssi:boot:slurm: finalizing
> n-1<18957> ssi:boot:select: closing boot module slurm
> n-1<18957> ssi:boot:select: finalizing boot module globus
> n-1<18957> ssi:boot:globus: finalizing
> n-1<18957> ssi:boot:select: closing boot module globus
> n-1<18957> ssi:boot:select: selected boot module rsh
> n-1<18957> ssi:boot:base: looking for boot schema in following
> directories:
> n-1<18957> ssi:boot:base: <current directory>
> n-1<18957> ssi:boot:base: $TROLLIUSHOME/etc
> n-1<18957> ssi:boot:base: $LAMHOME/etc
> n-1<18957> ssi:boot:base: /cadappl/lam/7.1.1-32.sge/etc
> n-1<18957> ssi:boot:base: looking for boot schema file:
> n-1<18957> ssi:boot:base:
> /cadappl/lam/7.1.1-32.sge/etc/lam-hostmap.txt
> n-1<18957> ssi:boot:base: found boot schema:
> /cadappl/lam/7.1.1-32.sge/etc/lam-hostmap.txt
> n-1<18957> ssi:boot:rsh: found the following hosts:
> n-1<18957> ssi:boot:rsh: n0 nlcftcs11.ehv.cft.philips.com (cpu=1)
> n-1<18957> ssi:boot:rsh: n1 nlcftcs12.ehv.cft.philips.com (cpu=1)
> n-1<18957> ssi:boot:rsh: n2 nlcftcs13.ehv.cft.philips.com (cpu=1)
> n-1<18957> ssi:boot:rsh: n3 nlcftcs14.ehv.cft.philips.com (cpu=1)
> n-1<18957> ssi:boot:rsh: resolved hosts:
> n-1<18957> ssi:boot:rsh: n0 nlcftcs11.ehv.cft.philips.com -->
> 130.144.81.142 (origin)
> n-1<18957> ssi:boot:rsh: n1 nlcftcs12.ehv.cft.philips.com -->
> 130.144.81.150
> n-1<18957> ssi:boot:rsh: n2 nlcftcs13.ehv.cft.philips.com -->
> 130.144.81.156
> n-1<18957> ssi:boot:rsh: n3 nlcftcs14.ehv.cft.philips.com -->
> 130.144.81.162
> n-1<18957> ssi:boot:rsh: starting RTE procs
> n-1<18957> ssi:boot:base:linear: starting
> n-1<18957> ssi:boot:base:linear: booting n0
> (nlcftcs11.ehv.cft.philips.com)
> n-1<18957> ssi:boot:rsh: starting wipe on
> (nlcftcs11.ehv.cft.philips.com)
> n-1<18957> ssi:boot:rsh: starting on n0
> (nlcftcs11.ehv.cft.philips.com): tkill -d
> n-1<18957> ssi:boot:rsh: launching locally
> tkill: setting prefix to (null)
> tkill: setting suffix to (null)
> tkill: got killname back: /tmp/lam-nly00281_at_nlcftcs11/lam-killfile
> tkill: removing socket file ...
> tkill: socket file: /tmp/lam-nly00281_at_nlcftcs11/lam-kernel-socketd
> tkill: removing IO daemon socket file ...
> tkill: IO daemon socket file: /tmp/lam-nly00281_at_nlcftcs11/lam-io-socket
> tkill: f_kill = "/tmp/lam-nly00281_at_nlcftcs11/lam-killfile"
> tkill: killing LAM...
> tkill: killing PID (SIGHUP) 18950 ...
> tkill: killed
> tkill: all finished
> n-1<18957> ssi:boot:rsh: successfully launched on n0
> (nlcftcs11.ehv.cft.philips.com)
> n-1<18957> ssi:boot:base:linear: booting n1
> (nlcftcs12.ehv.cft.philips.com)
> n-1<18957> ssi:boot:rsh: starting wipe on
> (nlcftcs12.ehv.cft.philips.com)
> n-1<18957> ssi:boot:rsh: starting on n1
> (nlcftcs12.ehv.cft.philips.com): tkill -d
> n-1<18957> ssi:boot:rsh: launching remotely
> n-1<18957> ssi:boot:rsh: attempting to execute: ssh -Y
> nlcftcs12.ehv.cft.philips.com -n 'echo $SHELL'
> n-1<18957> ssi:boot:rsh: remote shell /bin/ksh
> n-1<18957> ssi:boot:rsh: attempting to execute: ssh -Y
> nlcftcs12.ehv.cft.philips.com -n '( ! [ -e ./.profile] || .
> ./.profile;' tkill -d )
> ERROR: LAM/MPI unexpectedly received the following on stderr:
> ksh: [: missing ]
> ksh: tkill: not found
> -----------------------------------------------------------------------
> ------
> LAM failed to execute a LAM binary on the remote node
> "nlcftcs12.ehv.cft.philips.com".
> Since LAM was already able to determine your remote shell as "tkill",
> it is probable that this is not an authentication problem.
>
> *** PLEASE READ THIS ENTIRE MESSAGE, FOLLOW ITS SUGGESTIONS, AND
> *** CONSULT THE "BOOTING LAM" SECTION OF THE LAM/MPI FAQ
> *** (http://www.lam-mpi.org/faq/) BEFORE POSTING TO THE LAM/MPI USER'S
> *** MAILING LIST.
>
> LAM tried to use the remote agent command "ssh"
> to invoke the following command:
>
> ssh -Y nlcftcs12.ehv.cft.philips.com -n '( ! [ -e ./.profile]
> || . ./.profile;' tkill -d )
>
> This can indicate several things. You should check the following:
>
> - The LAM binaries are in your $PATH
> - You can run the LAM binaries
> - The $PATH variable is set properly before your
> .cshrc/.profile exits
>
> Try to invoke the command listed above manually at a Unix prompt.
>
> You will need to configure your local setup such that you will *not*
> be prompted for a password to invoke this command on the remote node.
> No output should be printed from the remote node before the output of
> the command is displayed.
>
> When you can get this command to execute successfully by hand, LAM
> will probably be able to function properly.
> -----------------------------------------------------------------------
> ------
> n-1<18957> ssi:boot:base:linear: Failed to boot n1
> (nlcftcs12.ehv.cft.philips.com)
> n-1<18957> ssi:boot:base:linear: aborted!
> lamboot did NOT complete successfully
> nly00281_at_nlcftcs11>
>
> _______________________________________________
> This list is archived at http://www.lam-mpi.org/MailArchives/lam/
>

-- 
{+} Jeff Squyres
{+} jsquyres_at_[hidden]
{+} http://www.lam-mpi.org/