Hi all,
Perhaps you can help me out, because I'm at a loss here.
I'm compiling lam-mpi (7.1.1) from source with the PGI compilers and
configured it with the following options:
./configure --prefix=/cadappl/lam/7.1.1-32.sge --with-rsh="ssh -Y"
The build and installation process goes well, but whenever I try a
'lamboot' it fails with the following messages:
LAM failed to execute a LAM binary on the remote node "nlcftcs12.ehv.cft.philips.com".
Since LAM was already able to determine your remote shell as "hboot",
it is probable that this is not an authentication problem.
---
*** PLEASE READ THIS ENTIRE MESSAGE, FOLLOW ITS SUGGESTIONS, AND
*** CONSULT THE "BOOTING LAM" SECTION OF THE LAM/MPI FAQ
*** (http://www.lam-mpi.org/faq/) BEFORE POSTING TO THE LAM/MPI USER'S
*** MAILING LIST.
LAM tried to use the remote agent command "ssh"
to invoke the following command:
ssh -Y nlcftcs12.ehv.cft.philips.com -n '( ! [ -e ./.profile] || . ./.profile;' tkill -d )
---
I know that the ssh command should read as follows:
ssh -Y nlcftcs12.ehv.cft.philips.com -n \
'( ! [ -e ./.profile ]; . ./.profile ; /cadappl/lam/7.1.1-32.sge/bin/tkill )'
The closing "'" after the sourcing of my .profile is in the wrong
place, the ']' needs a space in front of it (ksh complains
"[: missing ]"), and I think the test should be ended with a ';'.
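
A minimal sketch of the two forms, reproducible locally without ssh or
LAM. It assumes a POSIX sh and mktemp; the scratch directory, the dummy
.profile and the PROFILE_SOURCED variable are made up for illustration.
The well-formed variant keeps the "||" from the generated command and
only adds the space before ']' and moves the closing quote:

```shell
#!/bin/sh
# Reproduce the parse problem locally, without ssh or LAM.
# Work in a scratch directory that contains a dummy .profile
# (hypothetical content, only used to see whether it gets sourced).
workdir=$(mktemp -d)
cd "$workdir" || exit 1
echo 'PROFILE_SOURCED=yes' > .profile

# Broken test as it appears in the generated command: "]" is glued to
# the file name, so "[" never sees its closing argument and fails.
broken_status=0
sh -c '[ -e ./.profile]' 2>/dev/null || broken_status=$?

# Well-formed variant: whitespace before "]", and the whole compound
# command (including the closing parenthesis) inside one pair of quotes.
fixed_output=$(sh -c '( ! [ -e ./.profile ] || . ./.profile; echo "$PROFILE_SOURCED" )')

echo "broken test exit status: $broken_status"
echo "fixed command printed:   $fixed_output"
```

If the broken form exits non-zero and the fixed form sources the dummy
.profile, the failure is a pure quoting/whitespace issue in the
generated string and not something in the remote environment.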
However, I can't find where this command is configured; it seems to
be hardcoded.
The problem also does _not_ appear in a beta of version 7.1.2, where
lamboot seems to work perfectly (no differences in configure options
except the prefix).
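
One way to narrow this down would be to search both source trees for
the place where the ".profile" test is assembled and compare the hits.
This is only a sketch: the helper name find_profile_snippet and the
directory names in the comments are assumptions, and it relies on a
grep that supports -r (GNU and BSD grep do):

```shell
#!/bin/sh
# Print every line (file:line:text) that mentions ".profile" anywhere
# under the directory given as $1. -F treats the pattern as a literal
# string, so the dot is not interpreted as a regex wildcard.
find_profile_snippet() {
    grep -rnF '.profile' "$1"
}

# Example usage (assumed paths, adjust to wherever the two tarballs
# were unpacked):
#   find_profile_snippet ./lam-7.1.1 > hits-7.1.1.txt
#   find_profile_snippet ./lam-7.1.2 > hits-7.1.2.txt
#   diff hits-7.1.1.txt hits-7.1.2.txt
```

Since each hit is prefixed with its own tree's path, the diff will be
noisy, but it should be enough to spot which file builds the string.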
I've attached the output of lamboot -d lam-hostfile.txt.
Does anyone know what the difference might be between 7.1.1 and 7.1.2
that explains this behaviour?
Thanks in advance,
Jeroen Kleijer
-----------------------------
nly00281_at_nlcftcs11> lamboot -d /cadappl/lam/7.1.1-32.sge/etc/lam-hostmap.txt \
-prefix /cadappl/lam/7.1.1-32.sge
n-1<18947> ssi:boot:open: opening
n-1<18947> ssi:boot:open: opening boot module globus
n-1<18947> ssi:boot:open: opened boot module globus
n-1<18947> ssi:boot:open: opening boot module rsh
n-1<18947> ssi:boot:open: opened boot module rsh
n-1<18947> ssi:boot:open: opening boot module slurm
n-1<18947> ssi:boot:open: opened boot module slurm
n-1<18947> ssi:boot:select: initializing boot module slurm
n-1<18947> ssi:boot:slurm: not running under SLURM
n-1<18947> ssi:boot:select: boot module not available: slurm
n-1<18947> ssi:boot:select: initializing boot module rsh
n-1<18947> ssi:boot:rsh: module initializing
n-1<18947> ssi:boot:rsh:agent: ssh -Y
n-1<18947> ssi:boot:rsh:username: <same>
n-1<18947> ssi:boot:rsh:verbose: 1000
n-1<18947> ssi:boot:rsh:algorithm: linear
n-1<18947> ssi:boot:rsh:no_n: 0
n-1<18947> ssi:boot:rsh:no_profile: 0
n-1<18947> ssi:boot:rsh:fast: 0
n-1<18947> ssi:boot:rsh:ignore_stderr: 0
n-1<18947> ssi:boot:rsh:priority: 10
n-1<18947> ssi:boot:select: boot module available: rsh, priority: 10
n-1<18947> ssi:boot:select: initializing boot module globus
n-1<18947> ssi:boot:globus: globus-job-run not found, globus boot will not run
n-1<18947> ssi:boot:select: boot module not available: globus
n-1<18947> ssi:boot:select: finalizing boot module slurm
n-1<18947> ssi:boot:slurm: finalizing
n-1<18947> ssi:boot:select: closing boot module slurm
n-1<18947> ssi:boot:select: finalizing boot module globus
n-1<18947> ssi:boot:globus: finalizing
n-1<18947> ssi:boot:select: closing boot module globus
n-1<18947> ssi:boot:select: selected boot module rsh
LAM 7.1.1/MPI 2 C++/ROMIO - Indiana University
n-1<18947> ssi:boot:base: looking for boot schema in following directories:
n-1<18947> ssi:boot:base: <current directory>
n-1<18947> ssi:boot:base: $TROLLIUSHOME/etc
n-1<18947> ssi:boot:base: $LAMHOME/etc
n-1<18947> ssi:boot:base: /cadappl/lam/7.1.1-32.sge/etc
n-1<18947> ssi:boot:base: looking for boot schema file:
n-1<18947> ssi:boot:base: /cadappl/lam/7.1.1-32.sge/etc/lam-hostmap.txt
n-1<18947> ssi:boot:base: found boot schema: /cadappl/lam/7.1.1-32.sge/etc/lam-hostmap.txt
n-1<18947> ssi:boot:rsh: found the following hosts:
n-1<18947> ssi:boot:rsh: n0 nlcftcs11.ehv.cft.philips.com (cpu=1)
n-1<18947> ssi:boot:rsh: n1 nlcftcs12.ehv.cft.philips.com (cpu=1)
n-1<18947> ssi:boot:rsh: n2 nlcftcs13.ehv.cft.philips.com (cpu=1)
n-1<18947> ssi:boot:rsh: n3 nlcftcs14.ehv.cft.philips.com (cpu=1)
n-1<18947> ssi:boot:rsh: resolved hosts:
n-1<18947> ssi:boot:rsh: n0 nlcftcs11.ehv.cft.philips.com --> 130.144.81.142 (origin)
n-1<18947> ssi:boot:rsh: n1 nlcftcs12.ehv.cft.philips.com --> 130.144.81.150
n-1<18947> ssi:boot:rsh: n2 nlcftcs13.ehv.cft.philips.com --> 130.144.81.156
n-1<18947> ssi:boot:rsh: n3 nlcftcs14.ehv.cft.philips.com --> 130.144.81.162
n-1<18947> ssi:boot:rsh: starting RTE procs
n-1<18947> ssi:boot:base:linear: starting
n-1<18947> ssi:boot:base:server: opening server TCP socket
n-1<18947> ssi:boot:base:server: opened port 32972
n-1<18947> ssi:boot:base:linear: booting n0 (nlcftcs11.ehv.cft.philips.com)
n-1<18947> ssi:boot:rsh: starting lamd on (nlcftcs11.ehv.cft.philips.com)
n-1<18947> ssi:boot:rsh: starting on n0 (nlcftcs11.ehv.cft.philips.com): hboot -t -c lam-conf.lamd -d -I -H 130.144.81.142 -P 32972 -n 0 -o 0
n-1<18947> ssi:boot:rsh: launching locally
hboot: performing tkill
hboot: tkill -d
tkill: setting prefix to (null)
tkill: setting suffix to (null)
tkill: got killname back: /tmp/lam-nly00281_at_nlcftcs11/lam-killfile
tkill: removing socket file ...
tkill: socket file: /tmp/lam-nly00281_at_nlcftcs11/lam-kernel-socketd
tkill: removing IO daemon socket file ...
tkill: IO daemon socket file: /tmp/lam-nly00281_at_nlcftcs11/lam-io-socket
tkill: f_kill = "/tmp/lam-nly00281_at_nlcftcs11/lam-killfile"
tkill: nothing to kill: "/tmp/lam-nly00281_at_nlcftcs11/lam-killfile"
hboot: booting...
hboot: fork /cadappl/lam/7.1.1-32.sge/bin/lamd
hboot: attempting to execute
[1] 18950 lamd -H 130.144.81.142 -P 32972 -n 0 -o 0 -d
n-1<18947> ssi:boot:rsh: successfully launched on n0 (nlcftcs11.ehv.cft.philips.com)
n-1<18947> ssi:boot:base:server: expecting connection from finite list
n-1<18950> ssi:boot:open: opening
n-1<18950> ssi:boot:open: opening boot module globus
n-1<18950> ssi:boot:open: opened boot module globus
n-1<18950> ssi:boot:open: opening boot module rsh
n-1<18950> ssi:boot:open: opened boot module rsh
n-1<18950> ssi:boot:open: opening boot module slurm
n-1<18950> ssi:boot:open: opened boot module slurm
n-1<18950> ssi:boot:select: initializing boot module slurm
n-1<18950> ssi:boot:slurm: not running under SLURM
n-1<18950> ssi:boot:select: boot module not available: slurm
n-1<18950> ssi:boot:select: initializing boot module rsh
n-1<18950> ssi:boot:rsh: module initializing
n-1<18950> ssi:boot:rsh:agent: ssh -Y
n-1<18950> ssi:boot:rsh:username: <same>
n-1<18950> ssi:boot:rsh:verbose: 1000
n-1<18950> ssi:boot:rsh:algorithm: linear
n-1<18950> ssi:boot:rsh:no_n: 0
n-1<18950> ssi:boot:rsh:no_profile: 0
n-1<18950> ssi:boot:rsh:fast: 0
n-1<18950> ssi:boot:rsh:ignore_stderr: 0
n-1<18950> ssi:boot:rsh:priority: 10
n-1<18950> ssi:boot:select: boot module available: rsh, priority: 10
n-1<18950> ssi:boot:select: initializing boot module globus
n-1<18950> ssi:boot:globus: globus-job-run not found, globus boot will not run
n-1<18950> ssi:boot:select: boot module not available: globus
n-1<18950> ssi:boot:select: finalizing boot module slurm
n-1<18950> ssi:boot:slurm: finalizing
n-1<18950> ssi:boot:select: closing boot module slurm
n-1<18950> ssi:boot:select: finalizing boot module globus
n-1<18950> ssi:boot:globus: finalizing
n-1<18950> ssi:boot:select: closing boot module globus
n-1<18950> ssi:boot:select: selected boot module rsh
n-1<18950> ssi:boot:send_lamd: getting node ID from command line
n-1<18950> ssi:boot:send_lamd: getting agent haddr from command line
n-1<18950> ssi:boot:send_lamd: getting agent port from command line
n-1<18950> ssi:boot:send_lamd: getting node ID from command line
n-1<18950> ssi:boot:send_lamd: connecting to 130.144.81.142:32972, node id 0
n-1<18947> ssi:boot:base:server: got connection from 130.144.81.142
n-1<18947> ssi:boot:base:server: this connection is expected (n0)
n-1<18950> ssi:boot:send_lamd: sending dli_port 33092
n-1<18947> ssi:boot:base:server: remote lamd is at 130.144.81.142:33092
n-1<18947> ssi:boot:base:linear: booting n1 (nlcftcs12.ehv.cft.philips.com)
n-1<18947> ssi:boot:rsh: starting lamd on (nlcftcs12.ehv.cft.philips.com)
n-1<18947> ssi:boot:rsh: starting on n1 (nlcftcs12.ehv.cft.philips.com): hboot -t -c lam-conf.lamd -d -s -I "-H 130.144.81.142 -P 32972 -n 1 -o 0"
n-1<18947> ssi:boot:rsh: launching remotely
n-1<18947> ssi:boot:rsh: attempting to execute: ssh -Y nlcftcs12.ehv.cft.philips.com -n 'echo $SHELL'
n-1<18947> ssi:boot:rsh: remote shell /bin/ksh
n-1<18947> ssi:boot:rsh: attempting to execute: ssh -Y nlcftcs12.ehv.cft.philips.com -n '( ! [ -e ./.profile] || . ./.profile;' hboot -t -c lam-conf.lamd -d -s -I '"-H 130.144.81.142 -P 32972 -n 1 -o 0"' )
ERROR: LAM/MPI unexpectedly received the following on stderr:
ksh: [: missing ]
ksh: hboot: not found
-----------------------------------------------------------------------------
LAM failed to execute a LAM binary on the remote node "nlcftcs12.ehv.cft.philips.com".
Since LAM was already able to determine your remote shell as "hboot",
it is probable that this is not an authentication problem.
*** PLEASE READ THIS ENTIRE MESSAGE, FOLLOW ITS SUGGESTIONS, AND
*** CONSULT THE "BOOTING LAM" SECTION OF THE LAM/MPI FAQ
*** (http://www.lam-mpi.org/faq/) BEFORE POSTING TO THE LAM/MPI USER'S
*** MAILING LIST.
LAM tried to use the remote agent command "ssh"
to invoke the following command:
ssh -Y nlcftcs12.ehv.cft.philips.com -n '( ! [ -e ./.profile] || . ./.profile;' hboot -t -c lam-conf.lamd -d -s -I '"-H 130.144.81.142 -P 32972 -n 1 -o 0"' )
This can indicate several things. You should check the following:
- The LAM binaries are in your $PATH
- You can run the LAM binaries
- The $PATH variable is set properly before your
.cshrc/.profile exits
Try to invoke the command listed above manually at a Unix prompt.
You will need to configure your local setup such that you will *not*
be prompted for a password to invoke this command on the remote node.
No output should be printed from the remote node before the output of
the command is displayed.
When you can get this command to execute successfully by hand, LAM
will probably be able to function properly.
-----------------------------------------------------------------------------
n-1<18947> ssi:boot:base:linear: Failed to boot n1 (nlcftcs12.ehv.cft.philips.com)
n-1<18947> ssi:boot:base:server: closing server socket
n-1<18947> ssi:boot:base:linear: aborted!
n-1<18957> ssi:boot:open: opening
n-1<18957> ssi:boot:open: opening boot module globus
n-1<18957> ssi:boot:open: opened boot module globus
n-1<18957> ssi:boot:open: opening boot module rsh
n-1<18957> ssi:boot:open: opened boot module rsh
n-1<18957> ssi:boot:open: opening boot module slurm
n-1<18957> ssi:boot:open: opened boot module slurm
n-1<18957> ssi:boot:select: initializing boot module slurm
n-1<18957> ssi:boot:slurm: not running under SLURM
n-1<18957> ssi:boot:select: boot module not available: slurm
n-1<18957> ssi:boot:select: initializing boot module rsh
n-1<18957> ssi:boot:rsh: module initializing
n-1<18957> ssi:boot:rsh:agent: ssh -Y
n-1<18957> ssi:boot:rsh:username: <same>
n-1<18957> ssi:boot:rsh:verbose: 1000
n-1<18957> ssi:boot:rsh:algorithm: linear
n-1<18957> ssi:boot:rsh:no_n: 0
n-1<18957> ssi:boot:rsh:no_profile: 0
n-1<18957> ssi:boot:rsh:fast: 0
n-1<18957> ssi:boot:rsh:ignore_stderr: 0
n-1<18957> ssi:boot:rsh:priority: 10
n-1<18957> ssi:boot:select: boot module available: rsh, priority: 10
n-1<18957> ssi:boot:select: initializing boot module globus
n-1<18957> ssi:boot:globus: globus-job-run not found, globus boot will not run
n-1<18957> ssi:boot:select: boot module not available: globus
n-1<18957> ssi:boot:select: finalizing boot module slurm
n-1<18957> ssi:boot:slurm: finalizing
n-1<18957> ssi:boot:select: closing boot module slurm
n-1<18957> ssi:boot:select: finalizing boot module globus
n-1<18957> ssi:boot:globus: finalizing
n-1<18957> ssi:boot:select: closing boot module globus
n-1<18957> ssi:boot:select: selected boot module rsh
n-1<18957> ssi:boot:base: looking for boot schema in following directories:
n-1<18957> ssi:boot:base: <current directory>
n-1<18957> ssi:boot:base: $TROLLIUSHOME/etc
n-1<18957> ssi:boot:base: $LAMHOME/etc
n-1<18957> ssi:boot:base: /cadappl/lam/7.1.1-32.sge/etc
n-1<18957> ssi:boot:base: looking for boot schema file:
n-1<18957> ssi:boot:base: /cadappl/lam/7.1.1-32.sge/etc/lam-hostmap.txt
n-1<18957> ssi:boot:base: found boot schema: /cadappl/lam/7.1.1-32.sge/etc/lam-hostmap.txt
n-1<18957> ssi:boot:rsh: found the following hosts:
n-1<18957> ssi:boot:rsh: n0 nlcftcs11.ehv.cft.philips.com (cpu=1)
n-1<18957> ssi:boot:rsh: n1 nlcftcs12.ehv.cft.philips.com (cpu=1)
n-1<18957> ssi:boot:rsh: n2 nlcftcs13.ehv.cft.philips.com (cpu=1)
n-1<18957> ssi:boot:rsh: n3 nlcftcs14.ehv.cft.philips.com (cpu=1)
n-1<18957> ssi:boot:rsh: resolved hosts:
n-1<18957> ssi:boot:rsh: n0 nlcftcs11.ehv.cft.philips.com --> 130.144.81.142 (origin)
n-1<18957> ssi:boot:rsh: n1 nlcftcs12.ehv.cft.philips.com --> 130.144.81.150
n-1<18957> ssi:boot:rsh: n2 nlcftcs13.ehv.cft.philips.com --> 130.144.81.156
n-1<18957> ssi:boot:rsh: n3 nlcftcs14.ehv.cft.philips.com --> 130.144.81.162
n-1<18957> ssi:boot:rsh: starting RTE procs
n-1<18957> ssi:boot:base:linear: starting
n-1<18957> ssi:boot:base:linear: booting n0 (nlcftcs11.ehv.cft.philips.com)
n-1<18957> ssi:boot:rsh: starting wipe on (nlcftcs11.ehv.cft.philips.com)
n-1<18957> ssi:boot:rsh: starting on n0 (nlcftcs11.ehv.cft.philips.com): tkill -d
n-1<18957> ssi:boot:rsh: launching locally
tkill: setting prefix to (null)
tkill: setting suffix to (null)
tkill: got killname back: /tmp/lam-nly00281_at_nlcftcs11/lam-killfile
tkill: removing socket file ...
tkill: socket file: /tmp/lam-nly00281_at_nlcftcs11/lam-kernel-socketd
tkill: removing IO daemon socket file ...
tkill: IO daemon socket file: /tmp/lam-nly00281_at_nlcftcs11/lam-io-socket
tkill: f_kill = "/tmp/lam-nly00281_at_nlcftcs11/lam-killfile"
tkill: killing LAM...
tkill: killing PID (SIGHUP) 18950 ...
tkill: killed
tkill: all finished
n-1<18957> ssi:boot:rsh: successfully launched on n0 (nlcftcs11.ehv.cft.philips.com)
n-1<18957> ssi:boot:base:linear: booting n1 (nlcftcs12.ehv.cft.philips.com)
n-1<18957> ssi:boot:rsh: starting wipe on (nlcftcs12.ehv.cft.philips.com)
n-1<18957> ssi:boot:rsh: starting on n1 (nlcftcs12.ehv.cft.philips.com): tkill -d
n-1<18957> ssi:boot:rsh: launching remotely
n-1<18957> ssi:boot:rsh: attempting to execute: ssh -Y nlcftcs12.ehv.cft.philips.com -n 'echo $SHELL'
n-1<18957> ssi:boot:rsh: remote shell /bin/ksh
n-1<18957> ssi:boot:rsh: attempting to execute: ssh -Y nlcftcs12.ehv.cft.philips.com -n '( ! [ -e ./.profile] || . ./.profile;' tkill -d )
ERROR: LAM/MPI unexpectedly received the following on stderr:
ksh: [: missing ]
ksh: tkill: not found
-----------------------------------------------------------------------------
LAM failed to execute a LAM binary on the remote node "nlcftcs12.ehv.cft.philips.com".
Since LAM was already able to determine your remote shell as "tkill",
it is probable that this is not an authentication problem.
*** PLEASE READ THIS ENTIRE MESSAGE, FOLLOW ITS SUGGESTIONS, AND
*** CONSULT THE "BOOTING LAM" SECTION OF THE LAM/MPI FAQ
*** (http://www.lam-mpi.org/faq/) BEFORE POSTING TO THE LAM/MPI USER'S
*** MAILING LIST.
LAM tried to use the remote agent command "ssh"
to invoke the following command:
ssh -Y nlcftcs12.ehv.cft.philips.com -n '( ! [ -e ./.profile] || . ./.profile;' tkill -d )
This can indicate several things. You should check the following:
- The LAM binaries are in your $PATH
- You can run the LAM binaries
- The $PATH variable is set properly before your
.cshrc/.profile exits
Try to invoke the command listed above manually at a Unix prompt.
You will need to configure your local setup such that you will *not*
be prompted for a password to invoke this command on the remote node.
No output should be printed from the remote node before the output of
the command is displayed.
When you can get this command to execute successfully by hand, LAM
will probably be able to function properly.
-----------------------------------------------------------------------------
n-1<18957> ssi:boot:base:linear: Failed to boot n1 (nlcftcs12.ehv.cft.philips.com)
n-1<18957> ssi:boot:base:linear: aborted!
lamboot did NOT complete successfully
nly00281_at_nlcftcs11>