Hello,
I've been using LAM 7.0.3 and now 7.1.1 quite successfully for the past
year. However, there is one intermittent problem that occurs frequently, and
I have no clue what to do about it. We are running jobs on a 48-node Red Hat
Linux 9.0 cluster. There is no firewall between the nodes, passwordless
logins work fine, etc. We use PBS Pro/Maui as the scheduling system. For
certain kinds of calculations, we repeatedly launch the same parallel
executable with different input files within one job. That usually works
fine, but in some cases lamboot fails. The sequence of LAM commands issued
for each individual run is:
export MPIHOME=/usr/local/lam-7.1.1
$MPIHOME/bin/recon $MACHINEFILE
$MPIHOME/bin/lamboot -b -d -s $MACHINEFILE
$MPIHOME/bin/mpirun -O -ssi rpi lamd -np ${NPROCS} executable <arguments>
$MPIHOME/bin/lamclean
$MPIHOME/bin/lamhalt
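For illustration, here is a rough sketch of how this sequence could be
wrapped so that a failed lamboot is detected and retried instead of falling
through to mpirun. This is not what our script currently does. The retry and
run_one functions, the attempt count (3), and the delay (10 seconds) are
made-up names and values, and the sketch assumes that lamboot exits with a
non-zero status when it fails. It would not explain the failure, only work
around it:

```shell
#!/bin/sh
# Hypothetical helper (not part of LAM): run a command up to $1 times,
# sleeping $2 seconds between failed attempts.
# Returns 0 on the first success, 1 if every attempt fails.
retry() {
    attempts=$1
    delay=$2
    shift 2
    i=1
    while [ "$i" -le "$attempts" ]; do
        if "$@"; then
            return 0
        fi
        echo "attempt $i of $attempts failed: $*" >&2
        i=$((i + 1))
        if [ "$i" -le "$attempts" ]; then
            sleep "$delay"
        fi
    done
    return 1
}

# Hypothetical wrapper around one run. The LAM commands and options are the
# ones listed above; only the retry and the exit-status handling are new.
# Assumes MPIHOME, MACHINEFILE and NPROCS are already set by the job script.
run_one() {
    # Assumption: lamboot returns non-zero when booting fails.
    retry 3 10 "$MPIHOME/bin/lamboot" -b -d -s "$MACHINEFILE" || return 1
    "$MPIHOME/bin/mpirun" -O -ssi rpi lamd -np "$NPROCS" "$@"
    status=$?
    "$MPIHOME/bin/lamclean"
    "$MPIHOME/bin/lamhalt"
    return "$status"
}
```

Each pass of the loop would then call "run_one executable <arguments>" and
could stop, or skip that input file, when it returns a non-zero status.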
As an example, the full debug output of one job is included below. The
first run started and finished fine. Then, at the start of the second
iteration of the loop, lamboot fails. Whenever it fails, it always fails to
connect to the master node (origin = star25 in the example below). We cannot
change our script to submit independent jobs, since the input files must be
created on the fly. Does anyone have an idea what's going wrong here? I
appreciate any ideas!
Thank you very much,
Lars
-----------------------------------------------------------------------------
Woo hoo!
recon has completed successfully. This means that you will most likely
be able to boot LAM successfully with the "lamboot" command (but this
is not a guarantee). See the lamboot(1) manual page for more
information on the lamboot command.
If you have problems booting LAM (with lamboot) even though recon
worked successfully, enable the "-d" option to lamboot to examine each
step of lamboot and see what fails. Most situations where recon
succeeds and lamboot fails have to do with the hboot(1) command (that
lamboot invokes on each host in the hostfile).
-----------------------------------------------------------------------------
n-1<9814> ssi:boot:open: opening
n-1<9814> ssi:boot:open: opening boot module globus
n-1<9814> ssi:boot:open: opened boot module globus
n-1<9814> ssi:boot:open: opening boot module rsh
n-1<9814> ssi:boot:open: opened boot module rsh
n-1<9814> ssi:boot:open: opening boot module slurm
n-1<9814> ssi:boot:open: opened boot module slurm
n-1<9814> ssi:boot:select: initializing boot module globus
n-1<9814> ssi:boot:globus: globus-job-run not found, globus boot will not
run
n-1<9814> ssi:boot:select: boot module not available: globus
n-1<9814> ssi:boot:select: initializing boot module rsh
n-1<9814> ssi:boot:rsh: module initializing
n-1<9814> ssi:boot:rsh:agent: ssh -x
n-1<9814> ssi:boot:rsh:username: <same>
n-1<9814> ssi:boot:rsh:verbose: 1000
n-1<9814> ssi:boot:rsh:algorithm: linear
n-1<9814> ssi:boot:rsh:no_n: 0
n-1<9814> ssi:boot:rsh:no_profile: 0
n-1<9814> ssi:boot:rsh:fast: 0
n-1<9814> ssi:boot:rsh:ignore_stderr: 0
n-1<9814> ssi:boot:rsh:priority: 10
n-1<9814> ssi:boot:select: boot module available: rsh, priority: 10
n-1<9814> ssi:boot:select: initializing boot module slurm
n-1<9814> ssi:boot:slurm: not running under SLURM
n-1<9814> ssi:boot:select: boot module not available: slurm
n-1<9814> ssi:boot:select: finalizing boot module globus
n-1<9814> ssi:boot:globus: finalizing
n-1<9814> ssi:boot:select: closing boot module globus
n-1<9814> ssi:boot:select: finalizing boot module slurm
n-1<9814> ssi:boot:slurm: finalizing
n-1<9814> ssi:boot:select: closing boot module slurm
n-1<9814> ssi:boot:select: selected boot module rsh
n-1<9814> ssi:boot:base: looking for boot schema in following directories:
n-1<9814> ssi:boot:base: <current directory>
n-1<9814> ssi:boot:base: $TROLLIUSHOME/etc
n-1<9814> ssi:boot:base: $LAMHOME/etc
n-1<9814> ssi:boot:base: /usr/local/lam-7.1.1//etc
n-1<9814> ssi:boot:base: looking for boot schema file:
n-1<9814> ssi:boot:base: /var/spool/PBS/aux/9286.polaris.che.wisc.edu
n-1<9814> ssi:boot:base: found boot schema:
/var/spool/PBS/aux/9286.polaris.che.wisc.edu
n-1<9814> ssi:boot:rsh: found the following hosts:
n-1<9814> ssi:boot:rsh: n0 star25 (cpu=1)
n-1<9814> ssi:boot:rsh: n1 star24 (cpu=1)
n-1<9814> ssi:boot:rsh: n2 star06 (cpu=1)
n-1<9814> ssi:boot:rsh: resolved hosts:
n-1<9814> ssi:boot:rsh: n0 star25 --> 11.0.0.25 (origin)
n-1<9814> ssi:boot:rsh: n1 star24 --> 11.0.0.24
n-1<9814> ssi:boot:rsh: n2 star06 --> 11.0.0.6
n-1<9814> ssi:boot:rsh: starting RTE procs
n-1<9814> ssi:boot:base:linear: starting
n-1<9814> ssi:boot:base:server: opening server TCP socket
n-1<9814> ssi:boot:base:server: opened port 33789
n-1<9814> ssi:boot:base:linear: booting n0 (star25)
n-1<9814> ssi:boot:rsh: starting lamd on (star25)
n-1<9814> ssi:boot:rsh: starting on n0 (star25): hboot -t -c lam-conf.lamd
-d -sessionsuffix pbs-9286.polaris.che.wisc.edu -s -I -H
11.0.0.25 -P 33789 -n 0 -o 0
n-1<9814> ssi:boot:rsh: launching locally
n-1<9814> ssi:boot:rsh: successfully launched on n0 (star25)
n-1<9814> ssi:boot:base:server: expecting connection from finite list
n-1<9814> ssi:boot:base:server: got connection from 11.0.0.25
n-1<9814> ssi:boot:base:server: this connection is expected (n0)
n-1<9814> ssi:boot:base:server: remote lamd is at 11.0.0.25:32831
n-1<9814> ssi:boot:base:linear: booting n1 (star24)
n-1<9814> ssi:boot:rsh: starting lamd on (star24)
n-1<9814> ssi:boot:rsh: starting on n1 (star24): hboot -t -c lam-conf.lamd
-d -sessionsuffix pbs-9286.polaris.che.wisc.edu -s -I "-H
11.0.0.25 -P 33789 -n 1 -o 0"
n-1<9814> ssi:boot:rsh: launching remotely
n-1<9814> ssi:boot:rsh: attempting to execute: ssh -x star24 -n 'echo
$SHELL'
n-1<9814> ssi:boot:rsh: remote shell /bin/tcsh
n-1<9814> ssi:boot:rsh: attempting to execute: ssh -x star24 -n hboot -t -c lam-conf.lamd -d -sessionsuffix pbs-9286.polaris.che.wisc.edu -s -I '"-H 11.0.0.25 -P 33789 -n 1 -o 0"'
n-1<9814> ssi:boot:rsh: successfully launched on n1 (star24)
n-1<9814> ssi:boot:base:server: expecting connection from finite list
n-1<9814> ssi:boot:base:server: got connection from 11.0.0.24
n-1<9814> ssi:boot:base:server: this connection is expected (n1)
n-1<9814> ssi:boot:base:server: remote lamd is at 11.0.0.24:32793
n-1<9814> ssi:boot:base:linear: booting n2 (star06)
n-1<9814> ssi:boot:rsh: starting lamd on (star06)
n-1<9814> ssi:boot:rsh: starting on n2 (star06): hboot -t -c lam-conf.lamd
-d -sessionsuffix pbs-9286.polaris.che.wisc.edu -s -I "-H
11.0.0.25 -P 33789 -n 2 -o 0"
n-1<9814> ssi:boot:rsh: launching remotely
n-1<9814> ssi:boot:rsh: attempting to execute: ssh -x star06 -n 'echo
$SHELL'
n-1<9814> ssi:boot:rsh: remote shell /bin/tcsh
n-1<9814> ssi:boot:rsh: attempting to execute: ssh -x star06 -n hboot -t -c lam-conf.lamd -d -sessionsuffix pbs-9286.polaris.che.wisc.edu -s -I '"-H 11.0.0.25 -P 33789 -n 2 -o 0"'
n-1<9814> ssi:boot:rsh: successfully launched on n2 (star06)
n-1<9814> ssi:boot:base:server: expecting connection from finite list
n-1<9814> ssi:boot:base:server: got connection from 11.0.0.6
n-1<9814> ssi:boot:base:server: this connection is expected (n2)
n-1<9814> ssi:boot:base:server: remote lamd is at 11.0.0.6:32828
n-1<9814> ssi:boot:base:server: closing server socket
n-1<9814> ssi:boot:base:server: connecting to lamd at 11.0.0.25:33790
n-1<9814> ssi:boot:base:server: connected
n-1<9814> ssi:boot:base:server: sending number of links (3)
n-1<9814> ssi:boot:base:server: sending info: n0 (star25)
n-1<9814> ssi:boot:base:server: sending info: n1 (star24)
n-1<9814> ssi:boot:base:server: sending info: n2 (star06)
n-1<9814> ssi:boot:base:server: finished sending
n-1<9814> ssi:boot:base:server: disconnected from 11.0.0.25:33790
n-1<9814> ssi:boot:base:server: connecting to lamd at 11.0.0.24:32807
n-1<9814> ssi:boot:base:server: connected
n-1<9814> ssi:boot:base:server: sending number of links (3)
n-1<9814> ssi:boot:base:server: sending info: n0 (star25)
n-1<9814> ssi:boot:base:server: sending info: n1 (star24)
n-1<9814> ssi:boot:base:server: sending info: n2 (star06)
n-1<9814> ssi:boot:base:server: finished sending
n-1<9814> ssi:boot:base:server: disconnected from 11.0.0.24:32807
n-1<9814> ssi:boot:base:server: connecting to lamd at 11.0.0.6:32870
n-1<9814> ssi:boot:base:server: connected
n-1<9814> ssi:boot:base:server: sending number of links (3)
n-1<9814> ssi:boot:base:server: sending info: n0 (star25)
n-1<9814> ssi:boot:base:server: sending info: n1 (star24)
n-1<9814> ssi:boot:base:server: sending info: n2 (star06)
n-1<9814> ssi:boot:base:server: finished sending
n-1<9814> ssi:boot:base:server: disconnected from 11.0.0.6:32870
n-1<9814> ssi:boot:base:linear: finished
n-1<9814> ssi:boot:rsh: all RTE procs started
n-1<9814> ssi:boot:rsh: finalizing
n-1<9814> ssi:boot: Closing
-----------------------------------------------------------------------------
Woo hoo!
recon has completed successfully. This means that you will most likely
be able to boot LAM successfully with the "lamboot" command (but this
is not a guarantee). See the lamboot(1) manual page for more
information on the lamboot command.
If you have problems booting LAM (with lamboot) even though recon
worked successfully, enable the "-d" option to lamboot to examine each
step of lamboot and see what fails. Most situations where recon
succeeds and lamboot fails have to do with the hboot(1) command (that
lamboot invokes on each host in the hostfile).
-----------------------------------------------------------------------------
n-1<9873> ssi:boot:open: opening
n-1<9873> ssi:boot:open: opening boot module globus
n-1<9873> ssi:boot:open: opened boot module globus
n-1<9873> ssi:boot:open: opening boot module rsh
n-1<9873> ssi:boot:open: opened boot module rsh
n-1<9873> ssi:boot:open: opening boot module slurm
n-1<9873> ssi:boot:open: opened boot module slurm
n-1<9873> ssi:boot:select: initializing boot module globus
n-1<9873> ssi:boot:globus: globus-job-run not found, globus boot will not
run
n-1<9873> ssi:boot:select: boot module not available: globus
n-1<9873> ssi:boot:select: initializing boot module rsh
n-1<9873> ssi:boot:rsh: module initializing
n-1<9873> ssi:boot:rsh:agent: ssh -x
n-1<9873> ssi:boot:rsh:username: <same>
n-1<9873> ssi:boot:rsh:verbose: 1000
n-1<9873> ssi:boot:rsh:algorithm: linear
n-1<9873> ssi:boot:rsh:no_n: 0
n-1<9873> ssi:boot:rsh:no_profile: 0
n-1<9873> ssi:boot:rsh:fast: 0
n-1<9873> ssi:boot:rsh:ignore_stderr: 0
n-1<9873> ssi:boot:rsh:priority: 10
n-1<9873> ssi:boot:select: boot module available: rsh, priority: 10
n-1<9873> ssi:boot:select: initializing boot module slurm
n-1<9873> ssi:boot:slurm: not running under SLURM
n-1<9873> ssi:boot:select: boot module not available: slurm
n-1<9873> ssi:boot:select: finalizing boot module globus
n-1<9873> ssi:boot:globus: finalizing
n-1<9873> ssi:boot:select: closing boot module globus
n-1<9873> ssi:boot:select: finalizing boot module slurm
n-1<9873> ssi:boot:slurm: finalizing
n-1<9873> ssi:boot:select: closing boot module slurm
n-1<9873> ssi:boot:select: selected boot module rsh
n-1<9873> ssi:boot:base: looking for boot schema in following directories:
n-1<9873> ssi:boot:base: <current directory>
n-1<9873> ssi:boot:base: $TROLLIUSHOME/etc
n-1<9873> ssi:boot:base: $LAMHOME/etc
n-1<9873> ssi:boot:base: /usr/local/lam-7.1.1//etc
n-1<9873> ssi:boot:base: looking for boot schema file:
n-1<9873> ssi:boot:base: /var/spool/PBS/aux/9286.polaris.che.wisc.edu
n-1<9873> ssi:boot:base: found boot schema:
/var/spool/PBS/aux/9286.polaris.che.wisc.edu
n-1<9873> ssi:boot:rsh: found the following hosts:
n-1<9873> ssi:boot:rsh: n0 star25 (cpu=1)
n-1<9873> ssi:boot:rsh: n1 star24 (cpu=1)
n-1<9873> ssi:boot:rsh: n2 star06 (cpu=1)
n-1<9873> ssi:boot:rsh: resolved hosts:
n-1<9873> ssi:boot:rsh: n0 star25 --> 11.0.0.25 (origin)
n-1<9873> ssi:boot:rsh: n1 star24 --> 11.0.0.24
n-1<9873> ssi:boot:rsh: n2 star06 --> 11.0.0.6
n-1<9873> ssi:boot:rsh: starting RTE procs
n-1<9873> ssi:boot:base:linear: starting
n-1<9873> ssi:boot:base:server: opening server TCP socket
n-1<9873> ssi:boot:base:server: opened port 33806
n-1<9873> ssi:boot:base:linear: booting n0 (star25)
n-1<9873> ssi:boot:rsh: starting lamd on (star25)
n-1<9873> ssi:boot:rsh: starting on n0 (star25): hboot -t -c lam-conf.lamd
-d -sessionsuffix pbs-9286.polaris.che.wisc.edu -s -I -H
11.0.0.25 -P 33806 -n 0 -o 0
n-1<9873> ssi:boot:rsh: launching locally
n-1<9873> ssi:boot:rsh: successfully launched on n0 (star25)
n-1<9873> ssi:boot:base:server: expecting connection from finite list
n-1<9873> ssi:boot:base:server: got connection from 11.0.0.25
n-1<9873> ssi:boot:base:server: this connection is expected (n0)
n-1<9873> ssi:boot:base:server: remote lamd is at 11.0.0.25:32833
n-1<9873> ssi:boot:base:linear: booting n1 (star24)
n-1<9873> ssi:boot:rsh: starting lamd on (star24)
n-1<9873> ssi:boot:rsh: starting on n1 (star24): hboot -t -c lam-conf.lamd
-d -sessionsuffix pbs-9286.polaris.che.wisc.edu -s -I "-H
11.0.0.25 -P 33806 -n 1 -o 0"
n-1<9873> ssi:boot:rsh: launching remotely
n-1<9873> ssi:boot:rsh: attempting to execute: ssh -x star24 -n 'echo
$SHELL'
n-1<9873> ssi:boot:rsh: remote shell /bin/tcsh
n-1<9873> ssi:boot:rsh: attempting to execute: ssh -x star24 -n hboot -t -c lam-conf.lamd -d -sessionsuffix pbs-9286.polaris.che.wisc.edu -s -I '"-H 11.0.0.25 -P 33806 -n 1 -o 0"'
n-1<9873> ssi:boot:rsh: successfully launched on n1 (star24)
n-1<9873> ssi:boot:base:server: expecting connection from finite list
n-1<9873> ssi:boot:base:server: got connection from 11.0.0.24
n-1<9873> ssi:boot:base:server: this connection is expected (n1)
n-1<9873> ssi:boot:base:server: remote lamd is at 11.0.0.24:32794
n-1<9873> ssi:boot:base:linear: booting n2 (star06)
n-1<9873> ssi:boot:rsh: starting lamd on (star06)
n-1<9873> ssi:boot:rsh: starting on n2 (star06): hboot -t -c lam-conf.lamd
-d -sessionsuffix pbs-9286.polaris.che.wisc.edu -s -I "-H
11.0.0.25 -P 33806 -n 2 -o 0"
n-1<9873> ssi:boot:rsh: launching remotely
n-1<9873> ssi:boot:rsh: attempting to execute: ssh -x star06 -n 'echo
$SHELL'
n-1<9873> ssi:boot:rsh: remote shell /bin/tcsh
n-1<9873> ssi:boot:rsh: attempting to execute: ssh -x star06 -n hboot -t -c lam-conf.lamd -d -sessionsuffix pbs-9286.polaris.che.wisc.edu -s -I '"-H 11.0.0.25 -P 33806 -n 2 -o 0"'
n-1<9873> ssi:boot:rsh: successfully launched on n2 (star06)
n-1<9873> ssi:boot:base:server: expecting connection from finite list
n-1<9873> ssi:boot:base:server: got connection from 11.0.0.6
n-1<9873> ssi:boot:base:server: this connection is expected (n2)
n-1<9873> ssi:boot:base:server: remote lamd is at 11.0.0.6:32829
n-1<9873> ssi:boot:base:server: closing server socket
n-1<9873> ssi:boot:base:server: connecting to lamd at 11.0.0.25:33807
-----------------------------------------------------------------------------
The lamboot agent failed to open a client socket to the newly-booted
process at IP address 11.0.0.25, port 33807.
*** PLEASE READ THIS ENTIRE MESSAGE, FOLLOW ITS SUGGESTIONS, AND
*** CONSULT THE "BOOTING LAM" SECTION OF THE LAM/MPI FAQ
*** (http://www.lam-mpi.org/faq/) BEFORE POSTING TO THE LAM/MPI USER'S
*** MAILING LIST.
Although the newly-booted process has already communicated
successfully with the lamboot agent over other TCP sockets, this is
the first time that the lamboot agent tried to initiate a connection
to the newly-booted process. As such, this may indicate:
1. 11.0.0.25 is not the correct IP address for the machine where the
newly-booted machine was launched
2. There are network filters between the lamboot agent host and
the remote host such that communication on random TCP ports
is blocked
3. Network routing from the the local host to the remote isn't
properly configured (this is unlikely)
For number 1, check to ensure that 11.0.0.25 is the correct IP address for
that machine. If it is not, check the host mapping on that machine
(e.g., /etc/hosts) to ensure that 11.0.0.25 is both reachable and is the by
the host where the lamboot agent is running, and is the correct host.
For numbers 2 and 4, try to telnet to 11.0.0.25, port 33807. You should get a
"connection refused" error, which will indicate that you successfully
connected to some machine at that IP address, and no process was
listening on that port. If you get any other kind of error, check
with your system/network administrator -- it may indicate network /
routing issues between the two hosts.
-----------------------------------------------------------------------------
n-1<9873> ssi:boot:base:server: connecting to lamd at 11.0.0.24:32809
n-1<9873> ssi:boot:base:server: connected
n-1<9873> ssi:boot:base:server: sending number of links (3)
n-1<9873> ssi:boot:base:server: sending info: n0 (star25)
n-1<9873> ssi:boot:base:server: sending info: n1 (star24)
n-1<9873> ssi:boot:base:server: sending info: n2 (star06)
n-1<9873> ssi:boot:base:server: finished sending
n-1<9873> ssi:boot:base:server: disconnected from 11.0.0.24:32809
n-1<9873> ssi:boot:base:server: connecting to lamd at 11.0.0.6:32872
n-1<9873> ssi:boot:base:server: connected
n-1<9873> ssi:boot:base:server: sending number of links (3)
n-1<9873> ssi:boot:base:server: sending info: n0 (star25)
n-1<9873> ssi:boot:base:server: sending info: n1 (star24)
n-1<9873> ssi:boot:base:server: sending info: n2 (star06)
n-1<9873> ssi:boot:base:server: finished sending
n-1<9873> ssi:boot:base:server: disconnected from 11.0.0.6:32872
n-1<9873> ssi:boot:base:linear: aborted!
-----------------------------------------------------------------------------
Synopsis: lamwipe [-d] [-h] [-H] [-v] [-V] [-nn] [-np]
[-prefix </lam/install/path/>] [-w <#>] [<bhost>]
Description: This command has been obsoleted by the "lamhalt" command.
You should be using that instead. However, "lamwipe" can
still be used to shut down a LAM universe.
Options:
-b Use the faster lamwipe algorithm; will only work if shell
on all remote nodes is same as shell on local node
-d Print debugging message (implies -v)
-h Print this message
-H Don't print the header
-nn Don't add "-n" to the remote agent command line
-np Do not force the execution of $HOME/.profile on remote
hosts
-prefix Use the LAM installation in <lam/install/path/>
-v Be verbose
-V Print version and exit without shutting down LAM
-w <#> Lamwipe the first <#> nodes
<bhost> Use <bhost> as the boot schema
-----------------------------------------------------------------------------
lamboot did NOT complete successfully
-----------------------------------------------------------------------------
It seems that there is no lamd running on the host
star25.galaxy.che.wisc.edu.
This indicates that the LAM/MPI runtime environment is not operating.
The LAM/MPI runtime environment is necessary for the "mpirun" command.
Please run the "lamboot" command the start the LAM/MPI runtime
environment. See the LAM/MPI documentation for how to invoke
"lamboot" across multiple machines.
-----------------------------------------------------------------------------
-----------------------------------------------------------------------------
It seems that there is no lamd running on the host
star25.galaxy.che.wisc.edu.
This indicates that the LAM/MPI runtime environment is not operating.
The LAM/MPI runtime environment is necessary for the "lamclean" command.
Please run the "lamboot" command the start the LAM/MPI runtime
environment. See the LAM/MPI documentation for how to invoke
"lamboot" across multiple machines.
-----------------------------------------------------------------------------
-----------------------------------------------------------------------------
It seems that there is no lamd running on the host
star25.galaxy.che.wisc.edu.
This indicates that the LAM/MPI runtime environment is not operating.
The LAM/MPI runtime environment is necessary for the "lamhalt" command.
Please run the "lamboot" command the start the LAM/MPI runtime
environment. See the LAM/MPI documentation for how to invoke
"lamboot" across multiple machines.
-----------------------------------------------------------------------------