
LAM/MPI General User's Mailing List Archives


From: Lars Grabow (grabow_at_[hidden])
Date: 2005-05-31 17:21:17


Hello,

 

I've been using LAM 7.0.3, and now 7.1.1, quite successfully for the past
year. However, one intermittent problem occurs frequently, and I have no
clue what to do about it. We run jobs on a 48-node Red Hat Linux 9.0
cluster. There is no firewall between the nodes, passwordless logins work
fine, etc. We use PBS Pro/Maui as the scheduling system. For certain kinds
of calculations, we repeatedly launch the same parallel executable with
different input files within one job. That usually works fine, but in some
cases lamboot fails. The sequence of LAM commands for each individual run
is:

 

export MPIHOME=/usr/local/lam-7.1.1

$MPIHOME/bin/recon $MACHINEFILE

$MPIHOME/bin/lamboot -b -d -s $MACHINEFILE

$MPIHOME/bin/mpirun -O -ssi rpi lamd -np ${NPROCS} executable <arguments>

$MPIHOME/bin/lamclean

$MPIHOME/bin/lamhalt
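
For reference, one way a job script could guard this sequence is to retry lamboot a few times before giving up on an iteration. The following is only a minimal sketch: the retry helper, the attempt count, and the delay are illustrative assumptions, and it assumes lamboot exits with a nonzero status when it fails (worth checking against the lamboot(1) manual page). It works around the symptom and does not explain the failure.

```shell
#!/bin/sh
# retry MAX DELAY CMD...: run CMD up to MAX times, sleeping DELAY seconds
# between attempts. Returns 0 on the first success, 1 if every attempt fails.
# (Hypothetical helper; not part of LAM/MPI.)
retry() {
    max=$1; delay=$2; shift 2
    attempt=1
    while [ "$attempt" -le "$max" ]; do
        "$@" && return 0
        echo "attempt $attempt of $max failed: $*" >&2
        attempt=$((attempt + 1))
        # Only sleep if another attempt will follow.
        [ "$attempt" -le "$max" ] && sleep "$delay"
    done
    return 1
}

# Hypothetical use inside the job loop (MPIHOME, MACHINEFILE, NPROCS as above;
# 3 attempts and a 10-second pause are arbitrary choices):
#   retry 3 10 "$MPIHOME/bin/lamboot" -b -d -s "$MACHINEFILE" || exit 1
#   "$MPIHOME/bin/mpirun" -O -ssi rpi lamd -np "$NPROCS" executable <arguments>
#   "$MPIHOME/bin/lamclean"
#   "$MPIHOME/bin/lamhalt"
```

Stopping the iteration when lamboot still fails avoids the follow-on errors from mpirun, lamclean, and lamhalt, which all require a running lamd.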

 

As an example, the full debug output of one job is below. The first
iteration booted, ran, and finished fine. Then, at the start of the second
iteration, lamboot fails. Whenever it fails, it is always the connection to
the master node that fails (origin = star25 in the example below). We cannot
split the script into independent jobs, since the input files must be
created on the fly. Does anyone have an idea what's going wrong here? I
appreciate any ideas!
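
To find the failing step quickly in long "lamboot -d" output such as the log below, a small filter can list every lamd address that lamboot tried to connect to without a following "connected" line. This is a rough sketch that assumes the message wording shown in this log ("connecting to lamd at HOST:PORT" and "server: connected"); the function name failed_connects is made up for illustration.

```shell
#!/bin/sh
# failed_connects: read lamboot -d output on stdin and print each
# "connecting to lamd at HOST:PORT" target that was not followed by a
# "connected" line before the next connection attempt (or end of input).
failed_connects() {
    awk '
        /connecting to lamd at/ {
            if (pending != "") print pending   # previous attempt never connected
            pending = $NF                      # last field is HOST:PORT
            next
        }
        /server: connected/ { pending = ""; next }
        END { if (pending != "") print pending }
    '
}

# Hypothetical use, with the job output saved in a file named job.log:
#   failed_connects < job.log
```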

 

Thank you very much,

 

Lars

-----------------------------------------------------------------------------

Woo hoo!

 

recon has completed successfully. This means that you will most likely

be able to boot LAM successfully with the "lamboot" command (but this

is not a guarantee). See the lamboot(1) manual page for more

information on the lamboot command.

 

If you have problems booting LAM (with lamboot) even though recon

worked successfully, enable the "-d" option to lamboot to examine each

step of lamboot and see what fails. Most situations where recon

succeeds and lamboot fails have to do with the hboot(1) command (that

lamboot invokes on each host in the hostfile).

-----------------------------------------------------------------------------

n-1<9814> ssi:boot:open: opening

n-1<9814> ssi:boot:open: opening boot module globus

n-1<9814> ssi:boot:open: opened boot module globus

n-1<9814> ssi:boot:open: opening boot module rsh

n-1<9814> ssi:boot:open: opened boot module rsh

n-1<9814> ssi:boot:open: opening boot module slurm

n-1<9814> ssi:boot:open: opened boot module slurm

n-1<9814> ssi:boot:select: initializing boot module globus

n-1<9814> ssi:boot:globus: globus-job-run not found, globus boot will not run

n-1<9814> ssi:boot:select: boot module not available: globus

n-1<9814> ssi:boot:select: initializing boot module rsh

n-1<9814> ssi:boot:rsh: module initializing

n-1<9814> ssi:boot:rsh:agent: ssh -x

n-1<9814> ssi:boot:rsh:username: <same>

n-1<9814> ssi:boot:rsh:verbose: 1000

n-1<9814> ssi:boot:rsh:algorithm: linear

n-1<9814> ssi:boot:rsh:no_n: 0

n-1<9814> ssi:boot:rsh:no_profile: 0

n-1<9814> ssi:boot:rsh:fast: 0

n-1<9814> ssi:boot:rsh:ignore_stderr: 0

n-1<9814> ssi:boot:rsh:priority: 10

n-1<9814> ssi:boot:select: boot module available: rsh, priority: 10

n-1<9814> ssi:boot:select: initializing boot module slurm

n-1<9814> ssi:boot:slurm: not running under SLURM

n-1<9814> ssi:boot:select: boot module not available: slurm

n-1<9814> ssi:boot:select: finalizing boot module globus

n-1<9814> ssi:boot:globus: finalizing

n-1<9814> ssi:boot:select: closing boot module globus

n-1<9814> ssi:boot:select: finalizing boot module slurm

n-1<9814> ssi:boot:slurm: finalizing

n-1<9814> ssi:boot:select: closing boot module slurm

n-1<9814> ssi:boot:select: selected boot module rsh

n-1<9814> ssi:boot:base: looking for boot schema in following directories:

n-1<9814> ssi:boot:base: <current directory>

n-1<9814> ssi:boot:base: $TROLLIUSHOME/etc

n-1<9814> ssi:boot:base: $LAMHOME/etc

n-1<9814> ssi:boot:base: /usr/local/lam-7.1.1//etc

n-1<9814> ssi:boot:base: looking for boot schema file:

n-1<9814> ssi:boot:base: /var/spool/PBS/aux/9286.polaris.che.wisc.edu

n-1<9814> ssi:boot:base: found boot schema: /var/spool/PBS/aux/9286.polaris.che.wisc.edu

n-1<9814> ssi:boot:rsh: found the following hosts:

n-1<9814> ssi:boot:rsh: n0 star25 (cpu=1)

n-1<9814> ssi:boot:rsh: n1 star24 (cpu=1)

n-1<9814> ssi:boot:rsh: n2 star06 (cpu=1)

n-1<9814> ssi:boot:rsh: resolved hosts:

n-1<9814> ssi:boot:rsh: n0 star25 --> 11.0.0.25 (origin)

n-1<9814> ssi:boot:rsh: n1 star24 --> 11.0.0.24

n-1<9814> ssi:boot:rsh: n2 star06 --> 11.0.0.6

n-1<9814> ssi:boot:rsh: starting RTE procs

n-1<9814> ssi:boot:base:linear: starting

n-1<9814> ssi:boot:base:server: opening server TCP socket

n-1<9814> ssi:boot:base:server: opened port 33789

n-1<9814> ssi:boot:base:linear: booting n0 (star25)

n-1<9814> ssi:boot:rsh: starting lamd on (star25)

n-1<9814> ssi:boot:rsh: starting on n0 (star25): hboot -t -c lam-conf.lamd -d -sessionsuffix pbs-9286.polaris.che.wisc.edu -s -I -H 11.0.0.25 -P 33789 -n 0 -o 0

n-1<9814> ssi:boot:rsh: launching locally

n-1<9814> ssi:boot:rsh: successfully launched on n0 (star25)

n-1<9814> ssi:boot:base:server: expecting connection from finite list

n-1<9814> ssi:boot:base:server: got connection from 11.0.0.25

n-1<9814> ssi:boot:base:server: this connection is expected (n0)

n-1<9814> ssi:boot:base:server: remote lamd is at 11.0.0.25:32831

n-1<9814> ssi:boot:base:linear: booting n1 (star24)

n-1<9814> ssi:boot:rsh: starting lamd on (star24)

n-1<9814> ssi:boot:rsh: starting on n1 (star24): hboot -t -c lam-conf.lamd -d -sessionsuffix pbs-9286.polaris.che.wisc.edu -s -I "-H 11.0.0.25 -P 33789 -n 1 -o 0"

n-1<9814> ssi:boot:rsh: launching remotely

n-1<9814> ssi:boot:rsh: attempting to execute: ssh -x star24 -n 'echo $SHELL'

n-1<9814> ssi:boot:rsh: remote shell /bin/tcsh

n-1<9814> ssi:boot:rsh: attempting to execute: ssh -x star24 -n hboot -t -c lam-conf.lamd -d -sessionsuffix pbs-9286.polaris.che.wisc.edu -s -I '"-H 11.0.0.25 -P 33789 -n 1 -o 0"'

n-1<9814> ssi:boot:rsh: successfully launched on n1 (star24)

n-1<9814> ssi:boot:base:server: expecting connection from finite list

n-1<9814> ssi:boot:base:server: got connection from 11.0.0.24

n-1<9814> ssi:boot:base:server: this connection is expected (n1)

n-1<9814> ssi:boot:base:server: remote lamd is at 11.0.0.24:32793

n-1<9814> ssi:boot:base:linear: booting n2 (star06)

n-1<9814> ssi:boot:rsh: starting lamd on (star06)

n-1<9814> ssi:boot:rsh: starting on n2 (star06): hboot -t -c lam-conf.lamd -d -sessionsuffix pbs-9286.polaris.che.wisc.edu -s -I "-H 11.0.0.25 -P 33789 -n 2 -o 0"

n-1<9814> ssi:boot:rsh: launching remotely

n-1<9814> ssi:boot:rsh: attempting to execute: ssh -x star06 -n 'echo $SHELL'

n-1<9814> ssi:boot:rsh: remote shell /bin/tcsh

n-1<9814> ssi:boot:rsh: attempting to execute: ssh -x star06 -n hboot -t -c lam-conf.lamd -d -sessionsuffix pbs-9286.polaris.che.wisc.edu -s -I '"-H 11.0.0.25 -P 33789 -n 2 -o 0"'

n-1<9814> ssi:boot:rsh: successfully launched on n2 (star06)

n-1<9814> ssi:boot:base:server: expecting connection from finite list

n-1<9814> ssi:boot:base:server: got connection from 11.0.0.6

n-1<9814> ssi:boot:base:server: this connection is expected (n2)

n-1<9814> ssi:boot:base:server: remote lamd is at 11.0.0.6:32828

n-1<9814> ssi:boot:base:server: closing server socket

n-1<9814> ssi:boot:base:server: connecting to lamd at 11.0.0.25:33790

n-1<9814> ssi:boot:base:server: connected

n-1<9814> ssi:boot:base:server: sending number of links (3)

n-1<9814> ssi:boot:base:server: sending info: n0 (star25)

n-1<9814> ssi:boot:base:server: sending info: n1 (star24)

n-1<9814> ssi:boot:base:server: sending info: n2 (star06)

n-1<9814> ssi:boot:base:server: finished sending

n-1<9814> ssi:boot:base:server: disconnected from 11.0.0.25:33790

n-1<9814> ssi:boot:base:server: connecting to lamd at 11.0.0.24:32807

n-1<9814> ssi:boot:base:server: connected

n-1<9814> ssi:boot:base:server: sending number of links (3)

n-1<9814> ssi:boot:base:server: sending info: n0 (star25)

n-1<9814> ssi:boot:base:server: sending info: n1 (star24)

n-1<9814> ssi:boot:base:server: sending info: n2 (star06)

n-1<9814> ssi:boot:base:server: finished sending

n-1<9814> ssi:boot:base:server: disconnected from 11.0.0.24:32807

n-1<9814> ssi:boot:base:server: connecting to lamd at 11.0.0.6:32870

n-1<9814> ssi:boot:base:server: connected

n-1<9814> ssi:boot:base:server: sending number of links (3)

n-1<9814> ssi:boot:base:server: sending info: n0 (star25)

n-1<9814> ssi:boot:base:server: sending info: n1 (star24)

n-1<9814> ssi:boot:base:server: sending info: n2 (star06)

n-1<9814> ssi:boot:base:server: finished sending

n-1<9814> ssi:boot:base:server: disconnected from 11.0.0.6:32870

n-1<9814> ssi:boot:base:linear: finished

n-1<9814> ssi:boot:rsh: all RTE procs started

n-1<9814> ssi:boot:rsh: finalizing

n-1<9814> ssi:boot: Closing

-----------------------------------------------------------------------------

Woo hoo!

 

recon has completed successfully. This means that you will most likely

be able to boot LAM successfully with the "lamboot" command (but this

is not a guarantee). See the lamboot(1) manual page for more

information on the lamboot command.

 

If you have problems booting LAM (with lamboot) even though recon

worked successfully, enable the "-d" option to lamboot to examine each

step of lamboot and see what fails. Most situations where recon

succeeds and lamboot fails have to do with the hboot(1) command (that

lamboot invokes on each host in the hostfile).

-----------------------------------------------------------------------------

n-1<9873> ssi:boot:open: opening

n-1<9873> ssi:boot:open: opening boot module globus

n-1<9873> ssi:boot:open: opened boot module globus

n-1<9873> ssi:boot:open: opening boot module rsh

n-1<9873> ssi:boot:open: opened boot module rsh

n-1<9873> ssi:boot:open: opening boot module slurm

n-1<9873> ssi:boot:open: opened boot module slurm

n-1<9873> ssi:boot:select: initializing boot module globus

n-1<9873> ssi:boot:globus: globus-job-run not found, globus boot will not run

n-1<9873> ssi:boot:select: boot module not available: globus

n-1<9873> ssi:boot:select: initializing boot module rsh

n-1<9873> ssi:boot:rsh: module initializing

n-1<9873> ssi:boot:rsh:agent: ssh -x

n-1<9873> ssi:boot:rsh:username: <same>

n-1<9873> ssi:boot:rsh:verbose: 1000

n-1<9873> ssi:boot:rsh:algorithm: linear

n-1<9873> ssi:boot:rsh:no_n: 0

n-1<9873> ssi:boot:rsh:no_profile: 0

n-1<9873> ssi:boot:rsh:fast: 0

n-1<9873> ssi:boot:rsh:ignore_stderr: 0

n-1<9873> ssi:boot:rsh:priority: 10

n-1<9873> ssi:boot:select: boot module available: rsh, priority: 10

n-1<9873> ssi:boot:select: initializing boot module slurm

n-1<9873> ssi:boot:slurm: not running under SLURM

n-1<9873> ssi:boot:select: boot module not available: slurm

n-1<9873> ssi:boot:select: finalizing boot module globus

n-1<9873> ssi:boot:globus: finalizing

n-1<9873> ssi:boot:select: closing boot module globus

n-1<9873> ssi:boot:select: finalizing boot module slurm

n-1<9873> ssi:boot:slurm: finalizing

n-1<9873> ssi:boot:select: closing boot module slurm

n-1<9873> ssi:boot:select: selected boot module rsh

n-1<9873> ssi:boot:base: looking for boot schema in following directories:

n-1<9873> ssi:boot:base: <current directory>

n-1<9873> ssi:boot:base: $TROLLIUSHOME/etc

n-1<9873> ssi:boot:base: $LAMHOME/etc

n-1<9873> ssi:boot:base: /usr/local/lam-7.1.1//etc

n-1<9873> ssi:boot:base: looking for boot schema file:

n-1<9873> ssi:boot:base: /var/spool/PBS/aux/9286.polaris.che.wisc.edu

n-1<9873> ssi:boot:base: found boot schema: /var/spool/PBS/aux/9286.polaris.che.wisc.edu

n-1<9873> ssi:boot:rsh: found the following hosts:

n-1<9873> ssi:boot:rsh: n0 star25 (cpu=1)

n-1<9873> ssi:boot:rsh: n1 star24 (cpu=1)

n-1<9873> ssi:boot:rsh: n2 star06 (cpu=1)

n-1<9873> ssi:boot:rsh: resolved hosts:

n-1<9873> ssi:boot:rsh: n0 star25 --> 11.0.0.25 (origin)

n-1<9873> ssi:boot:rsh: n1 star24 --> 11.0.0.24

n-1<9873> ssi:boot:rsh: n2 star06 --> 11.0.0.6

n-1<9873> ssi:boot:rsh: starting RTE procs

n-1<9873> ssi:boot:base:linear: starting

n-1<9873> ssi:boot:base:server: opening server TCP socket

n-1<9873> ssi:boot:base:server: opened port 33806

n-1<9873> ssi:boot:base:linear: booting n0 (star25)

n-1<9873> ssi:boot:rsh: starting lamd on (star25)

n-1<9873> ssi:boot:rsh: starting on n0 (star25): hboot -t -c lam-conf.lamd -d -sessionsuffix pbs-9286.polaris.che.wisc.edu -s -I -H 11.0.0.25 -P 33806 -n 0 -o 0

n-1<9873> ssi:boot:rsh: launching locally

n-1<9873> ssi:boot:rsh: successfully launched on n0 (star25)

n-1<9873> ssi:boot:base:server: expecting connection from finite list

n-1<9873> ssi:boot:base:server: got connection from 11.0.0.25

n-1<9873> ssi:boot:base:server: this connection is expected (n0)

n-1<9873> ssi:boot:base:server: remote lamd is at 11.0.0.25:32833

n-1<9873> ssi:boot:base:linear: booting n1 (star24)

n-1<9873> ssi:boot:rsh: starting lamd on (star24)

n-1<9873> ssi:boot:rsh: starting on n1 (star24): hboot -t -c lam-conf.lamd -d -sessionsuffix pbs-9286.polaris.che.wisc.edu -s -I "-H 11.0.0.25 -P 33806 -n 1 -o 0"

n-1<9873> ssi:boot:rsh: launching remotely

n-1<9873> ssi:boot:rsh: attempting to execute: ssh -x star24 -n 'echo $SHELL'

n-1<9873> ssi:boot:rsh: remote shell /bin/tcsh

n-1<9873> ssi:boot:rsh: attempting to execute: ssh -x star24 -n hboot -t -c lam-conf.lamd -d -sessionsuffix pbs-9286.polaris.che.wisc.edu -s -I '"-H 11.0.0.25 -P 33806 -n 1 -o 0"'

n-1<9873> ssi:boot:rsh: successfully launched on n1 (star24)

n-1<9873> ssi:boot:base:server: expecting connection from finite list

n-1<9873> ssi:boot:base:server: got connection from 11.0.0.24

n-1<9873> ssi:boot:base:server: this connection is expected (n1)

n-1<9873> ssi:boot:base:server: remote lamd is at 11.0.0.24:32794

n-1<9873> ssi:boot:base:linear: booting n2 (star06)

n-1<9873> ssi:boot:rsh: starting lamd on (star06)

n-1<9873> ssi:boot:rsh: starting on n2 (star06): hboot -t -c lam-conf.lamd -d -sessionsuffix pbs-9286.polaris.che.wisc.edu -s -I "-H 11.0.0.25 -P 33806 -n 2 -o 0"

n-1<9873> ssi:boot:rsh: launching remotely

n-1<9873> ssi:boot:rsh: attempting to execute: ssh -x star06 -n 'echo $SHELL'

n-1<9873> ssi:boot:rsh: remote shell /bin/tcsh

n-1<9873> ssi:boot:rsh: attempting to execute: ssh -x star06 -n hboot -t -c lam-conf.lamd -d -sessionsuffix pbs-9286.polaris.che.wisc.edu -s -I '"-H 11.0.0.25 -P 33806 -n 2 -o 0"'

n-1<9873> ssi:boot:rsh: successfully launched on n2 (star06)

n-1<9873> ssi:boot:base:server: expecting connection from finite list

n-1<9873> ssi:boot:base:server: got connection from 11.0.0.6

n-1<9873> ssi:boot:base:server: this connection is expected (n2)

n-1<9873> ssi:boot:base:server: remote lamd is at 11.0.0.6:32829

n-1<9873> ssi:boot:base:server: closing server socket

n-1<9873> ssi:boot:base:server: connecting to lamd at 11.0.0.25:33807

-----------------------------------------------------------------------------

The lamboot agent failed to open a client socket to the newly-booted

process at IP address 11.0.0.25, port 33807.

 

*** PLEASE READ THIS ENTIRE MESSAGE, FOLLOW ITS SUGGESTIONS, AND

*** CONSULT THE "BOOTING LAM" SECTION OF THE LAM/MPI FAQ

*** (http://www.lam-mpi.org/faq/) BEFORE POSTING TO THE LAM/MPI USER'S

*** MAILING LIST.

 

Although the newly-booted process has already communicated

successfully with the lamboot agent over other TCP sockets, this is

the first time that the lamboot agent tried to initiate a connection

to the newly-booted process. As such, this may indicate:

 

        1. 11.0.0.25 is not the correct IP address for the machine where the

           newly-booted machine was launched

        2. There are network filters between the lamboot agent host and

           the remote host such that communication on random TCP ports

           is blocked

        3. Network routing from the the local host to the remote isn't

           properly configured (this is unlikely)

 

For number 1, check to ensure that 11.0.0.25 is the correct IP address for

that machine. If it is not, check the host mapping on that machine

(e.g., /etc/hosts) to ensure that 11.0.0.25 is both reachable and is the by

the host where the lamboot agent is running, and is the correct host.

 

For numbers 2 and 4, try to telnet to 11.0.0.25, port 33807. You should get a

"connection refused" error, which will indicate that you successfully

connected to some machine at that IP address, and no process was

listening on that port. If you get any other kind of error, check

with your system/network administrator -- it may indicate network /

routing issues between the two hosts.

-----------------------------------------------------------------------------

n-1<9873> ssi:boot:base:server: connecting to lamd at 11.0.0.24:32809

n-1<9873> ssi:boot:base:server: connected

n-1<9873> ssi:boot:base:server: sending number of links (3)

n-1<9873> ssi:boot:base:server: sending info: n0 (star25)

n-1<9873> ssi:boot:base:server: sending info: n1 (star24)

n-1<9873> ssi:boot:base:server: sending info: n2 (star06)

n-1<9873> ssi:boot:base:server: finished sending

n-1<9873> ssi:boot:base:server: disconnected from 11.0.0.24:32809

n-1<9873> ssi:boot:base:server: connecting to lamd at 11.0.0.6:32872

n-1<9873> ssi:boot:base:server: connected

n-1<9873> ssi:boot:base:server: sending number of links (3)

n-1<9873> ssi:boot:base:server: sending info: n0 (star25)

n-1<9873> ssi:boot:base:server: sending info: n1 (star24)

n-1<9873> ssi:boot:base:server: sending info: n2 (star06)

n-1<9873> ssi:boot:base:server: finished sending

n-1<9873> ssi:boot:base:server: disconnected from 11.0.0.6:32872

n-1<9873> ssi:boot:base:linear: aborted!

-----------------------------------------------------------------------------

Synopsis: lamwipe [-d] [-h] [-H] [-v] [-V] [-nn] [-np]

                        [-prefix </lam/install/path/>] [-w <#>] [<bhost>]

 

Description: This command has been obsoleted by the "lamhalt" command.

                You should be using that instead. However, "lamwipe" can

                still be used to shut down a LAM universe.

 

Options:

        -b Use the faster lamwipe algorithm; will only work if shell

                on all remote nodes is same as shell on local node

        -d Print debugging message (implies -v)

        -h Print this message

        -H Don't print the header

        -nn Don't add "-n" to the remote agent command line

        -np Do not force the execution of $HOME/.profile on remote

                hosts

        -prefix Use the LAM installation in <lam/install/path/>

        -v Be verbose

        -V Print version and exit without shutting down LAM

        -w <#> Lamwipe the first <#> nodes

        <bhost> Use <bhost> as the boot schema

-----------------------------------------------------------------------------

lamboot did NOT complete successfully

-----------------------------------------------------------------------------

It seems that there is no lamd running on the host star25.galaxy.che.wisc.edu.

 

This indicates that the LAM/MPI runtime environment is not operating.

The LAM/MPI runtime environment is necessary for the "mpirun" command.

 

Please run the "lamboot" command the start the LAM/MPI runtime

environment. See the LAM/MPI documentation for how to invoke

"lamboot" across multiple machines.

-----------------------------------------------------------------------------

-----------------------------------------------------------------------------

It seems that there is no lamd running on the host star25.galaxy.che.wisc.edu.

 

This indicates that the LAM/MPI runtime environment is not operating.

The LAM/MPI runtime environment is necessary for the "lamclean" command.

 

Please run the "lamboot" command the start the LAM/MPI runtime

environment. See the LAM/MPI documentation for how to invoke

"lamboot" across multiple machines.

-----------------------------------------------------------------------------

-----------------------------------------------------------------------------

It seems that there is no lamd running on the host star25.galaxy.che.wisc.edu.

 

This indicates that the LAM/MPI runtime environment is not operating.

The LAM/MPI runtime environment is necessary for the "lamhalt" command.

 

Please run the "lamboot" command the start the LAM/MPI runtime

environment. See the LAM/MPI documentation for how to invoke

"lamboot" across multiple machines.

-----------------------------------------------------------------------------