
LAM/MPI General User's Mailing List Archives


From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2005-05-31 22:26:38


Greetings!

Before answering your questions, I have a few:

- Is there a reason you're using the lamd rpi? It's rather slow
(compared to the tcp rpi, for example). Most people use the lamd rpi
only if their application can benefit from true asynchronous progress
and are willing to pay the extra latency for it (see the LAM FAQ for
more details). This has nothing to do with your lamboot problem,
though.

- For your production code runs, if you are using the lamd rpi for
asynchronous progress, I would suggest not using "-d" as an option to
lamboot. It will significantly slow down the lamd (because it's
maintaining all kinds of debugging information), and therefore slow
down your MPI message passing.

- You can build LAM 7.1 with native PBS support so that it won't use
rsh or ssh to start processes -- it'll use PBS's native job-launching
capabilities. See the LAM/MPI install guide for more details. This
won't affect the correctness of your runs, but lamboot/recon/etc. will
run slightly faster, and PBS gets better control over your parallel
jobs (i.e., it can kill them if it decides that your entire job needs
to be killed). You also won't need to use $MACHINEFILE -- lamboot will
automatically get the list of nodes to boot from PBS itself on a
per-job basis.

- I'm assuming that the shell code you have shown below is what is
running in the loop that you mentioned (I'm further assuming that the
loop is in a single PBS job). Is there a reason you're running lamboot
each time through the loop? You could just run lamboot once, and then
loop over calling mpirun as many times as you want, and then run
lamhalt (this is why LAM separates the "boot" phase from the "run"
phase -- so that you only have to pay the "boot" price once). If
you're not entirely sure that your applications will clean up nicely,
you can use the "lamclean" command to clean out the current LAM
universe. Something like this (pseudocode):

        lamboot ...
        while (...) do
                mpirun ...
                lamclean
        done
        lamhalt
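As a more concrete (and hypothetical) shell sketch of the same pattern -- the executable name, the input-file arguments, and $NPROCS are placeholders taken from your script, not anything LAM mandates:

```shell
# Sketch of the boot-once pattern: boot the LAM universe one time,
# run mpirun/lamclean once per input file, and tear down once at the end.
# Usage: run_campaign <machinefile> <input-file>...
run_campaign() {
    machinefile=$1
    shift
    lamboot -s "$machinefile" || return 1   # pay the "boot" price only once
    for input in "$@"; do
        mpirun -O -ssi rpi lamd -np "$NPROCS" ./executable "$input"
        lamclean                            # sweep leftovers before the next run
    done
    lamhalt                                 # tear down the universe once
}
```

Because the LAM universe persists across iterations, each pass through the loop only pays the (cheap) mpirun startup cost, and the lamhalt timing issue described below can only bite once, at the very end of the job.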

As for your specific problems, I'm guessing that you are falling
victim to a subtlety of the lamhalt command. What is not generally
recognized is that lamhalt actually takes 1-2 seconds to finish its
work. That is, it launches a "please kill me" process and then
returns; that process actually kills the LAM universe 1-2 seconds
later. So it's quite possible for lamhalt to return while your LAM
universe is still running.

In this case, I'm guessing that you run lamhalt and then more-or-less
immediately run lamboot again. Part way through this lamboot, the
lamhalt "please kill me" process kicks in and kills your newly-launched
lamd, which causes the error that you're seeing (lamboot is unable to
contact one of the newly-launched lamds). There are several ways to
fix this problem:

1. Use lamwipe instead of lamhalt. Not really recommended, but it is
an option, and it doesn't have the delay issue discussed above.

2. Put a "sleep 5" in your script between the lamhalt and the next
iteration of the loop. This should be far more than enough time to
ensure that all the "please kill me" processes are done and gone.

3. If possible, as mentioned above, don't run lamboot/lamhalt more than
once. Instead, simply run mpirun (and possibly lamclean) as many times
as you need, and only run lamboot/lamhalt once.
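If you do keep the lamboot/lamhalt pair inside the loop, option 2 can be generalized into a small retry wrapper -- a hypothetical helper, not a LAM utility -- that pauses between attempts so a lingering "please kill me" process from the previous lamhalt has time to finish before lamboot is retried:

```shell
# retry <max-attempts> <pause-seconds> <command> [args...]
# Run the command until it succeeds, sleeping between failed attempts.
# e.g. "retry 3 5 lamboot -b -s $MACHINEFILE" rides out the 1-2 second
# window in which a leftover lamhalt cleanup process can kill a new lamd.
retry() {
    max=$1
    pause=$2
    shift 2
    attempt=1
    while ! "$@"; do
        if [ "$attempt" -ge "$max" ]; then
            return 1                # give up after max attempts
        fi
        attempt=$((attempt + 1))
        sleep "$pause"
    done
    return 0
}
```

A plain "sleep 5" after lamhalt is simpler and sufficient; the wrapper just makes the script robust even if the cleanup takes longer than expected.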

On May 31, 2005, at 6:21 PM, Lars Grabow wrote:

> Hello,
>  
> I’ve been using LAM 7.0.3 and now 7.1.1 quite successfully for the
> past year.  However, there is one random problem that frequently
> occurs and I have no clue what I can do about it.  We are running jobs
> on a 48 node Linux RedHat 9.0 cluster.  There is no firewall between
> the nodes, passwordless logins work fine, etc.  We use PBS Pro/Maui as
> the scheduling system. For certain kinds of calculations, we
> repeatedly launch the same parallel executable with different input
> files within one job.  That usually works fine, but in some cases
> lamboot fails. The sequence of LAM commands between the individual runs
> is:
>  
> export MPIHOME=/usr/local/lam-7.1.1
> $MPIHOME/bin/recon $MACHINEFILE
> $MPIHOME/bin/lamboot -b -d -s $MACHINEFILE
> $MPIHOME/bin/mpirun -O -ssi rpi lamd -np ${NPROCS} executable
> <arguments>
> $MPIHOME/bin/lamclean
> $MPIHOME/bin/lamhalt
>  
> As an example I have the full debug output of one job below. It
> started the process one time and finished fine. Then, when it starts
> with the second loop, lamboot fails.  If it fails, it always fails to
> connect to the master node (origin = star25 in the example below). 
> There is no option to change our script to make independent jobs since
> the input files must be created on the fly.  Does anyone have an idea
> what’s going wrong here? I appreciate any ideas!
>  
> Thank you very much,
>  
> Lars
> -----------------------------------------------------------------------
> ------
> Woo hoo!
>  
> recon has completed successfully.  This means that you will most likely
> be able to boot LAM successfully with the "lamboot" command (but this
> is not a guarantee).  See the lamboot(1) manual page for more
> information on the lamboot command.
>  
> If you have problems booting LAM (with lamboot) even though recon
> worked successfully, enable the "-d" option to lamboot to examine each
> step of lamboot and see what fails.  Most situations where recon
> succeeds and lamboot fails have to do with the hboot(1) command (that
> lamboot invokes on each host in the hostfile).
> -----------------------------------------------------------------------
> ------
> n-1<9814> ssi:boot:open: opening
> n-1<9814> ssi:boot:open: opening boot module globus
> n-1<9814> ssi:boot:open: opened boot module globus
> n-1<9814> ssi:boot:open: opening boot module rsh
> n-1<9814> ssi:boot:open: opened boot module rsh
> n-1<9814> ssi:boot:open: opening boot module slurm
> n-1<9814> ssi:boot:open: opened boot module slurm
> n-1<9814> ssi:boot:select: initializing boot module globus
> n-1<9814> ssi:boot:globus: globus-job-run not found, globus boot will
> not run
> n-1<9814> ssi:boot:select: boot module not available: globus
> n-1<9814> ssi:boot:select: initializing boot module rsh
> n-1<9814> ssi:boot:rsh: module initializing
> n-1<9814> ssi:boot:rsh:agent: ssh -x
> n-1<9814> ssi:boot:rsh:username: <same>
> n-1<9814> ssi:boot:rsh:verbose: 1000
> n-1<9814> ssi:boot:rsh:algorithm: linear
> n-1<9814> ssi:boot:rsh:no_n: 0
> n-1<9814> ssi:boot:rsh:no_profile: 0
> n-1<9814> ssi:boot:rsh:fast: 0
> n-1<9814> ssi:boot:rsh:ignore_stderr: 0
> n-1<9814> ssi:boot:rsh:priority: 10
> n-1<9814> ssi:boot:select: boot module available: rsh, priority: 10
> n-1<9814> ssi:boot:select: initializing boot module slurm
> n-1<9814> ssi:boot:slurm: not running under SLURM
> n-1<9814> ssi:boot:select: boot module not available: slurm
> n-1<9814> ssi:boot:select: finalizing boot module globus
> n-1<9814> ssi:boot:globus: finalizing
> n-1<9814> ssi:boot:select: closing boot module globus
> n-1<9814> ssi:boot:select: finalizing boot module slurm
> n-1<9814> ssi:boot:slurm: finalizing
> n-1<9814> ssi:boot:select: closing boot module slurm
> n-1<9814> ssi:boot:select: selected boot module rsh
> n-1<9814> ssi:boot:base: looking for boot schema in following
> directories:
> n-1<9814> ssi:boot:base:   <current directory>
> n-1<9814> ssi:boot:base:   $TROLLIUSHOME/etc
> n-1<9814> ssi:boot:base:   $LAMHOME/etc
> n-1<9814> ssi:boot:base:   /usr/local/lam-7.1.1//etc
> n-1<9814> ssi:boot:base: looking for boot schema file:
> n-1<9814> ssi:boot:base:   /var/spool/PBS/aux/9286.polaris.che.wisc.edu
> n-1<9814> ssi:boot:base: found boot schema:
> /var/spool/PBS/aux/9286.polaris.che.wisc.edu
> n-1<9814> ssi:boot:rsh: found the following hosts:
> n-1<9814> ssi:boot:rsh:   n0 star25 (cpu=1)
> n-1<9814> ssi:boot:rsh:   n1 star24 (cpu=1)
> n-1<9814> ssi:boot:rsh:   n2 star06 (cpu=1)
> n-1<9814> ssi:boot:rsh: resolved hosts:
> n-1<9814> ssi:boot:rsh:   n0 star25 --> 11.0.0.25 (origin)
> n-1<9814> ssi:boot:rsh:   n1 star24 --> 11.0.0.24
> n-1<9814> ssi:boot:rsh:   n2 star06 --> 11.0.0.6
> n-1<9814> ssi:boot:rsh: starting RTE procs
> n-1<9814> ssi:boot:base:linear: starting
> n-1<9814> ssi:boot:base:server: opening server TCP socket
> n-1<9814> ssi:boot:base:server: opened port 33789
> n-1<9814> ssi:boot:base:linear: booting n0 (star25)
> n-1<9814> ssi:boot:rsh: starting lamd on (star25)
> n-1<9814> ssi:boot:rsh: starting on n0 (star25): hboot -t -c
> lam-conf.lamd -d -sessionsuffix pbs-9286.polaris.che.wisc.edu -s -I -H
> 11.0.0.25 -P 33789 -n 0 -o 0
> n-1<9814> ssi:boot:rsh: launching locally
> n-1<9814> ssi:boot:rsh: successfully launched on n0 (star25)
> n-1<9814> ssi:boot:base:server: expecting connection from finite list
> n-1<9814> ssi:boot:base:server: got connection from 11.0.0.25
> n-1<9814> ssi:boot:base:server: this connection is expected (n0)
> n-1<9814> ssi:boot:base:server: remote lamd is at 11.0.0.25:32831
> n-1<9814> ssi:boot:base:linear: booting n1 (star24)
> n-1<9814> ssi:boot:rsh: starting lamd on (star24)
> n-1<9814> ssi:boot:rsh: starting on n1 (star24): hboot -t -c
> lam-conf.lamd -d -sessionsuffix pbs-9286.polaris.che.wisc.edu -s -I
> "-H
>  11.0.0.25 -P 33789 -n 1 -o 0"
> n-1<9814> ssi:boot:rsh: launching remotely
> n-1<9814> ssi:boot:rsh: attempting to execute: ssh -x star24 -n 'echo
> $SHELL'
> n-1<9814> ssi:boot:rsh: remote shell /bin/tcsh
> n-1<9814> ssi:boot:rsh: attempting to execute: ssh -x star24 -n hboot
> -t -c lam-conf.lamd -d -sessionsuffix pbs-9286.polaris.che.wis
> c.edu -s -I '"-H 11.0.0.25 -P 33789 -n 1 -o 0"'
> n-1<9814> ssi:boot:rsh: successfully launched on n1 (star24)
> n-1<9814> ssi:boot:base:server: expecting connection from finite list
> n-1<9814> ssi:boot:base:server: got connection from 11.0.0.24
> n-1<9814> ssi:boot:base:server: this connection is expected (n1)
> n-1<9814> ssi:boot:base:server: remote lamd is at 11.0.0.24:32793
> n-1<9814> ssi:boot:base:linear: booting n2 (star06)
> n-1<9814> ssi:boot:rsh: starting lamd on (star06)
> n-1<9814> ssi:boot:rsh: starting on n2 (star06): hboot -t -c
> lam-conf.lamd -d -sessionsuffix pbs-9286.polaris.che.wisc.edu -s -I
> "-H
>  11.0.0.25 -P 33789 -n 2 -o 0"
> n-1<9814> ssi:boot:rsh: launching remotely
> n-1<9814> ssi:boot:rsh: attempting to execute: ssh -x star06 -n 'echo
> $SHELL'
> n-1<9814> ssi:boot:rsh: remote shell /bin/tcsh
> n-1<9814> ssi:boot:rsh: attempting to execute: ssh -x star06 -n hboot
> -t -c lam-conf.lamd -d -sessionsuffix pbs-9286.polaris.che.wis
> c.edu -s -I '"-H 11.0.0.25 -P 33789 -n 2 -o 0"'
> n-1<9814> ssi:boot:rsh: successfully launched on n2 (star06)
> n-1<9814> ssi:boot:base:server: expecting connection from finite list
> n-1<9814> ssi:boot:base:server: got connection from 11.0.0.6
> n-1<9814> ssi:boot:base:server: this connection is expected (n2)
> n-1<9814> ssi:boot:base:server: remote lamd is at 11.0.0.6:32828
> n-1<9814> ssi:boot:base:server: closing server socket
> n-1<9814> ssi:boot:base:server: connecting to lamd at 11.0.0.25:33790
> n-1<9814> ssi:boot:base:server: connected
> n-1<9814> ssi:boot:base:server: sending number of links (3)
> n-1<9814> ssi:boot:base:server: sending info: n0 (star25)
> n-1<9814> ssi:boot:base:server: sending info: n1 (star24)
> n-1<9814> ssi:boot:base:server: sending info: n2 (star06)
> n-1<9814> ssi:boot:base:server: finished sending
> n-1<9814> ssi:boot:base:server: disconnected from 11.0.0.25:33790
> n-1<9814> ssi:boot:base:server: connecting to lamd at 11.0.0.24:32807
> n-1<9814> ssi:boot:base:server: connected
> n-1<9814> ssi:boot:base:server: sending number of links (3)
> n-1<9814> ssi:boot:base:server: sending info: n0 (star25)
> n-1<9814> ssi:boot:base:server: sending info: n1 (star24)
> n-1<9814> ssi:boot:base:server: sending info: n2 (star06)
> n-1<9814> ssi:boot:base:server: finished sending
> n-1<9814> ssi:boot:base:server: disconnected from 11.0.0.24:32807
> n-1<9814> ssi:boot:base:server: connecting to lamd at 11.0.0.6:32870
> n-1<9814> ssi:boot:base:server: connected
> n-1<9814> ssi:boot:base:server: sending number of links (3)
> n-1<9814> ssi:boot:base:server: sending info: n0 (star25)
> n-1<9814> ssi:boot:base:server: sending info: n1 (star24)
> n-1<9814> ssi:boot:base:server: sending info: n2 (star06)
> n-1<9814> ssi:boot:base:server: finished sending
> n-1<9814> ssi:boot:base:server: disconnected from 11.0.0.6:32870
> n-1<9814> ssi:boot:base:linear: finished
> n-1<9814> ssi:boot:rsh: all RTE procs started
> n-1<9814> ssi:boot:rsh: finalizing
> n-1<9814> ssi:boot: Closing
> -----------------------------------------------------------------------
> ------
> Woo hoo!
>  
> recon has completed successfully.  This means that you will most likely
> be able to boot LAM successfully with the "lamboot" command (but this
> is not a guarantee).  See the lamboot(1) manual page for more
> information on the lamboot command.
>  
> If you have problems booting LAM (with lamboot) even though recon
> worked successfully, enable the "-d" option to lamboot to examine each
> step of lamboot and see what fails.  Most situations where recon
> succeeds and lamboot fails have to do with the hboot(1) command (that
> lamboot invokes on each host in the hostfile).
> -----------------------------------------------------------------------
> ------
> n-1<9873> ssi:boot:open: opening
> n-1<9873> ssi:boot:open: opening boot module globus
> n-1<9873> ssi:boot:open: opened boot module globus
> n-1<9873> ssi:boot:open: opening boot module rsh
> n-1<9873> ssi:boot:open: opened boot module rsh
> n-1<9873> ssi:boot:open: opening boot module slurm
> n-1<9873> ssi:boot:open: opened boot module slurm
> n-1<9873> ssi:boot:select: initializing boot module globus
> n-1<9873> ssi:boot:globus: globus-job-run not found, globus boot will
> not run
> n-1<9873> ssi:boot:select: boot module not available: globus
> n-1<9873> ssi:boot:select: initializing boot module rsh
> n-1<9873> ssi:boot:rsh: module initializing
> n-1<9873> ssi:boot:rsh:agent: ssh -x
> n-1<9873> ssi:boot:rsh:username: <same>
> n-1<9873> ssi:boot:rsh:verbose: 1000
> n-1<9873> ssi:boot:rsh:algorithm: linear
> n-1<9873> ssi:boot:rsh:no_n: 0
> n-1<9873> ssi:boot:rsh:no_profile: 0
> n-1<9873> ssi:boot:rsh:fast: 0
> n-1<9873> ssi:boot:rsh:ignore_stderr: 0
> n-1<9873> ssi:boot:rsh:priority: 10
> n-1<9873> ssi:boot:select: boot module available: rsh, priority: 10
> n-1<9873> ssi:boot:select: initializing boot module slurm
> n-1<9873> ssi:boot:slurm: not running under SLURM
> n-1<9873> ssi:boot:select: boot module not available: slurm
> n-1<9873> ssi:boot:select: finalizing boot module globus
> n-1<9873> ssi:boot:globus: finalizing
> n-1<9873> ssi:boot:select: closing boot module globus
> n-1<9873> ssi:boot:select: finalizing boot module slurm
> n-1<9873> ssi:boot:slurm: finalizing
> n-1<9873> ssi:boot:select: closing boot module slurm
> n-1<9873> ssi:boot:select: selected boot module rsh
> n-1<9873> ssi:boot:base: looking for boot schema in following
> directories:
> n-1<9873> ssi:boot:base:   <current directory>
> n-1<9873> ssi:boot:base:   $TROLLIUSHOME/etc
> n-1<9873> ssi:boot:base:   $LAMHOME/etc
> n-1<9873> ssi:boot:base:   /usr/local/lam-7.1.1//etc
> n-1<9873> ssi:boot:base: looking for boot schema file:
> n-1<9873> ssi:boot:base:   /var/spool/PBS/aux/9286.polaris.che.wisc.edu
> n-1<9873> ssi:boot:base: found boot schema:
> /var/spool/PBS/aux/9286.polaris.che.wisc.edu
> n-1<9873> ssi:boot:rsh: found the following hosts:
> n-1<9873> ssi:boot:rsh:   n0 star25 (cpu=1)
> n-1<9873> ssi:boot:rsh:   n1 star24 (cpu=1)
> n-1<9873> ssi:boot:rsh:   n2 star06 (cpu=1)
> n-1<9873> ssi:boot:rsh: resolved hosts:
> n-1<9873> ssi:boot:rsh:   n0 star25 --> 11.0.0.25 (origin)
> n-1<9873> ssi:boot:rsh:   n1 star24 --> 11.0.0.24
> n-1<9873> ssi:boot:rsh:   n2 star06 --> 11.0.0.6
> n-1<9873> ssi:boot:rsh: starting RTE procs
> n-1<9873> ssi:boot:base:linear: starting
> n-1<9873> ssi:boot:base:server: opening server TCP socket
> n-1<9873> ssi:boot:base:server: opened port 33806
> n-1<9873> ssi:boot:base:linear: booting n0 (star25)
> n-1<9873> ssi:boot:rsh: starting lamd on (star25)
> n-1<9873> ssi:boot:rsh: starting on n0 (star25): hboot -t -c
> lam-conf.lamd -d -sessionsuffix pbs-9286.polaris.che.wisc.edu -s -I -H
> 11.0.0.25 -P 33806 -n 0 -o 0
> n-1<9873> ssi:boot:rsh: launching locally
> n-1<9873> ssi:boot:rsh: successfully launched on n0 (star25)
> n-1<9873> ssi:boot:base:server: expecting connection from finite list
> n-1<9873> ssi:boot:base:server: got connection from 11.0.0.25
> n-1<9873> ssi:boot:base:server: this connection is expected (n0)
> n-1<9873> ssi:boot:base:server: remote lamd is at 11.0.0.25:32833
> n-1<9873> ssi:boot:base:linear: booting n1 (star24)
> n-1<9873> ssi:boot:rsh: starting lamd on (star24)
> n-1<9873> ssi:boot:rsh: starting on n1 (star24): hboot -t -c
> lam-conf.lamd -d -sessionsuffix pbs-9286.polaris.che.wisc.edu -s -I
> "-H
>  11.0.0.25 -P 33806 -n 1 -o 0"
> n-1<9873> ssi:boot:rsh: launching remotely
> n-1<9873> ssi:boot:rsh: attempting to execute: ssh -x star24 -n 'echo
> $SHELL'
> n-1<9873> ssi:boot:rsh: remote shell /bin/tcsh
> n-1<9873> ssi:boot:rsh: attempting to execute: ssh -x star24 -n hboot
> -t -c lam-conf.lamd -d -sessionsuffix pbs-9286.polaris.che.wis
> c.edu -s -I '"-H 11.0.0.25 -P 33806 -n 1 -o 0"'
> n-1<9873> ssi:boot:rsh: successfully launched on n1 (star24)
> n-1<9873> ssi:boot:base:server: expecting connection from finite list
> n-1<9873> ssi:boot:base:server: got connection from 11.0.0.24
> n-1<9873> ssi:boot:base:server: this connection is expected (n1)
> n-1<9873> ssi:boot:base:server: remote lamd is at 11.0.0.24:32794
> n-1<9873> ssi:boot:base:linear: booting n2 (star06)
> n-1<9873> ssi:boot:rsh: starting lamd on (star06)
> n-1<9873> ssi:boot:rsh: starting on n2 (star06): hboot -t -c
> lam-conf.lamd -d -sessionsuffix pbs-9286.polaris.che.wisc.edu -s -I
> "-H
>  11.0.0.25 -P 33806 -n 2 -o 0"
> n-1<9873> ssi:boot:rsh: launching remotely
> n-1<9873> ssi:boot:rsh: attempting to execute: ssh -x star06 -n 'echo
> $SHELL'
> n-1<9873> ssi:boot:rsh: remote shell /bin/tcsh
> n-1<9873> ssi:boot:rsh: attempting to execute: ssh -x star06 -n hboot
> -t -c lam-conf.lamd -d -sessionsuffix pbs-9286.polaris.che.wis
> c.edu -s -I '"-H 11.0.0.25 -P 33806 -n 2 -o 0"'
> n-1<9873> ssi:boot:rsh: successfully launched on n2 (star06)
> n-1<9873> ssi:boot:base:server: expecting connection from finite list
> n-1<9873> ssi:boot:base:server: got connection from 11.0.0.6
> n-1<9873> ssi:boot:base:server: this connection is expected (n2)
> n-1<9873> ssi:boot:base:server: remote lamd is at 11.0.0.6:32829
> n-1<9873> ssi:boot:base:server: closing server socket
> n-1<9873> ssi:boot:base:server: connecting to lamd at 11.0.0.25:33807
> -----------------------------------------------------------------------
> ------
> The lamboot agent failed to open a client socket to the newly-booted
> process at IP address 11.0.0.25, port 33807. 
>  
> *** PLEASE READ THIS ENTIRE MESSAGE, FOLLOW ITS SUGGESTIONS, AND
> *** CONSULT THE "BOOTING LAM" SECTION OF THE LAM/MPI FAQ
> *** (http://www.lam-mpi.org/faq/) BEFORE POSTING TO THE LAM/MPI USER'S
> *** MAILING LIST.
>  
> Although the newly-booted process has already communicated
> successfully with the lamboot agent over other TCP sockets, this is
> the first time that the lamboot agent tried to initiate a connection
> to the newly-booted process.  As such, this may indicate:
>  
>         1. 11.0.0.25 is not the correct IP address for the machine where
>            the newly-booted machine was launched
>         2. There are network filters between the lamboot agent host and
>            the remote host such that communication on random TCP ports
>            is blocked
>         3. Network routing from the the local host to the remote isn't
>            properly configured (this is unlikely)
>  
> For number 1, check to ensure that 11.0.0.25 is the correct IP address
> for that machine.  If it is not, check the host mapping on that machine
> (e.g., /etc/hosts) to ensure that 11.0.0.25 is both reachable by the
> host where the lamboot agent is running, and is the correct host.
>  
> For numbers 2 and 3, try to telnet to 11.0.0.25, port 33807.  You
> should get a
> "connection refused" error, which will indicate that you successfully
> connected to some machine at that IP address, and no process was
> listening on that port.  If you get any other kind of error, check
> with your system/network administrator -- it may indicate network /
> routing issues between the two hosts.
> -----------------------------------------------------------------------
> ------
> n-1<9873> ssi:boot:base:server: connecting to lamd at 11.0.0.24:32809
> n-1<9873> ssi:boot:base:server: connected
> n-1<9873> ssi:boot:base:server: sending number of links (3)
> n-1<9873> ssi:boot:base:server: sending info: n0 (star25)
> n-1<9873> ssi:boot:base:server: sending info: n1 (star24)
> n-1<9873> ssi:boot:base:server: sending info: n2 (star06)
> n-1<9873> ssi:boot:base:server: finished sending
> n-1<9873> ssi:boot:base:server: disconnected from 11.0.0.24:32809
> n-1<9873> ssi:boot:base:server: connecting to lamd at 11.0.0.6:32872
> n-1<9873> ssi:boot:base:server: connected
> n-1<9873> ssi:boot:base:server: sending number of links (3)
> n-1<9873> ssi:boot:base:server: sending info: n0 (star25)
> n-1<9873> ssi:boot:base:server: sending info: n1 (star24)
> n-1<9873> ssi:boot:base:server: sending info: n2 (star06)
> n-1<9873> ssi:boot:base:server: finished sending
> n-1<9873> ssi:boot:base:server: disconnected from 11.0.0.6:32872
> n-1<9873> ssi:boot:base:linear: aborted!
> -----------------------------------------------------------------------
> ------
> Synopsis:       lamwipe [-d] [-h] [-H] [-v] [-V] [-nn] [-np]
>                         [-prefix </lam/install/path/>] [-w <#>] [<bhost>]
>  
> Description:    This command has been obsoleted by the "lamhalt" command.
>                 You should be using that instead.  However, "lamwipe" can
>                 still be used to shut down a LAM universe.
>  
> Options:
>         -b      Use the faster lamwipe algorithm; will only work if shell
>                 on all remote nodes is same as shell on local node
>         -d      Print debugging message (implies -v)
>         -h      Print this message
>         -H      Don't print the header
>         -nn     Don't add "-n" to the remote agent command line
>         -np     Do not force the execution of $HOME/.profile on remote
>                 hosts
>         -prefix Use the LAM installation in <lam/install/path/>
>         -v      Be verbose
>         -V      Print version and exit without shutting down LAM
>         -w <#>  Lamwipe the first <#> nodes
>         <bhost> Use <bhost> as the boot schema
> -----------------------------------------------------------------------
> ------
> lamboot did NOT complete successfully
> -----------------------------------------------------------------------
> ------
> It seems that there is no lamd running on the host
> star25.galaxy.che.wisc.edu.
>  
> This indicates that the LAM/MPI runtime environment is not operating.
> The LAM/MPI runtime environment is necessary for the "mpirun" command.
>  
> Please run the "lamboot" command the start the LAM/MPI runtime
> environment.  See the LAM/MPI documentation for how to invoke
> "lamboot" across multiple machines.
> -----------------------------------------------------------------------
> ------
> -----------------------------------------------------------------------
> ------
> It seems that there is no lamd running on the host
> star25.galaxy.che.wisc.edu.
>  
> This indicates that the LAM/MPI runtime environment is not operating.
> The LAM/MPI runtime environment is necessary for the "lamclean"
> command.
>  
> Please run the "lamboot" command the start the LAM/MPI runtime
> environment.  See the LAM/MPI documentation for how to invoke
> "lamboot" across multiple machines.
> -----------------------------------------------------------------------
> ------
> -----------------------------------------------------------------------
> ------
> It seems that there is no lamd running on the host
> star25.galaxy.che.wisc.edu.
>  
> This indicates that the LAM/MPI runtime environment is not operating.
> The LAM/MPI runtime environment is necessary for the "lamhalt" command.
>  
> Please run the "lamboot" command the start the LAM/MPI runtime
> environment.  See the LAM/MPI documentation for how to invoke
> "lamboot" across multiple machines.
> -----------------------------------------------------------------------
> ------
> _______________________________________________
> This list is archived at http://www.lam-mpi.org/MailArchives/lam/

-- 
{+} Jeff Squyres
{+} jsquyres_at_[hidden]
{+} http://www.lam-mpi.org/