
LAM/MPI General User's Mailing List Archives


From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2005-02-25 08:07:12


Silly question -- is LAM installed in /usr/local on all of your PBS
nodes?
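
One way to answer that question from inside a job is sketched below. This is an illustration, not a LAM tool: the function name `check_nodes` is made up, the binary path `/usr/local/bin/recon` is taken from the paths used elsewhere in this thread, and the default runner `ssh -x` assumes password-less ssh to every node. Under PBS, `$PBS_NODEFILE` names a file listing the allocated hosts.

```shell
#!/bin/bash
# Sketch (hypothetical helper): report whether a LAM binary exists on each node.
# Usage: check_nodes NODEFILE [BINARY] [RUNNER]
#   NODEFILE  one hostname per line (under PBS, typically "$PBS_NODEFILE")
#   BINARY    path to test; /usr/local/bin/recon is an assumption from this thread
#   RUNNER    command that runs a command on a host; "ssh -x" is an assumption
check_nodes() {
    local nodefile="$1"
    local binary="${2:-/usr/local/bin/recon}"
    local runner="${3:-ssh -x}"
    local rc=0 host

    # sort -u: a node file may list the same host once per allocated CPU
    for host in $(sort -u "$nodefile"); do
        # </dev/null keeps the remote command from consuming our stdin
        if $runner "$host" test -x "$binary" </dev/null; then
            echo "$host: ok"
        else
            echo "$host: MISSING $binary"
            rc=1
        fi
    done
    return $rc
}
```

Inside a PBS job script this could be called as `check_nodes "$PBS_NODEFILE"` before running recon; a non-zero return means at least one node lacks the binary at that path.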

On Feb 24, 2005, at 7:05 PM, Michael Wheatley wrote:

> I just tried that; sadly, the same thing happened.  One node works, but
> with two nodes recon reports a failure.
>
> Mike
>
>
> At 08:56 24/02/2005 -0500, you wrote:
>
> Try removing the LAMHOME setting -- it's rarely ever necessary to set
> LAMHOME (and it's not a path-like environment variable -- it's meant to
> be a single directory).
>
>
> On Feb 24, 2005, at 8:47 AM, Michael Wheatley wrote:
>
>
> At 02:10 PM 24/02/2005, you wrote:
>
>
> Is /usr/local/bin in the path where your script is running?  I ask
> because you invoked everything by "/usr/local/bin/foo" rather than
> "foo".
>
> When using the tm boot SSI module, the local environment is directly
> copied out to the nodes (i.e., your shell startup files are not
> involved).  So whatever is in your $path where the shell script runs
> on the mother superior should be in the $path on all the nodes.
>
> Thanks, that solved problem a).  Now the next one has cropped up.
>
> When I run the script below on one node (qsub -l nodes=1), recon gives
> me a woo hoo, but on two nodes I get no woo ;(  I want woo!!  The script
> and the failure report from script.exx are below.
>
> Cheers
>
> Mike
>
>
>
> Script
> #! /bin/bash
> PATH="/usr/local/bin:$PATH";export PATH
> LAMHOME="/usr/local/bin:$LAMHOME";export LAMHOME
> recon -v -d -ssi boot tm
>
> script.e52
> n-1<2991> ssi:boot:open: opening
> n-1<2991> ssi:boot:open: opening boot module globus
> n-1<2991> ssi:boot:open: opened boot module globus
> n-1<2991> ssi:boot:open: opening boot module rsh
> n-1<2991> ssi:boot:open: opened boot module rsh
> n-1<2991> ssi:boot:open: opening boot module slurm
> n-1<2991> ssi:boot:open: opened boot module slurm
> n-1<2991> ssi:boot:open: opening boot module tm
> n-1<2991> ssi:boot:open: opened boot module tm
> n-1<2991> ssi:boot:select: initializing boot module tm
> n-1<2991> ssi:boot:tm: module initializing
> n-1<2991> ssi:boot:tm:verbose: 1000
> n-1<2991> ssi:boot:tm:priority: 75
> n-1<2991> ssi:boot:select: boot module available: tm, priority: 75
> n-1<2991> ssi:boot:select: initializing boot module slurm
> n-1<2991> ssi:boot:slurm: not running under SLURM
> n-1<2991> ssi:boot:select: boot module not available: slurm
> n-1<2991> ssi:boot:select: initializing boot module rsh
> n-1<2991> ssi:boot:rsh: module initializing
> n-1<2991> ssi:boot:rsh:agent: ssh
> n-1<2991> ssi:boot:rsh:username: <same>
> n-1<2991> ssi:boot:rsh:verbose: 1000
> n-1<2991> ssi:boot:rsh:algorithm: linear
> n-1<2991> ssi:boot:rsh:no_n: 0
> n-1<2991> ssi:boot:rsh:no_profile: 0
> n-1<2991> ssi:boot:rsh:fast: 0
> n-1<2991> ssi:boot:rsh:ignore_stderr: 0
> n-1<2991> ssi:boot:rsh:priority: 10
> n-1<2991> ssi:boot:select: boot module available: rsh, priority: 10
> n-1<2991> ssi:boot:select: initializing boot module globus
> n-1<2991> ssi:boot:globus: globus-job-run not found, globus boot will not run
> n-1<2991> ssi:boot:select: boot module not available: globus
> n-1<2991> ssi:boot:select: finalizing boot module slurm
> n-1<2991> ssi:boot:slurm: finalizing
> n-1<2991> ssi:boot:select: closing boot module slurm
> n-1<2991> ssi:boot:select: finalizing boot module rsh
> n-1<2991> ssi:boot:rsh: finalizing
> n-1<2991> ssi:boot:select: closing boot module rsh
> n-1<2991> ssi:boot:select: finalizing boot module globus
> n-1<2991> ssi:boot:globus: finalizing
> n-1<2991> ssi:boot:select: closing boot module globus
> n-1<2991> ssi:boot:select: selected boot module tm
> n-1<2991> ssi:boot:tm: found the following 2 hosts:
> n-1<2991> ssi:boot:tm:   n0 ws03.ceic.local (cpu=1)
> n-1<2991> ssi:boot:tm:   n1 WS02.ceic.local (cpu=1)
> n-1<2991> ssi:boot:tm: starting RTE procs
> n-1<2991> ssi:boot:base:linear_windowed: starting
> n-1<2991> ssi:boot:base:linear_windowed: no startup protocol
> n-1<2991> ssi:boot:base:linear_windowed: invoking linear
> n-1<2991> ssi:boot:base:linear: starting
> n-1<2991> ssi:boot:base:linear: booting n0 (ws03.ceic.local)
> n-1<2991> ssi:boot:tm: starting recon on (ws03.ceic.local)
> n-1<2991> ssi:boot:tm: starting on n0 (ws03.ceic.local): /usr/local/bin/tkill -N
> n-1<2991> ssi:boot:tm: successfully launched on n0 (ws03.ceic.local)
> n-1<2991> ssi:boot:tm: waiting for completion on n0 (ws03.ceic.local)
> n-1<2991> ssi:boot:base:linear: Failed to boot n0 (ws03.ceic.local)
> n-1<2991> ssi:boot:base:linear: aborted!
>
> -----------------------------------------------------------------------------
> recon was not able to complete successfully.  There can be any number
> of problems that did not allow recon to work properly.  You should use
> the "-d" option to recon to get more information about each step that
> recon attempts.
>
> Any error message above may present a more detailed description of the
> actual problem.
>
> Here is general a list of prerequisites that *must* be fulfilled
> before recon can work:
>
>         - Each machine in the hostfile must be reachable and operational.
>         - You must have an account on each machine.
>         - You must be able to rsh(1) to the machine (permissions
>           are typically set in the user's $HOME/.rhosts file).
>
>         *** Sidenote: If you compiled LAM to use a remote shell program
>             other than rsh (with the --with-rsh option to ./configure;
>             e.g., ssh), or if you set the LAMRSH environment variable
>             to an alternate remote shell program, you need to ensure
>             that you can execute programs on remote nodes with no
>             password.  For example:
>
>         unix% ssh -x pinky uptime
>         3:09am up 211 day(s), 23:49, 2 users, load average: 0.01, 0.08, 0.10
>
>         - The LAM executables must be locatable on each machine, using
>           the shell's search path and possibly the LAMHOME environment
>           variable.
>         - The shell's start-up script must not print anything on standard
>           error.  You can take advantage of the fact that rsh(1) will
>           start the shell non-interactively.  The start-up script (such
>           as .profile or .cshrc) can exit early in this case, before
>           executing many commands relevant only to interactive sessions
>           and likely to generate output.
>
> -----------------------------------------------------------------------------
> n-1<2991> ssi:boot:tm: finalizing
> n-1<2991> ssi:boot: Closing
>
>
>
>
> _______________________________________________
> This list is archived at http://www.lam-mpi.org/MailArchives/lam/
>
> --
> {+} Jeff Squyres
> {+} jsquyres_at_[hidden]
> {+} http://www.lam-mpi.org/
>
>
>  Michael Wheatley
>  Email: michaelw_at_[hidden]
>  Photoelectrochemical solar cells
>  School of Chemical Engineering and
>  Industrial Chemistry
>  Applied Science Building
>     _--_|\   University of New South Wales
>    /       \  UNSW SYDNEY NSW 2052
>    \_.--. _*  Australia
>         v   phone  ++61 2 9385 4296
>               fax      ++61 2 9385 5966
>
>
> _______________________________________________
> This list is archived at http://www.lam-mpi.org/MailArchives/lam/
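
Two points from the quoted exchange can be checked in the job script before recon is called: the tm boot module copies the script's own environment to the nodes, so `recon` should resolve through `PATH` there, and LAMHOME is meant to be a single directory rather than a colon-separated list. The sketch below is only an illustration of those two checks; `lam_env_ok` is a hypothetical helper name, not part of LAM.

```shell
#!/bin/bash
# Sketch (hypothetical helper): sanity-check the environment a job script
# would hand to the tm boot module. Returns 0 if both checks pass.
lam_env_ok() {
    local rc=0

    # recon should be found via PATH, not only by its absolute path
    if ! command -v recon >/dev/null 2>&1; then
        echo "recon not found in PATH" >&2
        rc=1
    fi

    # LAMHOME, if set at all, should name one directory; a ':' suggests
    # it was built like PATH (e.g. "/usr/local/bin:$LAMHOME")
    case "${LAMHOME:-}" in
        *:*)
            echo "LAMHOME contains ':' but should be a single directory" >&2
            rc=1
            ;;
    esac

    return $rc
}
```

A job script could call `lam_env_ok || exit 1` ahead of `recon -v -d -ssi boot tm`. Note that the check only covers the environment on the node where the script runs; it says nothing about whether LAM is actually installed on the other nodes.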

-- 
{+} Jeff Squyres
{+} jsquyres_at_[hidden]
{+} http://www.lam-mpi.org/