LAM/MPI General User's Mailing List Archives

From: Michael Wheatley (michaelw_at_[hidden])
Date: 2005-02-24 08:47:22


At 02:10 PM 24/02/2005, you wrote:

>Is /usr/local/bin in the path where your script is running? I ask
>because you invoked everything by "/usr/local/bin/foo" rather than
>"foo".
>
>When using the tm boot SSI module, the local environment is directly
>copied out to the nodes (i.e., your shell startup files are not
>involved). So whatever is in your $path where the shell script runs on
>the mother superior should be in the $path on all the nodes.

Thanks, that solved problem (a); now the next one has cropped up.

When I run the script below on one node (qsub -l nodes=1), recon gives me a
woo hoo, but on two nodes I get no woo ;( I want woo!! The script and the
failure report from script.e52 are below.

Cheers

Mike

Script
#! /bin/bash
PATH="/usr/local/bin:$PATH";export PATH
LAMHOME="/usr/local/bin:$LAMHOME";export LAMHOME
recon -v -d -ssi boot tm
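
A slightly more defensive variant of the script above could confirm that the
LAM binaries resolve on PATH before calling recon. This is only a sketch: the
helper name check_lam_tools and the list of tools it checks are assumptions
for illustration, not part of the original script or of LAM. It also leaves
LAMHOME unset, on the assumption that LAMHOME is normally an install prefix
(the directory that contains bin/) rather than a colon-separated list, so the
value set above may be worth double-checking.

```shell
#!/bin/bash
# Sketch only: check_lam_tools is a hypothetical helper, not part of LAM.
PATH="/usr/local/bin:$PATH"; export PATH

# Report each named tool that cannot be found on PATH.
# Returns 0 if all tools resolve, 1 if any is missing.
check_lam_tools() {
    local missing=0 tool
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "not on PATH: $tool" >&2
            missing=1
        fi
    done
    return "$missing"
}

# recon and tkill both appear in the debug output; the list is an assumption.
if check_lam_tools recon tkill; then
    recon -v -d -ssi boot tm
fi
```

Because the tm boot module copies the environment of the job script out to
the nodes, a check like this only tells you what the mother superior sees; it
does not prove the binaries exist at the same path on every node.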

script.e52
n-1<2991> ssi:boot:open: opening
n-1<2991> ssi:boot:open: opening boot module globus
n-1<2991> ssi:boot:open: opened boot module globus
n-1<2991> ssi:boot:open: opening boot module rsh
n-1<2991> ssi:boot:open: opened boot module rsh
n-1<2991> ssi:boot:open: opening boot module slurm
n-1<2991> ssi:boot:open: opened boot module slurm
n-1<2991> ssi:boot:open: opening boot module tm
n-1<2991> ssi:boot:open: opened boot module tm
n-1<2991> ssi:boot:select: initializing boot module tm
n-1<2991> ssi:boot:tm: module initializing
n-1<2991> ssi:boot:tm:verbose: 1000
n-1<2991> ssi:boot:tm:priority: 75
n-1<2991> ssi:boot:select: boot module available: tm, priority: 75
n-1<2991> ssi:boot:select: initializing boot module slurm
n-1<2991> ssi:boot:slurm: not running under SLURM
n-1<2991> ssi:boot:select: boot module not available: slurm
n-1<2991> ssi:boot:select: initializing boot module rsh
n-1<2991> ssi:boot:rsh: module initializing
n-1<2991> ssi:boot:rsh:agent: ssh
n-1<2991> ssi:boot:rsh:username: <same>
n-1<2991> ssi:boot:rsh:verbose: 1000
n-1<2991> ssi:boot:rsh:algorithm: linear
n-1<2991> ssi:boot:rsh:no_n: 0
n-1<2991> ssi:boot:rsh:no_profile: 0
n-1<2991> ssi:boot:rsh:fast: 0
n-1<2991> ssi:boot:rsh:ignore_stderr: 0
n-1<2991> ssi:boot:rsh:priority: 10
n-1<2991> ssi:boot:select: boot module available: rsh, priority: 10
n-1<2991> ssi:boot:select: initializing boot module globus
n-1<2991> ssi:boot:globus: globus-job-run not found, globus boot will not run
n-1<2991> ssi:boot:select: boot module not available: globus
n-1<2991> ssi:boot:select: finalizing boot module slurm
n-1<2991> ssi:boot:slurm: finalizing
n-1<2991> ssi:boot:select: closing boot module slurm
n-1<2991> ssi:boot:select: finalizing boot module rsh
n-1<2991> ssi:boot:rsh: finalizing
n-1<2991> ssi:boot:select: closing boot module rsh
n-1<2991> ssi:boot:select: finalizing boot module globus
n-1<2991> ssi:boot:globus: finalizing
n-1<2991> ssi:boot:select: closing boot module globus
n-1<2991> ssi:boot:select: selected boot module tm
n-1<2991> ssi:boot:tm: found the following 2 hosts:
n-1<2991> ssi:boot:tm: n0 ws03.ceic.local (cpu=1)
n-1<2991> ssi:boot:tm: n1 WS02.ceic.local (cpu=1)
n-1<2991> ssi:boot:tm: starting RTE procs
n-1<2991> ssi:boot:base:linear_windowed: starting
n-1<2991> ssi:boot:base:linear_windowed: no startup protocol
n-1<2991> ssi:boot:base:linear_windowed: invoking linear
n-1<2991> ssi:boot:base:linear: starting
n-1<2991> ssi:boot:base:linear: booting n0 (ws03.ceic.local)
n-1<2991> ssi:boot:tm: starting recon on (ws03.ceic.local)
n-1<2991> ssi:boot:tm: starting on n0 (ws03.ceic.local): /usr/local/bin/tkill -N
n-1<2991> ssi:boot:tm: successfully launched on n0 (ws03.ceic.local)
n-1<2991> ssi:boot:tm: waiting for completion on n0 (ws03.ceic.local)
n-1<2991> ssi:boot:base:linear: Failed to boot n0 (ws03.ceic.local)
n-1<2991> ssi:boot:base:linear: aborted!
-----------------------------------------------------------------------------
recon was not able to complete successfully. There can be any number
of problems that did not allow recon to work properly. You should use
the "-d" option to recon to get more information about each step that
recon attempts.

Any error message above may present a more detailed description of the
actual problem.

Here is general a list of prerequisites that *must* be fulfilled
before recon can work:

         - Each machine in the hostfile must be reachable and operational.
         - You must have an account on each machine.
         - You must be able to rsh(1) to the machine (permissions
           are typically set in the user's $HOME/.rhosts file).

         *** Sidenote: If you compiled LAM to use a remote shell program
             other than rsh (with the --with-rsh option to ./configure;
             e.g., ssh), or if you set the LAMRSH environment variable
             to an alternate remote shell program, you need to ensure
             that you can execute programs on remote nodes with no
             password. For example:

         unix% ssh -x pinky uptime
         3:09am up 211 day(s), 23:49, 2 users, load average: 0.01, 0.08, 0.10

         - The LAM executables must be locatable on each machine, using
           the shell's search path and possibly the LAMHOME environment
           variable.
         - The shell's start-up script must not print anything on standard
           error. You can take advantage of the fact that rsh(1) will
           start the shell non-interactively. The start-up script (such
           as .profile or .cshrc) can exit early in this case, before
           executing many commands relevant only to interactive sessions
           and likely to generate output.
-----------------------------------------------------------------------------
n-1<2991> ssi:boot:tm: finalizing
n-1<2991> ssi:boot: Closing
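
The debug output is long, so a small filter helps pull out the lines that
matter. A minimal sketch, assuming the output has been saved to a file such
as script.e52; the function name summarize_boot_log is made up for
illustration, and the patterns are taken only from the messages shown above.

```shell
#!/bin/bash
# Sketch only: summarize_boot_log is a hypothetical helper name.
# Reads "recon -d" output on stdin and keeps only:
#   - the line naming the selected boot module
#   - the host lines reported by the tm module (n0, n1, ...)
#   - any "Failed to boot" or "aborted!" lines
summarize_boot_log() {
    grep -E 'selected boot module|ssi:boot:tm: n[0-9]+ |Failed to boot|aborted!'
}
```

Usage would be something like "summarize_boot_log < script.e52". On the log
above it keeps the "selected boot module tm" line, the two host lines, and
the two failure lines for n0.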
