At 12:07 AM 26/02/2005, you wrote:
>Silly question -- is LAM installed in /usr/local on all of your PBS
>nodes?
Yep, /usr/local is shared out via NFS to all nodes.
One more thing, which I doubt is relevant (a thousand apologies if it is):
PBS is compiled with gcc, and LAM with icc/icpc/g77.
Mike
>On Feb 24, 2005, at 7:05 PM, Michael Wheatley wrote:
>
>> I just tried that; sadly, the same thing happened. With one node it works;
>> with two, recon reports a failure.
>>
>> Mike
>>
>>
>> At 08:56 24/02/2005 -0500, you wrote:
>>
>>Try removing the LAMHOME setting -- it's rarely ever necessary to set
>> LAMHOME (and it's not a path-like environment variable -- it's meant to be a
>> single directory).
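
For reference, a minimal sketch of what the job script quoted further down might look like with the LAMHOME line dropped, as suggested above. The /usr/local/bin location and the recon flags are taken from this thread; the `command -v` guard is an added assumption, there only so that a missing recon is reported plainly instead of failing obscurely.

```shell
#!/bin/bash
# Sketch only: the original job script minus the LAMHOME line.

# Put the LAM install directory (location taken from this thread) on PATH.
PATH="/usr/local/bin:$PATH"
export PATH

# LAMHOME is a single directory, not a path-like list, and is rarely needed,
# so make sure no stale value is inherited from the submitting shell.
unset LAMHOME

# Assumption: guard the call so a missing recon produces a clear message.
if command -v recon >/dev/null 2>&1; then
    recon -v -d -ssi boot tm
else
    echo "recon not found on PATH" >&2
fi
```

This only removes one possible source of confusion; it does not by itself explain the two-node failure.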
>>
>>
>> On Feb 24, 2005, at 8:47 AM, Michael Wheatley wrote:
>>
>>
>>At 02:10 PM 24/02/2005, you wrote:
>>
>>
>>Is /usr/local/bin in the path where your script is running? I ask
>> because you invoked everything by "/usr/local/bin/foo" rather than
>> "foo".
>>
>> When using the tm boot SSI module, the local environment is directly
>> copied out to the nodes (i.e., your shell startup files are not
>> involved). So whatever is in your $path where the shell script runs on
>> the mother superior should be in the $path on all the nodes.
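
A minimal sketch of the check implied above: since the tm boot module copies the submitting script's environment out to the nodes, it may help to confirm inside the job script that the LAM directory is on PATH before anything else runs. The /usr/local/bin location comes from this thread; the helper name `path_has_dir` and the variable `LAM_BIN` are hypothetical, introduced only for illustration.

```shell
#!/bin/bash
# Hypothetical helper: succeeds if directory $1 is an entry in the
# colon-separated list $2.
path_has_dir() {
    case ":$2:" in
        *":$1:"*) return 0 ;;
        *)        return 1 ;;
    esac
}

LAM_BIN="/usr/local/bin"   # install location mentioned in this thread

# Prepend the LAM directory only if it is missing, then export PATH so the
# environment copied out by the tm boot module includes it.
if ! path_has_dir "$LAM_BIN" "$PATH"; then
    PATH="$LAM_BIN:$PATH"
fi
export PATH

echo "PATH=$PATH"
```

Placed at the top of the job script, this prints the PATH that will be handed to the nodes, which can then be compared against what the script.eNN output reports.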
>>
>> Thanks, that solved problem (a). Now the next one has cropped up.
>>
>> When I run the script below on one node (qsub -l nodes=1), recon gives
>> me a woo hoo, but on two nodes I get no woo ;( I want woo!! The script
>> and the failure report from script.exx are below.
>>
>> Cheers
>>
>> Mike
>>
>>
>>
>> Script
>> #! /bin/bash
>> PATH="/usr/local/bin:$PATH";export PATH
>> LAMHOME="/usr/local/bin:$LAMHOME";export LAMHOME
>> recon -v -d -ssi boot tm
>>
>> script.e52
>> n-1<2991> ssi:boot:open: opening
>> n-1<2991> ssi:boot:open: opening boot module globus
>> n-1<2991> ssi:boot:open: opened boot module globus
>> n-1<2991> ssi:boot:open: opening boot module rsh
>> n-1<2991> ssi:boot:open: opened boot module rsh
>> n-1<2991> ssi:boot:open: opening boot module slurm
>> n-1<2991> ssi:boot:open: opened boot module slurm
>> n-1<2991> ssi:boot:open: opening boot module tm
>> n-1<2991> ssi:boot:open: opened boot module tm
>> n-1<2991> ssi:boot:select: initializing boot module tm
>> n-1<2991> ssi:boot:tm: module initializing
>> n-1<2991> ssi:boot:tm:verbose: 1000
>> n-1<2991> ssi:boot:tm:priority: 75
>> n-1<2991> ssi:boot:select: boot module available: tm, priority: 75
>> n-1<2991> ssi:boot:select: initializing boot module slurm
>> n-1<2991> ssi:boot:slurm: not running under SLURM
>> n-1<2991> ssi:boot:select: boot module not available: slurm
>> n-1<2991> ssi:boot:select: initializing boot module rsh
>> n-1<2991> ssi:boot:rsh: module initializing
>> n-1<2991> ssi:boot:rsh:agent: ssh
>> n-1<2991> ssi:boot:rsh:username: <same>
>> n-1<2991> ssi:boot:rsh:verbose: 1000
>> n-1<2991> ssi:boot:rsh:algorithm: linear
>> n-1<2991> ssi:boot:rsh:no_n: 0
>> n-1<2991> ssi:boot:rsh:no_profile: 0
>> n-1<2991> ssi:boot:rsh:fast: 0
>> n-1<2991> ssi:boot:rsh:ignore_stderr: 0
>> n-1<2991> ssi:boot:rsh:priority: 10
>> n-1<2991> ssi:boot:select: boot module available: rsh, priority: 10
>> n-1<2991> ssi:boot:select: initializing boot module globus
>> n-1<2991> ssi:boot:globus: globus-job-run not found, globus boot will not run
>> n-1<2991> ssi:boot:select: boot module not available: globus
>> n-1<2991> ssi:boot:select: finalizing boot module slurm
>> n-1<2991> ssi:boot:slurm: finalizing
>> n-1<2991> ssi:boot:select: closing boot module slurm
>> n-1<2991> ssi:boot:select: finalizing boot module rsh
>> n-1<2991> ssi:boot:rsh: finalizing
>> n-1<2991> ssi:boot:select: closing boot module rsh
>> n-1<2991> ssi:boot:select: finalizing boot module globus
>> n-1<2991> ssi:boot:globus: finalizing
>> n-1<2991> ssi:boot:select: closing boot module globus
>> n-1<2991> ssi:boot:select: selected boot module tm
>> n-1<2991> ssi:boot:tm: found the following 2 hosts:
>> n-1<2991> ssi:boot:tm: n0 ws03.ceic.local (cpu=1)
>> n-1<2991> ssi:boot:tm: n1 WS02.ceic.local (cpu=1)
>> n-1<2991> ssi:boot:tm: starting RTE procs
>> n-1<2991> ssi:boot:base:linear_windowed: starting
>> n-1<2991> ssi:boot:base:linear_windowed: no startup protocol
>> n-1<2991> ssi:boot:base:linear_windowed: invoking linear
>> n-1<2991> ssi:boot:base:linear: starting
>> n-1<2991> ssi:boot:base:linear: booting n0 (ws03.ceic.local)
>> n-1<2991> ssi:boot:tm: starting recon on (ws03.ceic.local)
>> n-1<2991> ssi:boot:tm: starting on n0 (ws03.ceic.local): /usr/local/bin/tkill -N
>> n-1<2991> ssi:boot:tm: successfully launched on n0 (ws03.ceic.local)
>> n-1<2991> ssi:boot:tm: waiting for completion on n0 (ws03.ceic.local)
>> n-1<2991> ssi:boot:base:linear: Failed to boot n0 (ws03.ceic.local)
>> n-1<2991> ssi:boot:base:linear: aborted!
>>
>>-----------------------------------------------------------------------------
>> recon was not able to complete successfully. There can be any number
>> of problems that did not allow recon to work properly. You should use
>> the "-d" option to recon to get more information about each step that
>> recon attempts.
>>
>> Any error message above may present a more detailed description of the
>> actual problem.
>>
>> Here is general a list of prerequisites that *must* be fulfilled
>> before recon can work:
>>
>> - Each machine in the hostfile must be reachable and
>> operational.
>> - You must have an account on each machine.
>> - You must be able to rsh(1) to the machine (permissions
>> are typically set in the user's $HOME/.rhosts file).
>>
>> *** Sidenote: If you compiled LAM to use a remote shell program
>> other than rsh (with the --with-rsh option to ./configure;
>> e.g., ssh), or if you set the LAMRSH environment variable
>> to an alternate remote shell program, you need to ensure
>> that you can execute programs on remote nodes with no
>> password. For example:
>>
>> unix% ssh -x pinky uptime
>> 3:09am up 211 day(s), 23:49, 2 users, load average: 0.01, 0.08, 0.10
>>
>> - The LAM executables must be locatable on each machine, using
>> the shell's search path and possibly the LAMHOME environment
>> variable.
>> - The shell's start-up script must not print anything on standard
>> error. You can take advantage of the fact that rsh(1) will
>> start the shell non-interactively. The start-up script (such
>> as .profile or .cshrc) can exit early in this case, before
>> executing many commands relevant only to interactive sessions
>> and likely to generate output.
>>
>>-----------------------------------------------------------------------------
>> n-1<2991> ssi:boot:tm: finalizing
>> n-1<2991> ssi:boot: Closing
>>
>>
>>
>>
>> _______________________________________________
>> This list is archived at http://www.lam-mpi.org/MailArchives/lam/
>>
>> --
>> {+} Jeff Squyres
>> {+} jsquyres_at_[hidden]
>> {+} http://www.lam-mpi.org/
>>
>>
>> Michael Wheatley
>> Email: michaelw_at_[hidden]
>> Photoelectrochemical solar cells
>> School of Chemical Engineering and
>> Industrial Chemistry
>> Applied Science Building
>> _--_|\ University of New South Wales
>> / \ UNSW SYDNEY NSW 2052
>> \_.--. _* Australia
>> v phone ++61 2 9385 4296
>> fax ++61 2 9385 5966
>>
>>
>
>--
>{+} Jeff Squyres
>{+} jsquyres_at_[hidden]
>{+} http://www.lam-mpi.org/
>
>
>
>
>