
LAM/MPI General User's Mailing List Archives


From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2005-02-23 22:10:39


Is /usr/local/bin in the $PATH where your script is running? I ask
because you invoked everything as "/usr/local/bin/foo" rather than
"foo".

When using the tm boot SSI module, the local environment is copied
directly out to the nodes (i.e., your shell startup files are not
involved). So whatever is in your $PATH where the shell script runs on
the mother superior should also be the $PATH on all the nodes.
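
If /usr/local/bin turns out to be missing from that environment, a
minimal sketch of a fix (assuming your LAM installation really is
under /usr/local, per your "which tkill" output) is to prepend it
explicitly at the top of the script, before recon and lamboot run:

    #!/bin/bash
    # Ensure LAM's bin directory is in the PATH that the tm module
    # will copy out to the nodes (/usr/local is an assumption based
    # on the "which tkill" output)
    export PATH=/usr/local/bin:$PATH
    recon -d -ssi boot tm
    lamboot -ssi boot tm
    mpirun -np $NCPU /home/code/demMP
    lamhalt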

On Feb 23, 2005, at 8:24 AM, Michael Wheatley wrote:

> Hello all,
> I have just rebuilt my LAM 7.1.1 with tm support so I can use it in a
> PBS environment. When I run my script, recon drops out and states that
> it "can't find executable for tkill", but when I run 'which tkill' it
> points to /usr/local/bin/tkill on the nodes. /usr/local is exported
> from an NFS server, which also houses the PBS server and on which PBS
> was built. Presumably I've done something foolish (again). Any help in
> solving this problem would be appreciated.
>
> Mike
> ############################################################################
> Details
> Mandrake 10
> LAM 7.1.1
> PBS: torque 1.1.0p6
> ############################################################################
> LAM info
> LAM/MPI: 7.1.1
> Prefix: /usr/local
> Architecture: i686-pc-linux-gnu
> Configured by: michaelw
> Configured on: Thu Feb 24 10:42:38 EST 2005
> Configure host: panda.ceic.local
> Memory manager: ptmalloc2
> C bindings: yes
> C++ bindings: yes
> Fortran bindings: yes
> C compiler: icc
> C++ compiler: icpc
> Fortran compiler: g77
> Fortran symbols: double_underscore
> C profiling: yes
> C++ profiling: yes
> Fortran profiling: yes
> C++ exceptions: no
> Thread support: yes
> ROMIO support: yes
> IMPI support: no
> Debug support: no
> Purify clean: no
> SSI boot: globus (API v1.1, Module v0.6)
> SSI boot: rsh (API v1.1, Module v1.1)
> SSI boot: slurm (API v1.1, Module v1.0)
> SSI boot: tm (API v1.1, Module v1.1)
> SSI coll: lam_basic (API v1.1, Module v7.1)
> SSI coll: shmem (API v1.1, Module v1.0)
> SSI coll: smp (API v1.1, Module v1.2)
> SSI rpi: crtcp (API v1.1, Module v1.1)
> SSI rpi: lamd (API v1.0, Module v7.1)
> SSI rpi: sysv (API v1.0, Module v7.1)
> SSI rpi: tcp (API v1.0, Module v7.1)
> SSI rpi: usysv (API v1.0, Module v7.1)
> SSI cr: self (API v1.0, Module v1.0)
> ****************************************************************
> My script
> #! /bin/bash
> /usr/local/bin/recon -d -ssi boot tm
> /usr/local/bin/lamboot -ssi boot tm
> /usr/local/bin/mpirun -np $NCPU /home/code/demMP
> /usr/local/bin/lamhalt
> ###################################################
> stderr output from the failed "qsub -l nodes=10 script" submission:
>
> n-1<3308> ssi:boot:open: opening
> n-1<3308> ssi:boot:open: opening boot module globus
> n-1<3308> ssi:boot:open: opened boot module globus
> n-1<3308> ssi:boot:open: opening boot module rsh
> n-1<3308> ssi:boot:open: opened boot module rsh
> n-1<3308> ssi:boot:open: opening boot module slurm
> n-1<3308> ssi:boot:open: opened boot module slurm
> n-1<3308> ssi:boot:open: opening boot module tm
> n-1<3308> ssi:boot:open: opened boot module tm
> n-1<3308> ssi:boot:select: initializing boot module tm
> n-1<3308> ssi:boot:tm: module initializing
> n-1<3308> ssi:boot:tm:verbose: 1000
> n-1<3308> ssi:boot:tm:priority: 75
> n-1<3308> ssi:boot:select: boot module available: tm, priority: 75
> n-1<3308> ssi:boot:select: initializing boot module slurm
> n-1<3308> ssi:boot:slurm: not running under SLURM
> n-1<3308> ssi:boot:select: boot module not available: slurm
> n-1<3308> ssi:boot:select: initializing boot module rsh
> n-1<3308> ssi:boot:rsh: module initializing
> n-1<3308> ssi:boot:rsh:agent: ssh
> n-1<3308> ssi:boot:rsh:username: <same>
> n-1<3308> ssi:boot:rsh:verbose: 1000
> n-1<3308> ssi:boot:rsh:algorithm: linear
> n-1<3308> ssi:boot:rsh:no_n: 0
> n-1<3308> ssi:boot:rsh:no_profile: 0
> n-1<3308> ssi:boot:rsh:fast: 0
> n-1<3308> ssi:boot:rsh:ignore_stderr: 0
> n-1<3308> ssi:boot:rsh:priority: 10
> n-1<3308> ssi:boot:select: boot module available: rsh, priority: 10
> n-1<3308> ssi:boot:select: initializing boot module globus
> n-1<3308> ssi:boot:globus: globus-job-run not found, globus boot will not run
> n-1<3308> ssi:boot:select: boot module not available: globus
> n-1<3308> ssi:boot:select: finalizing boot module slurm
> n-1<3308> ssi:boot:slurm: finalizing
> n-1<3308> ssi:boot:select: closing boot module slurm
> n-1<3308> ssi:boot:select: finalizing boot module rsh
> n-1<3308> ssi:boot:rsh: finalizing
> n-1<3308> ssi:boot:select: closing boot module rsh
> n-1<3308> ssi:boot:select: finalizing boot module globus
> n-1<3308> ssi:boot:globus: finalizing
> n-1<3308> ssi:boot:select: closing boot module globus
> n-1<3308> ssi:boot:select: selected boot module tm
> n-1<3308> ssi:boot:tm: found the following 10 hosts:
> n-1<3308> ssi:boot:tm: n0 ws11.ceic.local (cpu=1)
> n-1<3308> ssi:boot:tm: n1 ws10.ceic.local (cpu=1)
> n-1<3308> ssi:boot:tm: n2 ws09.ceic.local (cpu=1)
> n-1<3308> ssi:boot:tm: n3 ws08.ceic.local (cpu=1)
> n-1<3308> ssi:boot:tm: n4 ws07.ceic.local (cpu=1)
> n-1<3308> ssi:boot:tm: n5 ws06.ceic.local (cpu=1)
> n-1<3308> ssi:boot:tm: n6 ws05.ceic.local (cpu=1)
> n-1<3308> ssi:boot:tm: n7 ws04.ceic.local (cpu=1)
> n-1<3308> ssi:boot:tm: n8 ws03.ceic.local (cpu=1)
> n-1<3308> ssi:boot:tm: n9 WS02.ceic.local (cpu=1)
> n-1<3308> ssi:boot:tm: starting RTE procs
> n-1<3308> ssi:boot:base:linear_windowed: starting
> n-1<3308> ssi:boot:base:linear_windowed: no startup protocol
> n-1<3308> ssi:boot:base:linear_windowed: invoking linear
> n-1<3308> ssi:boot:base:linear: starting
> n-1<3308> ssi:boot:base:linear: booting n0 (ws11.ceic.local)
> n-1<3308> ssi:boot:tm: starting recon on (ws11.ceic.local)
> Can't find executable for tkill
> n-1<3308> ssi:boot:base:linear: Failed to boot n0 (ws11.ceic.local)
> n-1<3308> ssi:boot:base:linear: aborted!
> ------------------------------------------------------------------------------
> recon was not able to complete successfully. There can be any number
> of problems that did not allow recon to work properly. You should use
> the "-d" option to recon to get more information about each step that
> recon attempts.
>
> Any error message above may present a more detailed description of the
> actual problem.
>
> Here is a general list of prerequisites that *must* be fulfilled
> before recon can work:
>
> - Each machine in the hostfile must be reachable and
> operational.
> - You must have an account on each machine.
> - You must be able to rsh(1) to the machine (permissions
> are typically set in the user's $HOME/.rhosts file).
>
> *** Sidenote: If you compiled LAM to use a remote shell program
> other than rsh (with the --with-rsh option to ./configure;
> e.g., ssh), or if you set the LAMRSH environment variable
> to an alternate remote shell program, you need to ensure
> that you can execute programs on remote nodes with no
> password. For example:
>
> unix% ssh -x pinky uptime
> 3:09am up 211 day(s), 23:49, 2 users, load average: 0.01, 0.08, 0.10
>
> - The LAM executables must be locatable on each machine, using
> the shell's search path and possibly the LAMHOME environment
> variable.
> - The shell's start-up script must not print anything on standard
> error. You can take advantage of the fact that rsh(1) will
> start the shell non-interactively. The start-up script (such
> as .profile or .cshrc) can exit early in this case, before
> executing many commands relevant only to interactive sessions
> and likely to generate output.
> ------------------------------------------------------------------------------
> n-1<3308> ssi:boot:tm: finalizing
> n-1<3308> ssi:boot: Closing
> Can't find executable for tkill
> ------------------------------------------------------------------------------
> Synopsis: mpirun [options] <app>
> mpirun [options] <where> <program> [<prog args>]
>
> Description: Start an MPI application in LAM/MPI.
>
> Notes:
> [options] Zero or more of the options listed below
> <app> LAM/MPI appschema
> <where> List of LAM nodes and/or CPUs (examples below)
> <program> Must be a LAM/MPI program that either invokes
> MPI_INIT or has exactly one of its children
> invoke MPI_INIT
> <prog args> Optional list of command line arguments to <program>
>
> Options:
> -c <num> Run <num> copies of <program> (same as -np)
> -c2c Use fast library (C2C) mode
> -client <rank> <host>:<port>
> Run IMPI job; connect to the IMPI server <host>
> at port <port> as IMPI client number <rank>
> -D Change current working directory of new processes
> to the directory where the executable resides
> -f Do not open stdio descriptors
> -ger Turn on GER mode
> -h Print this help message
> -l Force line-buffered output
> -lamd Use LAM daemon (LAMD) mode (opposite of -c2c)
> -nger Turn off GER mode
> -np <num> Run <num> copies of <program> (same as -c)
> -nx Don't export LAM_MPI_* environment variables
> -O Universe is homogeneous
> -pty / -npty Use/don't use pseudo terminals when stdout is a tty
> -s <nodeid> Load <program> from node <nodeid>
> -sigs / -nsigs Catch/don't catch signals in MPI application
> -ssi <n> <arg> Set environment variable LAM_MPI_SSI_<n>=<arg>
> -toff Enable tracing with generation initially off
> -ton, -t Enable tracing with generation initially on
> -tv Launch processes under TotalView Debugger
> -v Be verbose
> -w / -nw Wait/don't wait for application to complete
> -wd <dir> Change current working directory of new processes
> to <dir>
> -x <envlist> Export environment vars in <envlist>
>
> Nodes: n<list>, e.g., n0-3,5
> CPUs: c<list>, e.g., c0-3,5
> Extras: h (local node), o (origin node), N (all nodes), C (all CPUs)
>
> Examples: mpirun n0-7 prog1
> Executes "prog1" on nodes 0 through 7.
>
> mpirun -lamd -x FOO=bar,DISPLAY N prog2
> Executes "prog2" on all nodes using the LAMD RPI.
> In the environment of each process, set FOO to the
> value "bar", and set DISPLAY to the current value.
>
> mpirun n0 N prog3
> Run "prog3" on node 0, *and* all nodes. This executes
> *2* copies on n0.
>
> mpirun C prog4 arg1 arg2
> Run "prog4" on each available CPU with command line
> arguments of "arg1" and "arg2". If each node has a
> CPU count of 1, the "C" is equivalent to "N". If at
> least one node has a CPU count greater than 1, LAM
> will run neighboring ranks of MPI_COMM_WORLD on that
> node. For example, if node 0 has a CPU count of 4 and
> node 1 has a CPU count of 2, "prog4" will have
> MPI_COMM_WORLD ranks 0 through 3 on n0, and ranks 4
> and 5 on n1.
>
> mpirun c0 C prog5
> Similar to the "prog3" example above, this runs "prog5"
> on CPU 0 *and* on each available CPU. This executes
> *2* copies on the node where CPU 0 is (i.e., n0).
> This is probably not a useful use of the "C" notation;
> it is only shown here for an example.
>
>
> Defaults: -c2c -w -pty -nger -nsigs
> ------------------------------------------------------------------------------
> ------------------------------------------------------------------------------
> It seems that there is no lamd running on the host ws11.ceic.local.
>
> This indicates that the LAM/MPI runtime environment is not operating.
> The LAM/MPI runtime environment is necessary for the "lamhalt" command.
>
> Please run the "lamboot" command to start the LAM/MPI runtime
> environment. See the LAM/MPI documentation for how to invoke
> "lamboot" across multiple machines.
> ------------------------------------------------------------------------------
> ############################################################################
>
>
>
> _______________________________________________
> This list is archived at http://www.lam-mpi.org/MailArchives/lam/
>

-- 
{+} Jeff Squyres
{+} jsquyres_at_[hidden]
{+} http://www.lam-mpi.org/